notes_train.txt
HighLevelTaskRecommendUpdatesforSelectedEmail: Neural Learning-to-Rank (high-level Python lib) - http://docs.deeppavlov.ai/en/latest/components/neural_ranking.html
HighLevelTaskRecommendUpdatesforSelectedEmail: RankLib: Java impl of 8 LTR Algos: MART, RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART, ListNet, Random Forests
HighLevelTaskRecommendUpdatesforSelectedEmail: Elastic Search Plugin (with Tutorial) to integrate RankLib - https://medium.com/@purbon/learning-to-rank-101-5755f2797a3a
HighLevelTaskRecommendUpdatesforSelectedEmail: https://algorithmia.com Play with different algorithms for free
HighLevelTaskRecommendUpdatesforSelectedEmail:shanirevlon
HighLevelTaskRecommendUpdatesforSelectedEmail:dudu460
HighLevelTaskRecommendUpdatesforSelectedEmail:Given Email Body + Rcpts + Metadata (DateTime ...) --> return a ranked list of updates from Collage that are filtered by the extracted topics from selected email.
HighLevelTaskRecommendUpdatesforSelectedEmail:"Gremlin"
Gremlin: findByEmail - Get the collageUserId of user with a specific email
Gremlin:g.V().has('user', 'email', '[email protected]').values('userId')
Gremlin: Top Topics for Yehonathan.
Gremlin: Get vertices with label: 'actor' (not user) that also has a property 'name': 'Yehonathan Sharvit' (can be also 'email')
Gremlin: See has(label, key, value) in http://tinkerpop.apache.org/docs/current/reference/#has-step
Gremlin: filter only mail actors (actors can participate also in SP)
Gremlin:g.V()
Gremlin:.has('system', 'mail')
Gremlin:.outE('performed')
Gremlin:.inV()
Gremlin:.out('about')
Gremlin:.not(has('bad', true))
Gremlin:.as('t')
Gremlin:.groupCount().by(select('t').values('topicId'))
Gremlin:.unfold()
Gremlin:.order().by(values, decr)
Gremlin:.limit(10)
Gremlin: Does property exist ?
Gremlin:g.V().hasLabel('artifact').properties().hasKey('textForTermsExtraction')
Gremlin: within - or clause for has (or query)
Gremlin: choose - conditional ($cond, if)
Gremlin:g.V().hasLabel('person').
Gremlin:choose(<predicate traversal>,
Gremlin:__.in(),
Gremlin:__.out()).values('name')
Gremlin: optional - g.V('vadas').optional(out('knows'))
Gremlin:returns the result of the specified traversal (if vadas has outgoing edge knows - return the target node, else return vadas)
Gremlin: "D:\Collage\gremlin-console\bin\gremlin.bat"
Gremlin: Note: Had to place full path to my java, since it is not installed properly - not related to gremlin console
Gremlin: Connect to localhost:default-port gremlin server
Gremlin::remote connect tinkerpop.server conf/remote.yaml session
Gremlin:
Gremlin: __.out('contains') vs. .out('contains') - the __.out can be used as a static func - without object context
Gremlin: Ex: .coalesce(
Gremlin:__.out('about').has('topic', 'topicId', topicId),
Gremlin:....
Gremlin: gremlin.js example
Gremlin:node graph\gremlin.js --database desktop-eih0hb2 --query "g.V().has('topicId', 'docusign').in('artAbout').not(has('isMarketing', true)).as('art').inE('performed').has('date', gte(1533070800000)).has('date', lte(1535749140000)).outV().not(has('isAutomated', true)).select('art').dedup().values('artifactId')"
Gremlin:node ..\..\graph\gremlin.js --database desktop-eih0hb2 --query "g.V().has('topicId', 'docusign').in('artAbout').not(has('isMarketing', true)).as('art').inE('performed').has('date', gte(1533070800000)).has('date', lte(1535749140000)).outV().not(has('isAutomated', true)).has('email', '[email protected]').select('art').dedup().count()"
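The two gremlin.js invocations above differ only in the date range, an optional sender-email filter, and the terminal step. A small helper can assemble these query strings; `buildTopicUpdatesQuery` is a hypothetical name (not part of the repo), sketched to mirror the examples above:

```javascript
// Hypothetical helper that builds the query strings passed to
// graph/gremlin.js, following the shape of the two examples above.
function buildTopicUpdatesQuery({ topicId, fromMs, toMs, email, terminal }) {
  let q = `g.V().has('topicId', '${topicId}')` +
          `.in('artAbout').not(has('isMarketing', true)).as('art')` +
          `.inE('performed')` +
          `.has('date', gte(${fromMs})).has('date', lte(${toMs}))` +
          `.outV().not(has('isAutomated', true))`;
  if (email) q += `.has('email', '${email}')`;     // restrict to one actor
  return q + `.select('art').dedup().${terminal}`; // e.g. values('artifactId') or count()
}
```

Usage: pass `terminal: "values('artifactId')"` for the artifact list, or `terminal: 'count()'` with an `email` for the per-sender count.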
Gremlin:"Graph"
Gremlin:--------------------------------
Gremlin: artifact is a single email message with in-edges from conversation
Gremlin: Every email thread has a view for each user: not all users see the same artifacts in a single conversation, so every user has their own conversation with outgoing 'contains' edges to artifacts
Gremlin: artifact has out edges to 'topic'
Gremlin: Each artifact
Gremlin:"Email Topics" "Drilldown"
EmailTopicsDrilldown: About edge.isPerson = true for this Email --> Tal topic node.
TopTopicsSprintStories: Roadmap:
TopTopicsSprintStories: Data
TopTopicsSprintStories:: Free edition: How we get data from different orgs / users for analysis and testing ?
TopTopicsSprintStories: Free edition
TopTopicsSprintStories: Privacy: collecting emails.
TopTopicsSprintStories: Reflect potential customers (from another Org). Risk: dealing with a small number of harmon.ie management staff emails may not reflect many types of customers.
TopTopicsSprintStories: Testsets: Construct Expected Topics, Expected Signatures, Expected Duplicates - so we can improve our algorithms and control changes.
TopTopicsSprintStories: Research:
TopTopicsSprintStories: Stream of New Reports and Analytics to explore new research questions, insights and intuitions
TopTopicsSprintStories: Machine Learning - to tune weights of formulas.
TopTopicsSprintStories: We do not want to accumulate large backlog of many months
TopTopicsSprintStories: Product Topic Logic should be factored so that Topic experiments are quick (no docker container restarts, debugger available, no re-index with new update-fetching when logic changes, quick data import from json and mongo)
TopTopicsSprintStories: UI support for Topics (2 months Algo + 2 month backend + UI)
TopTopicsSprintStories: General: We have 3 Main goals in this section
TopTopicsSprintStories: Assist the user in easily and quickly workaround limitations of current Topics collection and ranking (Add / Remove from Top-Topics)
TopTopicsSprintStories: Collect User Feedback to use with other users/orgs --> make Collage more intelligent and accurate.
TopTopicsSprintStories: Group Co-occurring Topics in Algo+UI:
TopTopicsSprintStories: Ex: Ram email thread: subject:RE: [S201804020253445112] looking for these APIs: File, CompressedFile , AddinCommands for Excel. from:[email protected]
TopTopicsSprintStories: This is an important thread for ramt --> good for Top Topics
TopTopicsSprintStories: Ex: Subject:RE: RSPB as a Case Study for Harmon.ie --> both RSPB and Case Study are top10 (davidl) --> but they are not grouped together
TopTopicsSprintStories: Ex: Managed access to Microsoft Graph in Microsoft Azure Preview --> extract 'Managed access' and 'Microsoft Azure Preview'
TopTopicsSprintStories:Setup Managed access for Microsoft Graph in Microsoft Azure Preview step --> 'Setup Managed access' --> 'Managed access'.count -= 1
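The `'Managed access'.count -= 1` adjustment above can be sketched as follows; `adjustNestedCounts` is a hypothetical name, and using substring containment as the nesting test is a simplifying assumption:

```javascript
// Sketch: when a longer extracted term contains a shorter one
// ('Setup Managed access' contains 'Managed access'), subtract the
// longer term's occurrences from the shorter term's count so the same
// text span is not counted twice.
function adjustNestedCounts(counts) {
  const out = { ...counts };
  for (const longer of Object.keys(counts))
    for (const shorter of Object.keys(counts))
      if (longer !== shorter && longer.includes(shorter))
        out[shorter] -= counts[longer];
  return out;
}
```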
TopTopicsSprintStories: Ex: ADF, Group of users, PAM approval, Ravenwood CopyActivity RunId (Noams report Aug 1 - 31) ranked 2, 7, 8, 9
TopTopicsSprintStories: excel, compressedfile, addincommands, apis --> top of Ram Mar-Apr
TopTopicsSprintStories: Duplicated subject (above)
TopTopicsSprintStories: It will take 3 out of 5 places in Top Topics.
TopTopicsSprintStories: Display top topic: Excel without related compressedfile and addincommands, apis doesn't help the user
TopTopicsSprintStories: Calculate Co-Occur (as part of related topics) and Display them in a group in UI.
TopTopicsSprintStories: This gives the user context for Excel
TopTopicsSprintStories: Group Parent (Collage) with its Top Children (Collage Demo, Collage Design)
TopTopicsSprintStories: Display Topic Context
TopTopicsSprintStories: Problem: terms-processing 'Excel authentication' --> 'Excel', because 'authentication' is not a topic --> Excel becomes a too-general Topic that occurs in 2-3 contexts
TopTopicsSprintStories: Store +-3 words around each extracted Term to display in a Hover + DrillDown.
TopTopicsSprintStories: More relevant than per-email, because user has a small number of Top Topics (+ it is hard to rank them)
TopTopicsSprintStories: Note: 'Display Topic Context' feature above - helps user decide quickly if / how bad is this topic
TopTopicsSprintStories: Only Remove from Top Topics (keep as a Topic)
TopTopicsSprintStories: Remove from-sender from TopTopics (ex: [email protected])
TopTopicsSprintStories: Display several other Topics contributed by this sender (First to Top Topics) to help decide if want to remove it.
TopTopicsSprintStories: Suggest/Auto remove certain from-senders -->
TopTopicsSprintStories: Based on other user's feedback that removed this sender.
TopTopicsSprintStories: if they contribute too many top-topics which have a high count but are penalized by our rank (ex: common words, ignored topics, duplicated topics)
TopTopicsSprintStories: Note: They are NOT isMarketingEmail ([email protected]) but we can detect certain factors (mailing-list, repeating duplicate patterns, dictionary of 'support','sales','marketing' ... in their title/email) that make them suspicious
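A minimal sketch of the suspicious-sender factors listed above (mailing-list flag, role words in the title/email); the field names, keyword list, and the or-rule are illustrative assumptions, not the product logic:

```javascript
// Heuristic sketch: flag a sender as suspicious if role words appear in
// the address or title, or if it looks like a mailing list.
const ROLE_WORDS = /\b(support|sales|marketing|noreply|no-reply)\b/i;

function isSuspiciousSender(sender) {
  return ROLE_WORDS.test(sender.email) ||
         ROLE_WORDS.test(sender.title || '') ||
         sender.isMailingList === true;
}
```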
TopTopicsSprintStories: Motivation: Since we will not be 100% in 5-10 Top-Topics rank --> but there will be more real Top-Topics in rank 10-20. Allow user to select / add new Topics to their Top Topics.
TopTopicsSprintStories: Explore / Browse the ranked Top-Topics from rank 5-20
TopTopicsSprintStories: These rank 5-20 topics may be displayed in a 'Tag Cloud' at the bottom (more discoverability, but also noise), or displayed only in the Add-Topic UI
TopTopicsSprintStories: Search Topics (similar to Old Collage), using auto-complete + ranking --> from all extracted Topics
TopTopicsSprintStories: Add New Topic (personal Dict)
TopTopicsSprintStories: Ranking (3-4 months):
TopTopicsSprintStories: One of the short Term Goals: Filter out 'General Terms' (sharepoint, outlook)
TopTopicsSprintStories: Solutions: TFIDF, NLP-Compound, 'Terms Processing' - see 'Average specificity'
TopTopicsSprintStories: TFIDF and stat based methods (2 months)
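The TFIDF idea above, sketched minimally: a term that appears in (nearly) every document org-wide gets an idf near 0, demoting general terms like 'sharepoint' or 'outlook' regardless of their per-user counts. The +1 smoothing is an illustrative choice:

```javascript
// Sketch (not the product code): TF-IDF score for a topic in one
// user's corpus, using org-wide document counts for idf.
function tfidf(termCountInUser, totalTermsInUser, docsWithTerm, totalDocs) {
  const tf = termCountInUser / totalTermsInUser;
  const idf = Math.log((1 + totalDocs) / (1 + docsWithTerm));
  return tf * idf;
}
```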
TopTopicsSprintStories: User Feedback (1 month for algo only - there is also backend-collection and UI)
TopTopicsSprintStories:Problem: There are few filter/search-by-Topic UserSelections (10, 25)
TopTopicsSprintStories: Bootstrap: Nobody uses Collage --> No feedback
TopTopicsSprintStories: Conc: Cannot rely on feedback in Collage UI to help ranking for new Orgs (at least until usage is high enough)
TopTopicsSprintStories: Conc: Still must let user Ignore/Remove from Top-Topics and log Topic Click/Search for later stats - when we can aggregate Affinity users
TopTopicsSprintStories: Ignored Topics
TopTopicsSprintStories: There are ~150 ignoredTopics per management user - could it be because of Demos ?
TopTopicsSprintStories: Add to Graph
TopTopicsSprintStories: Note: Already used in mongo with isTopic = true, but only for this user ignoredTopics - not for other Affinity users
TopTopicsSprintStories: Topic Click/Search - Topic Filter counting + timestamp
TopTopicsSprintStories: Affinity
TopTopicsSprintStories: Add to Graph
TopTopicsSprintStories: Structural - Subject + First part of email
TopTopicsSprintStories: Score inSubject (boolean) + function of closer to start + agg across all updates with this Topic
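The structural score above might look like this: subject occurrences get a fixed boost, body occurrences decay with distance from the start. The constants 2 and 500 are illustrative assumptions:

```javascript
// Sketch: score one topic occurrence by where it appears.
// Per-topic aggregation across updates would sum these scores.
function structuralScore(occ) {
  if (occ.inSubject) return 2;       // subject boost (illustrative)
  return 1 / (1 + occ.offset / 500); // decays with character offset
}
```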
TopTopicsSprintStories: Dictionary Topics - rank higher SP / CRM / Dict (1 week)
TopTopicsSprintStories: Noise Reduction / Cleanup (3 months so far - at least 4.5 months in total)
TopTopicsSprintStories: myContacts stats are per email address --> meaning emailStats are split between all person email addresses.
TopTopicsSprintStories: 'Urls, Files and SP Urls'
TopTopicsSprintStories: Q: Are Titles Top Topics / Drilldown Topics ? Many of the Topics are Director of/VP Marketing/Chief Economist of Wells Fargo/Chairman of Supervisory Board/Head of XXX
TopTopicsSprintStories: Investigate: Count How many times titles appear outside Signature (that we filter out anyway)
TopTopicsSprintStories: Regular expression (after removeSignature) in terms for VP/Director/Senior/...
TopTopicsSprintStories: Note: If we know how to detect Titles --> use it also in signature
TopTopicsSprintStories: PER invalidate Topics (2 weeks - advanced prototype)
TopTopicsSprintStories: Duplicate (1 week)
TopTopicsSprintStories: See 'Remove Duplicates'
TopTopicsSprintStories: Dedup subject, but instead add a boost for subject topics in general + boost if replied a lot of times to emails with this subject.
TopTopicsSprintStories: isMarketingEmail (2 weeks - advanced prototype)
TopTopicsSprintStories: Run without/partial manual filter
TopTopicsSprintStories: Signatures (or non-Topics regex if difficult)
TopTopicsSprintStories: Improve (1 week)
TopTopicsSprintStories: Admin non-Topics dictionary (3 days)
TopTopicsSprintStories: harmon.ie can clean up a new early adopter's Noise and too-General topics (if we decide)
TopTopicsSprintStories: isFocused = true
TopTopicsSprintStories: LOC ?
TopTopicsSprintStories: Terms Processing NLP - PROPN (2 months)
TopTopicsSprintStories: Compound - 'City of Brampton', 'Migration to SharePoint', 'SharePoint authentication' (1 week for fixes, 1.5 months for Terms Processing change)
TopTopicsSprintStories: Bug: Doesn't join bi-gram PROPN+NOUN (even if they are clearly a Topic using Mutual-Information measures + have other occur as PROPN+PROPN)
TopTopicsSprintStories: See ATE / ATR methods below
TopTopicsSprintStories: Currently terms-processing doesn't join those --> only X and Y (X's Y) --> sometimes extracts 'City' --> which is a non-topic
TopTopicsSprintStories: ATE / ATR methods (AutoPhrase)
TopTopicsSprintStories: Filter out non-words better
TopTopicsSprintStories: s201804020253445112
TopTopicsSprintStories: ORG > LOC: We currently filter LOC (SNER: Bradford is LOC but 'City of Brampton' is an ORG - SF). Currently Dict (uses tokens) > LOC filter
TopTopicsSprintStories: Google NER, Investigated other NER systems.
TopTopicsSprintStories: Count
TopTopicsSprintStories: For Top-Topics: Count each update if contained in conversation that has a reply from last 14 days (even if latest reply doesn't mention the topic)
TopTopicsSprintStories: For Drilldown (Top Topics already have high counts)
TopTopicsSprintStories:: Depends on Similarity
TopTopicsSprintStories: Normalization: and <--> &, ltd, co.
TopTopicsSprintStories: Relax exact matching rules:
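A sketch of the normalization above ('and' <--> '&', trailing 'ltd'/'co.'): map spelling variants to one canonical key so their counts can be unified. The rules shown are illustrative, not exhaustive:

```javascript
// Sketch: canonicalize a topic name for count unification.
function normalizeTopic(name) {
  return name
    .toLowerCase()
    .replace(/&/g, 'and')                 // 'A & B' and 'A and B' unify
    .replace(/\b(ltd|co|inc)\.?\s*$/, '') // strip trailing company suffix
    .replace(/\s+/g, ' ')
    .trim();
}
```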
TopTopicsSprintStories: Structured Topics Sources (SP/SF): (2 months)
TopTopicsSprintStories: Dict (or certain Dict type) --> rank higher
TopTopicsSprintStories: SalesForce (SF) - rank higher
TopTopicsSprintStories: SharePoint
TopTopicsSprintStories: /used feed
TopTopicsSprintStories: Problem: Current Follow gets Yaacov all document updates in Sales/Accounts - even if he doesn't care about them --> much less targeted (relevant) than Email in Inbox
TopTopicsSprintStories: Admin UI: Choose Termsets to consider (RSPB bird society) on newUpdates
TopTopicsSprintStories: Admin UI: Import Termset
TopTopicsSprintStories: CRM connector
TopTopicsSprintStories: ADF Linked Service (Copy Activity) - supports SF,Dynamics, ZenDesk ?
TopTopicsSprintStories: Domain mapping - add quality topic we do not have today
TopTopicsSprintStories: P2: Top Topics: SF API - Accounts (topics) user changed recently
TopTopicsSprintStories: Could bypass Stanford PROPN mistakes on company names (many?) - current impl doesn't detect non-PROPN terms
TopTopicsSprintStories: People will not write in email the full long-form of a multi-word topic.
TopTopicsSprintStories: Unify counts of 'ms office' and 'microsoft office' --> better top topics, related topics, LM counts
TopTopicsSprintStories:"Q/A Report - Top Topics"
Q/AReport-TopTopics: Measure Progress at a high level (Top Topics No. 6/10)
Q/AReport-TopTopics: Next: CI
Q/AReport-TopTopics: Diffs, Traceability
Q/AReport-TopTopics: Next: Distribute Reports to Users to collect feedback on expected: top, good and bad topics
Q/AReport-TopTopics:Requirements
Q/AReport-TopTopics:- - - - - - -
Q/AReport-TopTopics: Several users (select)
Q/AReport-TopTopics: Several Date-Ranges (select)
Q/AReport-TopTopics: Summary:
Q/AReport-TopTopics: Preserve analysis comments between reports (copy from prev report)
Q/AReport-TopTopics: Drilldown from summary to details
Q/AReport-TopTopics: Today: Search in editor
Q/AReport-TopTopics: Details include individual algorithms output (ex: duplicate pair)
Q/AReport-TopTopics: Do not rely on console.log at the middle of algorithms, only collection of diag json (duplicate pair) at report.js
Q/AReport-TopTopics: Keep reports history
Q/AReport-TopTopics: Where? git or mongo ?
Q/AReport-TopTopics: Diff of Ranking, factors, individual algorithms output (ex: duplicate changes)
Q/AReport-TopTopics: Summary Scores for Ranking (Average Ranking Measures)
Q/AReport-TopTopics: Expected Good vs. Bad Topic --> Per Org
Q/AReport-TopTopics: Trace changes back to algo code.
Q/AReport-TopTopics: Problem: Dev run report (for regression) on uncommitted changes --> no commit marker
Q/AReport-TopTopics:Reports Impl
Q/AReport-TopTopics: Save Algo reports to calc paths (user+date-range)
Q/AReport-TopTopics: console.log dup,sig --> replace with output collectors
Q/AReport-TopTopics: updates.json (structured + readable where possible)
Q/AReport-TopTopics: Problem: [mongo-storage] - how to capture the query + connecting console.log (start of report)
Q/AReport-TopTopics: Output Summary.txt + all other files
Q/AReport-TopTopics: Serialize line-simple-json (sort by key name - stable diff)
Q/AReport-TopTopics: Per Topic: We already have that in arrTopTopics (except comment): count, rank, factors, comment, jArt - updates array
Q/AReport-TopTopics: Validation: updates.json must include every updateId occurring in the report
Q/AReport-TopTopics: expected: top/good/bad --> Add to summary
Q/AReport-TopTopics: report --commit
Q/AReport-TopTopics: Note: Used from CI (even if there are no changes --> to mark that a certain report status is linked to latest build commit)
Q/AReport-TopTopics:: Deserialize Summary.txt - getFactorsFromText: lm:badTopic --> factors.lm.score < 0 ?
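The 'line-simple-json' serialization above might look like this: each topic on a single JSON line with keys sorted alphabetically, so textual diffs between report versions stay stable even when key insertion order changes:

```javascript
// Sketch: serialize one topic record as a stable single-line JSON.
function toLineJson(obj) {
  const sorted = {};
  for (const k of Object.keys(obj).sort()) sorted[k] = obj[k];
  return JSON.stringify(sorted);
}
```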
Q/AReport-TopTopics:- - - - - - - -
Q/AReport-TopTopics: Bug: Summary report (or topics ranking) doesn't use a stable sort, so a group of topics with the same rank can change positions between runs
Q/AReport-TopTopics: Ex: (davidl, vegas rank : 11) --> change pos without changing rank --> moved : 15 but no diff in report.
Q/AReport-TopTopics: Implement 'Diff Flow' algo to not display moved for topics that were pushed down by a single topic that is now ranked 1st.
Q/AReport-TopTopics: Added moved to metaData
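One way to fix the stable-sort bug above is a deterministic tiebreaker, so equal-rank topics (like the vegas example) cannot swap positions between otherwise-identical reports; breaking ties by topicId is an illustrative choice:

```javascript
// Sketch: order topics by rank ascending, breaking rank ties by topicId
// so the summary ordering is fully deterministic.
function sortTopics(topics) {
  return [...topics].sort((a, b) =>
    a.rank - b.rank || a.topicId.localeCompare(b.topicId));
}
```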
Q/AReport-TopTopics: Expected Backend:
Q/AReport-TopTopics: Central file/db for Expected:
Q/AReport-TopTopics: Problem: If a topic is deleted --> its Expected is also deleted
Q/AReport-TopTopics: Note: When a new report is added --> It is good to have Expected defaults, based on central db.
Q/AReport-TopTopics: Expected UI: 6/10 in Header (Summary + Summary Diff)
Q/AReport-TopTopics: Green/Yellow/Red small markers for Expected that are not in their proper rank
Q/AReport-TopTopics: badTopic in rank 2 --> red
Q/AReport-TopTopics: topTopic rank > 10 --> yellow
Q/AReport-TopTopics: goodTopic (but not topTopic) rank <= 10 --> yellow
Q/AReport-TopTopics: PreProcessing Algo output
Q/AReport-TopTopics: Problem: terms-processing code changed (ex: compound 'City of Brampton')--> 'City' deleted and 'City of Brampton' added in yc top report
Q/AReport-TopTopics:Q: How to trace back to the terms-processing code change ?
Q/AReport-TopTopics: When running new terms-processing --> generate the newDifferentTerms report --> commit it with the code changes
Q/AReport-TopTopics: UI: Each Topic is a link + Each factor is a link
Q/AReport-TopTopics: Q: Drilldown to Factors algorithms output: How to locate duplicates output of a specific topic ?
Q/AReport-TopTopics: Same for sigs
Q/AReport-TopTopics: Details:
Q/AReport-TopTopics: Details for Deleted --> if topic.deleted --> use old SummaryReportId in the call to /detailed
Q/AReport-TopTopics:Highlight finish:
Q/AReport-TopTopics: Bug: Text Search for a string enron, the Terms, plus a lot of emails @enron ...
Q/AReport-TopTopics: Details of Topic + Children (checkbox) - if checked fetches all updates of topic (linkedin) + all its children.
Q/AReport-TopTopics: Highlight should work
Q/AReport-TopTopics: Add From: and optionally more metadata. (click to expand)
Q/AReport-TopTopics:: Bug: Need back twice to return from Details to Summary
Q/AReport-TopTopics: Q: How to render factors ? Simple text for now, but ...
Q/AReport-TopTopics: Q: How outlook.count is 19 but it has 15 detailed artifacts ?
Q/AReport-TopTopics:A: Because sig/per/automated prevent topic.artifacts.push
Q/AReport-TopTopics: Attribution/Traceability of Summary/SummaryDiff to Code changes
Q/AReport-TopTopics: Q: How to insert a comment into Summary that describes the changes ?
Q/AReport-TopTopics: Problem: Changes may be spread across many commits in many repos (sig)
Q/AReport-TopTopics: Cont 1: Problem: Few other fixes that are considered minor --> meaning no new /report:/ comment --> Summary report changes (regression) but still with *same* comment
Q/AReport-TopTopics: Displays same report comment + number of commits since --> if hover comment --> ToolTip with all commit messages since the last /report:/ - inclusive.
Q/AReport-TopTopics: Summary Diff:
Q/AReport-TopTopics: Diff of the 2 commits of reports (Possibly with integration to Git webviewer)
Q/AReport-TopTopics: react-diff-view - renders WebUI for git diff output: https://www.npmjs.com/package/react-diff-view
Q/AReport-TopTopics: Datetime generated, date-range, org + user, change comment (ex: fix sig detector + sha1)
Q/AReport-TopTopics: Aggregate all attrs in a totals line (diff on sum of all counts, sum of dups)
Q/AReport-TopTopics: Only displays the number of deleted/added (at the top) and not which ones were added/deleted - which is important.
Q/AReport-TopTopics: Which diff is it (3 last path components ramt/Apr-18../Summary.txt or commit comments+hash )
Q/AReport-TopTopics: Return in json and display
Q/AReport-TopTopics: Summary View (not Diff): Provide Title in presentation .json
Q/AReport-TopTopics: Stats: At the Top Summary: 2 moved, 3 added, 1 deleted
Q/AReport-TopTopics: metaData : { topicId : { ownCol : true }, rank: .. }
Q/AReport-TopTopics: Diff selection UI:
Q/AReport-TopTopics: Reports menu: Select a user+dateRange --> Display git log
Q/AReport-TopTopics: Mark important commits - by comments (report: or major:)
Q/AReport-TopTopics: Future: Allow Drop only 1 and select the other from reports menu (D&D from reports list ?)
Q/AReport-TopTopics: Dashboard: Aggregate stats from all / group of users
Q/AReport-TopTopics: History Performance Chart
Q/AReport-TopTopics: Q: How to manage the data collected from all Summary stats + expected K/10 ?
Q/AReport-TopTopics: A: Collect it to a history.json file with sha1 + timestamp + all summary stats / expected collected + copy from expected.json the topics and their values
Q/AReport-TopTopics: Problem: Mongo vs. Git: If we saved all summaries in mongo (Key: user+dateRange+sha1 of commit), wouldn't it be a simple query (avoiding duplication in history.json) ?
Q/AReport-TopTopics: Q: What if expected.json is changed ?
Q/AReport-TopTopics: Ex: Add expected topTopic to a term prev had no expected.
Q/AReport-TopTopics: Delete current history.json, replacing it (mutable) with new stats or keep old data for reference (to track changes) ?
Q/AReport-TopTopics: edit expected.json --> commit it --> new sha1 --> new history.json version with special comment (rebuild following expected changes) -->
Q/AReport-TopTopics: Q: What if base data is changed ? Adding emails, change dateRange, add/remove users (Ex: enron)
Q/AReport-TopTopics: Note: Bizportal stock sites do not show history charts beyond a major change (Investment Policy changes ...)
Q/AReport-TopTopics: Multiple Aggregations (queries in mongo stats)--> Multiple History Charts
Q/AReport-TopTopics: First: Name the dataset (harmonie management) and chart it
Q/AReport-TopTopics: When adding enron: Start a new config (Name: harmonie + enron) and start charting it in a new graph.
Q/AReport-TopTopics: Chart comments - If something is changed and we need to continue with this Chart --> Allow rendering comments (maybe from report: commit comments)
Q/AReport-TopTopics: UI:
Q/AReport-TopTopics: Checkbox for each Users + Groups + All Selection (server return data from config.users + userGroups)
Q/AReport-TopTopics: Checkbox for each dateRange
Q/AReport-TopTopics: Impl:
Q/AReport-TopTopics: Save stats at the bottom of each Summary (below the ----- separator, along with topicMetaData)
Q/AReport-TopTopics: ReportsStorage.findStats(user/grp/all, dateRange) - iterates all users and dateRanges (may include several ranges) summary
Q/AReport-TopTopics: Study WebUI
Q/AReport-TopTopics: Table Component with Cell Editing, Filtering, Customizations ...
Q/AReport-TopTopics: Example React+Express App rendering tables and search interface
Q/AReport-TopTopics:https://github.com/fullstackreact/food-lookup-demo/blob/master/client/src/FoodSearch.js
Q/AReport-TopTopics: ReactRouter with back button + params: https://reacttraining.com/react-router/web/example/url-params
Q/AReport-TopTopics: Q: Share Topic Attributes Metadata to allow generic rendering ?
Q/AReport-TopTopics: https://caolan.org/posts/writing_for_node_and_the_browser.html
Q/AReport-TopTopics:"Summary Diff"
Q/AReport-TopTopics:- - - - - - - - -
Q/AReport-TopTopics: Display new Summary + Diff annotations per topic (pos: 72 -> 59)
Q/AReport-TopTopics: Ex: sharepoint rank: 28 count: 19 factors: { fromMe: 11 automated: 2 (+1) reFilt: 2} comment: ranking(tf-idf)
Q/AReport-TopTopics: Problem: Compact json format --> sig is missing if 0 --> cannot differentiate between new-factor and current-zero-factor (sig)
Q/AReport-TopTopics: Keep old topic fields after summary separator (update code to stop parsing for topics at ------------- )
Q/AReport-TopTopics: Impl below moved (diff in position) --> annotate 'moved 14 --> 1'
Q/AReport-TopTopics: Metadata field added/removed
Q/AReport-TopTopics: Added new factor or topic.comments --> ignore in diff if doesn't appear in both (but if rank is different - mention it)
Q/AReport-TopTopics: Impl: Store currentMetadata at the end of summary as a json island. When diff--> read it as oldTopicsMeta --> now we know which
Q/AReport-TopTopics:- - - - - - - - - - - - - - - - - - - - - - - - - -
Q/AReport-TopTopics:- - - - - - - -
Q/AReport-TopTopics:old
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 sharepoint
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:3 office365
Q/AReport-TopTopics:new
Q/AReport-TopTopics:0 url
Q/AReport-TopTopics:1 sharepoint
Q/AReport-TopTopics:2 owa
Q/AReport-TopTopics:0 != 0 sharepoint != url : url 2->0
Q/AReport-TopTopics:0 == 1 sharepoint == sharepoint :
Q/AReport-TopTopics:Complicated 2 moves
Q/AReport-TopTopics:- - - - - - - - - - - -
Q/AReport-TopTopics:old
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 sharepoint
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:3 office365
Q/AReport-TopTopics:new
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 url
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:2 sharepoint
Q/AReport-TopTopics:Legend: <old idx> !=/== <new idx>
Q/AReport-TopTopics:0 != 0 sharepoint != url --> edit is not an option --> insert (or moved) url in new --> lookup url in oldTopics --> moved: 2->0
Q/AReport-TopTopics: Note: if not found in oldTopics --> added
Q/AReport-TopTopics:0 != 1 sharepoint != owa --> insert (or moved) owa in new --> lookup owa in oldTopics --> didn't move (1-->1) --> no pos annotation for owa
Q/AReport-TopTopics:0 == 2 sharepoint == sharepoint --> moved 0->2
Q/AReport-TopTopics:owa,url - remaining suffix in old --> lookup in newTopics --> not found --> deleted, else do nothing (added and moved already handled above).
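The walk above, as a sketch (names hypothetical): each topic in the new ranking is looked up in the old one to classify it as added or moved; old topics absent from the new list are deleted; topics at the same index in both lists get no annotation:

```javascript
// Sketch of the ranking diff walk described in the notes above.
function diffRanking(oldTopics, newTopics) {
  const oldIdx = new Map(oldTopics.map((t, i) => [t, i]));
  const newSet = new Set(newTopics);
  const ann = {};
  newTopics.forEach((t, i) => {
    if (!oldIdx.has(t)) ann[t] = 'added';
    else if (oldIdx.get(t) !== i) ann[t] = `moved ${oldIdx.get(t)}->${i}`;
    // same index in both lists -> no annotation
  });
  for (const t of oldTopics) if (!newSet.has(t)) ann[t] = 'deleted';
  return ann;
}
```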
Q/AReport-TopTopics:"DONE Q/A Report"
Q/AReport-TopTopics: Convert childTopics to topicId only, before writing to Detailed.json
Q/AReport-TopTopics: Reports menu
Q/AReport-TopTopics: On click report --> React Router /summary with reportId (same as for /details) --> summary parses the router params --> jsonFetch
Q/AReport-TopTopics: Click to Diff-Prev
Q/AReport-TopTopics: /api/reports --> Recurse fs under 'ReportsRoot' to discover all reports ramt/<DateRange>/Summary.txt
Q/AReport-TopTopics:Bug: David --dontUseDuplicateSubject --> changed dup but doesn't render dup n1-->n2
Q/AReport-TopTopics:Bug: Highlighter doesn't do 'FSCP 2018' --> FSCP%202018
Q/AReport-TopTopics: Copy expected from prevSummary
Q/AReport-TopTopics: highlight diffed attrs (mark in json to highlight a property)
Q/AReport-TopTopics: pos info
Q/AReport-TopTopics: Highlight topic in detailed email (findTopicInText)
Q/AReport-TopTopics: Save jArt.about.text in Detailed.json - used by findTopicInText
Q/AReport-TopTopics: respond with details update.json read from mongo based on joined updatesIds
Q/AReport-TopTopics: use collageUserId (from config) + updateId (otherwise duplicated updateId for several users)
Q/AReport-TopTopics: Convert text concat inside <td> to an array of <span or <a>
Q/AReport-TopTopics: onClick (only <a>):
Q/AReport-TopTopics:<a href="#link" onClick={(e) => this.handleSort(e, 'myParam')}>
Q/AReport-TopTopics:handleSort = (e, param) => {
Q/AReport-TopTopics:  e.preventDefault();
Q/AReport-TopTopics:  console.log('Sorting by: ' + param);
Q/AReport-TopTopics:}
Q/AReport-TopTopics: Q: Dynamically create a react onClick event to a local function (with the reportPartId, topic,key as parameters)
Q/AReport-TopTopics: Nav (Router) to a DetailedComponent --> fetch emails from server updates.json (Params: reportPartId, topic[key])
Q/AReport-TopTopics: Round float to 2 digits
Q/AReport-TopTopics:A: Only rank is a float
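Q/AReport-TopTopics: A minimal way to round the rank (the only float) to 2 digits while keeping it numeric - a sketch, since toFixed would return a string:

```javascript
// Round a float to 2 digits, keeping a Number (toFixed returns a string).
const round2 = (x) => Math.round(x * 100) / 100;
```

e.g. round2(3.14159) gives 3.14.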
Q/AReport-TopTopics: Remove comments (from diff UI) if null (delete it in getSummaryView if null)
Q/AReport-TopTopics: A: webviewer
Q/AReport-TopTopics: Note: Textual diff with git default diff util (we can also customize it in the future)
Q/AReport-TopTopics: Cmdline (shortcut icon + explorer shell menu)
Q/AReport-TopTopics: calls git log --> display selection list to specify 2 reports versions (opaque: sha1) --> output txt or csv (file name - containing the 2 reports user+dates+change_comment)--> editor refresh
Q/AReport-TopTopics: If passed 2 commit hashes - no need to display
Q/AReport-TopTopics:    Specify a single report identifier --> compare latest report of this user to this report (unless the latest report was incorrectly specified)
Q/AReport-TopTopics: summaryDiffRoute returns presentation JSON (with metadata)
Q/AReport-TopTopics: Problems of textual Diff
Q/AReport-TopTopics: Solution Alt: Textual Report on which have moved/added/deleted + their factors diff
Q/AReport-TopTopics:    Problem: New topic reaches no. 1 --> actual diff is small (ex: bod moved from 15->1) --> all others are affected -1 in rankpos --> large noisy diff report ?
Q/AReport-TopTopics: Diff Flow algo below
Q/AReport-TopTopics: Added/Deleted
Q/AReport-TopTopics: Problem: It doesn't help analysis if a topic is deleted and we cannot see its new (low ranking) factors. Same for added: We want to review its prev factors
Q/AReport-TopTopics: topTopics cutoff at 100 --> 200
Q/AReport-TopTopics: Moved Ranking Measure - for all topics in a summary.
Q/AReport-TopTopics:      Q: Do we need Expected good/bad for this to work ?
Q/AReport-TopTopics: Q: If a good topic (ex: bod) moved from 15->1 --> Inc goodness ranking measure -->
Q/AReport-TopTopics: Problem: The current 10 toptopics (say all are good) were all moved down a pos --> is it a penalty to ranking measure ?
Q/AReport-TopTopics: Analysis Comments
Q/AReport-TopTopics: A: Simple: Edit Summary.txt and commit
Q/AReport-TopTopics: A: During report generation - reports.js calls git package to extract the prev comments and add
Q/AReport-TopTopics: Deserialize Summary.txt
Q/AReport-TopTopics:    Goal: Parse an array of nested json objects from the middle of a text file. Each obj is one line
Q/AReport-TopTopics: Bug: addFldsCommas: incorrectly adds a comma before first fld in the nested factors : { ,fromMe: 1, }
Q/AReport-TopTopics: pass regex to start line (or start after line ------- ) --> lib parses Summary.txt and decides where to Start
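Q/AReport-TopTopics: A rough sketch of the deserializer described above (parseSummary and the start-marker regex are illustrative names, not the real lib; it also assumes each line is already valid JSON, i.e. after the addFldsCommas normalization):

```javascript
// Sketch: parse an array of one-per-line JSON objects from the middle
// of a text file, starting after the first line matching startRe
// (assumes the marker line exists in the file).
function parseSummary(text, startRe) {
  const lines = text.split('\n');
  const startIdx = lines.findIndex((line) => startRe.test(line));
  const objs = [];
  for (const line of lines.slice(startIdx + 1)) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('{')) break; // end of the object block
    objs.push(JSON.parse(trimmed));
  }
  return objs;
}
```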
Q/AReport-TopTopics: Config -
Q/AReport-TopTopics: Ex: Run Mar-Apr on david,ram
Q/AReport-TopTopics: Several Date-Ranges (select named periods mar-apr, )
Q/AReport-TopTopics: Database to get updates and models from (future: updates from Graph?)
Q/AReport-TopTopics: dontUseXXX - ablation tests: compare reports turning off some of the algorithms
Q/AReport-TopTopics: Validation for config/cmd-line selections
Q/AReport-TopTopics: Write reports (ReportsStorage) based on selection user+date-range
Q/AReport-TopTopics: Problem: [mongo-storage] - how to capture the query + connecting console.log (start of report) ?
Q/AReport-TopTopics: The report is not piped > file.txt anymore --> console.log will output to screen (not collected)
Q/AReport-TopTopics: report.js will explicitly collect diag return values (also from mongoStorage getUpdates)
Q/AReport-TopTopics: Q: Where do we keep history ?
Q/AReport-TopTopics: Detailed Report - first is the Summary
Q/AReport-TopTopics: Q: Diff - Machine Readable ?
Q/AReport-TopTopics: Q: How to find prev report for latest Diff ?
Q/AReport-TopTopics: A: Timestamp
Q/AReport-TopTopics: Q: Build CI machine ?
Q/AReport-TopTopics: Problem: Q: 20000 files repo each one 7MB ?
Q/AReport-TopTopics: Summary + details pointers updateIds + their factors
Q/AReport-TopTopics: updates.json for this report (formatted?)
Q/AReport-TopTopics:Problem: Terms change (Compound) --> updates.json changes
Q/AReport-TopTopics: terms.json separate file (treat terms as algo output)
Q/AReport-TopTopics:"Language Model"
LanguageModel:Productization
LanguageModel: If total < MIN_STAT (as we do in PER) --> return score=0.5 --> cannot disqualify it.
LanguageModel: mongo_diff
LanguageModel:cd D:\views\Collage.Topics\Reports\helpers
LanguageModel:node --max_old_space_size=3500000 mongo_diff.js --collectionOld languagemodel --dbUrlOld mongodb://localhost:27099/collage_new --collectionNew languagemodel --dbUrlNew mongodb://localhost:27017/collage --key gName > output\lm_collage_new_vs_prod.txt
LanguageModel: Report diffs:
LanguageModel:    Q: How come microsoftteams was changed from m: badLMTopic->undefined but its rank did not change ?
LanguageModel: Q: How come microsoftteams was badLMTopic, when it is a bigram ?
LanguageModel: Move expectedTopicsTest + main --> unitTest in __tests__ --> create the output output/expectedTopicsTest.json --> file commit to git (to see track changes when changed)
LanguageModel: gName: 'london ae candidate' - total : 3 (new) and 17 (old - Copy_of_languagemodels)
LanguageModel: A: Seems new code is correct.
LanguageModel: extractTerms node process doesn't exit --> --noLM doesn't repro --> Which promise in LM / framework ?
LanguageModel: lmConversations bug ?:
LanguageModel: There are few added and many
LanguageModel: 'connecting software' bigram --> old total: 2 (correct), new total: 4
LanguageModel: Correct behavior ()
LanguageModel: lmConversations count: 10103 (same as inmem-norm-subject set)
LanguageModel: Re
LanguageModel:      On second run, the diff is much smaller - all diff tokens are common subject tokens (re, fw, re :)
LanguageModel: see Diff D:\views\Collage.Topics\Reports\batchExtractor\outputs\diff_new_old_lm_samecount_798497_gte_2.txt
LanguageModel: Ex: the new returned the same number of gNames as old (798497_gte_2) + 'connecting software' bigram --> total Now changed 4-->2 (correct)
LanguageModel: eslint config for all Collage.Topics/Reports - as was done for terms-processing
LanguageModel:gName --> id (filter in updateRecs)
LanguageModel: getGramsLMScore - adjust to new schema if needed.
LanguageModel: deleteMany - copy to GenericStorage
LanguageModel: basic connect and incremental update stats.
LanguageModel: Q: Keep ? and !
LanguageModel: Delete total : 1 && updateAt < 2 weeks ago.
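LanguageModel: The delete rule above could map to a deleteMany filter along these lines (buildGcFilter and the updatedAt field name are assumptions from these notes):

```javascript
// Sketch of the garbage-collection filter: total of 1 and not
// updated in the last two weeks. Field names are assumed.
function buildGcFilter(now = Date.now()) {
  const twoWeeksMs = 14 * 24 * 60 * 60 * 1000;
  return { total: 1, updatedAt: { $lt: new Date(now - twoWeeksMs) } };
}
// e.g. await langModelStorage.deleteMany(buildGcFilter());
```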
LanguageModel: extractTerms report-log --> include toggled lmBadTopic (at the updateId when they change) - same info for enriched.
LanguageModel: No point including all tokens that have their stat change (too much log output)
LanguageModel: Redis-Lua
LanguageModel: After Terms-Processing --> send alpha-numeric tokens (not Terms) to Redis-Lua + idxTokenStartBody
LanguageModel: Lookup ConversationId (see below) --> ignore Subject tokens if found
LanguageModel: Update each token stats
LanguageModel: Problem: Online: Do not count again in the same duplicate subject (same email-thread)
LanguageModel: Store a Set of seen ConversationIds in Redis (separate from LangModel)
LanguageModel: If Race --> doesn't matter --> another worker has updated a gram from the subject of same Conversation AFTER current worker started processing it -->
LanguageModel: Impact: gram stat is +1 or +2 from correct count
LanguageModel: Clear ConversationId Set when extractTerms starts: require('langModel') --> langModel.init, unless -u <updateId> flag (incremental update)
LanguageModel: Note: only Clear at start and Update if --save and No flag --dontUseLangModel
LanguageModel: Opt: Conversation Affinity - all updates from same conv are routed to same Online worker.
LanguageModel: Lua to incr stats
LanguageModel: Do not filter out Duplicate
LanguageModel: We cannot calc Duplicate for every unigram (all tokens - not only Terms)
LanguageModel: Do not update grams inside Url --> since they are not a real eng language
LanguageModel: Depends on other features - not estimated here
LanguageModel: Opt: single word (filter out non-word tokens): No need for bi-grams as they are not currently used.
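LanguageModel: An in-memory sketch of the online counting with the seen-ConversationId guard described above (the real version keeps both structures in Redis and does the increments in a Lua script; all names here are illustrative):

```javascript
// In-memory stand-in for the Redis structures described above.
const seenConversations = new Set(); // ConversationIds already counted
const gramStats = new Map();         // token -> count

function countUpdate({ conversationId, subjectTokens, bodyTokens }) {
  // Ignore Subject tokens if this conversation was already seen
  // (do not count the same duplicated subject twice per email-thread).
  const dupSubject = seenConversations.has(conversationId);
  seenConversations.add(conversationId);
  const tokens = dupSubject ? bodyTokens : subjectTokens.concat(bodyTokens);
  for (const t of tokens) gramStats.set(t, (gramStats.get(t) || 0) + 1);
}
```

Under a worker race this guard can over-count a subject gram by +1 or +2, matching the tolerated impact noted above.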
LanguageModel:: Daily:
LanguageModel: Reads from Redis all dirty grams (stats changed but not yet processed)
LanguageModel: Calc bad/good : langModel.getGramsLMScore(<array of dirty grams>)
LanguageModel: If toggle (ex: good->bad) --> Update Topic node Graph. lmBadTopic = true.
LanguageModel: TopTopics: uses the lmBadTopic from Graph to filter out / discount.
LanguageModel: Cleanup sources:
LanguageModel:    Subject - should we use it ?
LanguageModel: If Dup, Urls/Emails/Paths , Sig (lali - keep) --> do not use it for language model, because it is not an english sentence.
LanguageModel: Lali: Test very short sentence (Thanx,) ?
LanguageModel: Do not disqualify bi-gram using lower/upper (currently only can score < 0 unigrams):
LanguageModel: A/AN Bug: a Sharepoint file/guy/conference/migration project --> high count of 'a' before unigram 'SharePoint' --> when we need to count 'a' before 'SharePoint File'
LanguageModel: Possible unigram - if we can determine it is a unigram and not part of bi-grams (a sharepoint <something>) --> maybe the a/an rule does work for those cases
LanguageModel: Rerun the report with a/an on (some threshold) --> examine diff - for unigrams (that we manually know are unigrams in expectedTopics).
LanguageModel:      Note: Not even sure we like to rule-out 'SharePoint Migration' as a Topic - it is not a strict proper noun
LanguageModel: Problem: 'management buyout', 'proxy configuration' in upper and lower (it occurs almost only in lower) --> same Topic !
LanguageModel: If 'management' appears as a prefix of several good (bad) compound topics --> score higher (lower) compound topics starting with 'management'.
LanguageModel: good / bad will be computed from other source - such as ignored topics
LanguageModel: Con (of not using lm for bi-grams): Bigrams that could be disqualified based on lowerRatio : We do not disqualify now: 'web page', 'user name', 'proxy configuration', 'online meetings', 'digital marketing', 'direct line' (not confident),
LanguageModel: How to disqualify bi-grams in the future ?
LanguageModel:Q: Only disqualify Topics which are common language phrase (e.g 'web page', 'user name', 'on board') - can use Google NGrams + Email stat
LanguageModel:Q: In bi-gram
LanguageModel:    Conc: Cannot disqualify bi-gram based only on high web freq, because 'Board of Directors' has high freq (maybe it is not a real PROPN Topic - general term)
LanguageModel: badTopicsCorrect bi-grams: web page, user name, practical guide, online meetings, digital marketing, direct line
LanguageModel: * management buyout - goodTopicsIncorrect
LanguageModel: total: 23,allLower: 20,startsUpper: 3,startsUpperFirstInSentence: 2,afterMidUpperSU: 1, afterPos: 0,afterPosSU: 0, afterAn: 5,
LanguageModel: Difficult Topic, since in body, it appears almost always lowercase (Subject in uppercase, but current LM remove duplicate subjects)
LanguageModel: Subject -
LanguageModel: LM remove duplicate subjects --> count of Upper from the thread is 1, but still count the lowercase from the bodies of these emails.
LanguageModel: We need to weight subject
LanguageModel: When lowerRatio shouldn't be used ?
LanguageModel: Problem: same ngram is used lower and upper interchangeably in the same context (wordvectors) or lower and upper co-occur.
LanguageModel: WordVectors: Both lower and upper forms occur around the same words context --> they are the same
LanguageModel: Consider large window (not 5 words, but the whole email)
LanguageModel: board, teams - goodTopicsIncorrect. lowerCaseRatio: 0.90476
LanguageModel: After Lali change lowerCaseRatio >= 0.8 --> certain badTopic --> we lost Board
LanguageModel: Many of the lowercase are actually the Topic 'Board' - (Stanford: only Upper are NNP, lower - NN) --> surrounded by same words as the upper.
LanguageModel: Run board.context also for lower - to prove that.
LanguageModel: Does board (lowercase) appears in same context as 'Board' (Ex: board of directors, ) --> some board will go to Compound (board meeting)
LanguageModel:and some will be in same context as Board --> meaning they are the same and board is a topic.
LanguageModel:decrease board unigram count.
LanguageModel: Without afterMidUpperSU - board and teams are correct (good)
LanguageModel: Short-form (Similarity): If 'Board' co-occur 'Management Board'
LanguageModel:      Batch getLMScore --> fast !
LanguageModel:      Print standalone 'Teams' occurrences - is it enough we have several dozens of these (maybe from diverse sources/authors) to declare it as a Topic - without counting ratio ?
LanguageModel: Problem: some badTopics will pass:
LanguageModel: 'view' - startsUpper: 3188 - startsUpperFirstInSentence : 2042 - afterMidUpperSU : 224 > 800 legit-startsUpper --> will make it a good topic
LanguageModel: 'register' - 250 legit-startsUpper
LanguageModel: 'below'
LanguageModel: Use afterMidUpperSU in score.
LanguageModel:      Q: Does Dups resolve that ?
LanguageModel: allLower: 93, startsUpper: 60 (only 3 of the 60 in sentenceStart)
LanguageModel: many counts within other Topics: 'User Story', 'Science of a Story'
LanguageModel: Report:
LanguageModel: Add Counter Inbox vs sentItems, isFocused vs non-Focused (to report only)
LanguageModel: Add score -0.5
LanguageModel: Add measure for confidence - correct-confident incorrect-confident
LanguageModel:    Note: When adding new factors (afterMidUpperSU) we want to fix incorrects, but also have the classifier less sensitive --> less 0.5,-0.5 and more -1,1
LanguageModel: Corrl: We want correlation between confidence and correctness (also divided to good and bad)
LanguageModel: Export LM and mailUpdates --> Yheonathan - New Train/Test Set of many Good and Bad Topics
LanguageModel: ML scikit-learn model to separate Good vs. Bad Topic based on LangModel features
LanguageModel: 5000-freq words - tie breaker if 1.2 > lowerCaseRatio > 0.8
LanguageModel: Buy the 100,000 list with n-grams
LanguageModel:: Duplicates: We want to count unique Linguistic contexts (sentences)
LanguageModel: Dups:
LanguageModel: Extension: 701 - General\r\nGrasshopper #: (Voice Mail generated )
LanguageModel: Titles: Assistant General Counsel, General Partner/Manager
LanguageModel: Simple (Exact match duplicate) unigram cur token --> If tri-gram from prev to next token already exists in model --> this unigram token is duplicated
LanguageModel: Q: If tri-gram 'User Story' is a topic and occur 5 times - should we count its unigram 'Story' as duplicated ?
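LanguageModel: The exact-match duplicate test above can be sketched like this (seenTrigrams and the in-memory Set are assumptions; occurrences at the text edges are ignored here):

```javascript
// Sketch: a unigram occurrence at position i counts as duplicated if
// the (prev, cur, next) trigram around it was already seen in the model.
const seenTrigrams = new Set();
function isDuplicatedOccurrence(tokens, i) {
  // assumes 0 < i < tokens.length - 1
  const tri = `${tokens[i - 1]} ${tokens[i]} ${tokens[i + 1]}`;
  if (seenTrigrams.has(tri)) return true;
  seenTrigrams.add(tri);
  return false;
}
```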
LanguageModel: Q/A:
LanguageModel: Ram + Yaacov
LanguageModel: Unit Tests
LanguageModel: Expected - *Good* and Bad Topics
LanguageModel:: Q: Should we count words we want to eliminate from topics
LanguageModel: duplicate-subjects
LanguageModel: const helpers = require('./nlp-helpers');
LanguageModel: Problem: How to store the nbrs for each token ? There are many such nbrs
LanguageModel: Terms will have their nbrs tracked (future terms-processing)
LanguageModel: signature
LanguageModel: marketing emails
LanguageModel: Compare lowercase/upper case with simple common-words (5000 freq-words) lookup
LanguageModel:: Paging - maybe will not be able to read large amounts of text into memory
LanguageModel:LM conclusions and problematic Topics
LanguageModel:- - - - - - - - - - - - - - - - - - -
LanguageModel: Disqualify bi-grams ? (Currently only disqualify unigrams)
LanguageModel: Problem: 'management buyout', 'proxy configuration' in upper and lower (it occurs almost only in lower) --> same Topic
LanguageModel: Problem: common words (build,word) which are (when NNP capitalized) - are topics in harmon.ie email context.
LanguageModel:    Org Dictionary should take precedence over LanguageModel - Word, Workplace - Topics which are also common words in LM
LanguageModel: contexual (surrounding terms in email) Wikipedia popular Entities can also serve as generic Dictionary --> in context (similar to opencalais Social Tags)
LanguageModel: When lowerRatio shouldn't be used ? (see above)
LanguageModel: Problem: same ngram is used lower and upper interchangeably in the same context (wordvectors) or lower and upper co-occur.
LanguageModel: allLower / (startsUpper - startsUpperFirstInSentence - afterMidUpperSU)
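LanguageModel: The ratio above, sketched with the 'management buyout' counts from these notes (clamping the denominator to 1 is an assumption to avoid division by zero; the 0.8 cutoff is the lowerCaseRatio threshold mentioned below):

```javascript
// lowerRatio = allLower / (startsUpper - startsUpperFirstInSentence - afterMidUpperSU)
// Denominator clamped to 1 (assumption) so the ratio stays finite.
function lowerRatio(s) {
  const legitStartsUpper =
    s.startsUpper - s.startsUpperFirstInSentence - (s.afterMidUpperSU || 0);
  return s.allLower / Math.max(legitStartsUpper, 1);
}
const isCertainBadTopic = (s) => lowerRatio(s) >= 0.8;

// 'management buyout' (goodTopicsIncorrect): almost always lowercase in bodies
const mb = { total: 23, allLower: 20, startsUpper: 3,
             startsUpperFirstInSentence: 2, afterMidUpperSU: 1 };
```

lowerRatio(mb) is 20/1 = 20, far above 0.8, so this good topic gets flagged bad - exactly the 'management buyout' failure discussed above.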
LanguageModel:total: 346,
LanguageModel:notFound: 71,
LanguageModel:lowStats: 32,
LanguageModel:correct: 187 --> 191,
LanguageModel:badTopicsCorrect: 61 --> 67,
LanguageModel:badTopicsIncorrect: 47 --> 41,
LanguageModel:goodTopicsCorrect: 126 --> 124,
LanguageModel:goodTopicsIncorrect: 9 --> 11
LanguageModel:    6 badTopics were gained (badTopicsCorrect) + confidence of lowerRatio is much stronger.
LanguageModel: 3 good topics were lost
LanguageModel: board - Management Board, Advisory Board, Endgame Board (company name), Job Board.
LanguageModel: teams - Microsoft Teams
LanguageModel: Conc: While 'Teams' is a short-form for Microsoft Teams, it is not so evident from lowerRatio stats, as 'Teams' appears ~ 823 - 9 - 582 as upper and 573 as lower.
LanguageModel: explorer - Internet Explorer, IBM File Explorer
LanguageModel: notFound: 107/349 - do not appear in 8500 mails --> 71/349 do not appear in 20000 emails (4 users 9/17-5/18)
LanguageModel: We can say that if a token doesn't occur in email at all, we do not need LM to decide if it is a bad/good topic.
LanguageModel: A: A problem with our test-topics (not in LM)
LanguageModel: Ex: ticket id, disa tem .... - most of them are junk - we should cleanup.
LanguageModel: We can increase the number of detected good topics (maybe contribute to ranking)
LanguageModel: Right now, we can only disqualify badTopics --> so it doesn't help much to inc the number of good topics.
LanguageModel: Filter out isFocused : true --> notFound += 71 --> 91 most of them Good Topics
LanguageModel: Stats went down (106 GoodCorrect / 9 GoodIncorrect ), but in reality, it still identifies Good and Bad Topics the same.
LanguageModel: Conc: Since most isFocused : false emails are marketing emails --> filtering better isMarketingEmail from LM (as opposed to TopTopics report) --> will not help much.
LanguageModel:    Only if 1.1 < lowerRatio < 1.5 --> lookup freqWords list
LanguageModel: Only fixed 2 incorrectBadTopics - ok and notice.
LanguageModel: Consider if we want to risk using it if it helps so little (2 fixed/348)
LanguageModel: general - Dups, isMarketingEmail/generated email (see below)
LanguageModel: view - isMarketingEmail
LanguageModel: Ex: Business Intelligence,Artificial Intelligence, 2nd Intelligence Analytics Summit
LanguageModel: Correctly detected Bad Topics, which are borderline (sensitive)
LanguageModel: network - sub-topic - appears a lot in upper inside 'C-Suite Networks' - similar to 'User Story'
LanguageModel: Remove Sub-Topics (cx network - 50, c-suite network - 36)
LanguageModel: Good Topics incorrectly (detected as Bad)
LanguageModel: 'klipse' - more lower than upper (9 total)
LanguageModel: Filter out urls (http://blog.klipse.tech/assets/yehonathan_profile.jpeg)
LanguageModel: 'word' - 'Word' as a shortcut for MS Word vs. word (high lowerRatio)
LanguageModel:
LanguageModel: Org Dictionary + contextual Wikipedia (Social Tags)
LanguageModel: 'build' - the Build Conference vs. the verb to-build
LanguageModel: Is a common english word (position 409/5000)
LanguageModel: 1300 (allLower) / 622 (startsUpper - startsUpperFirstInSentence)
LanguageModel:      Conc: build is context sensitive, mostly not a topic but sometimes (Conference) is.
LanguageModel: 'groups' - 321/130
LanguageModel: Note: Stanford thinks all occurrences are NNS (PROPN) ! even those in lower case ('Meetup groups').
LanguageModel:    'harmon.ie mobile' - appears 32 times, but doesn't start with Upper (hence not an LM Topic ) - harmon.ie Mobile/harmon.ie mobile
LanguageModel: 'ios' - total: 463, allLower:13, startsUpper: 7, startsUpperFirstInSentence: 3
LanguageModel:      'iOS' - not all lower, but doesn't start with upper either.
LanguageModel: 'machine learning' - lower 164 / upper 110. It is used a lot in a middle of a sentence with lower.
LanguageModel:    'deep work' - 34/(32-3)=1.17 lowerRatio total: 66, allLower : 34, startsUpper : 32, startsUpperFirstInSentence: 3,
LanguageModel:      Starts with upper (SU) > 2 times after a, an or possessive --> makes it badTopic
LanguageModel: Non LM issues:
LanguageModel: english - the lang-name (English) is a real PROPN
LanguageModel: Ex: excellent English skills, English message follows, a typo in English in
LanguageModel: president - 'total': 260, 'allLower': 35,'startsUpper': 225,'startsUpperFirstInSentence': 21,
LanguageModel: President (by itself) may not be a topic (may be in some particular context), but it does appear almost always in Upper (part of a title)
LanguageModel:      Q: Can we count grams that usually do not stand alone (in Upper) --> conclude President or VP are too general ?
LanguageModel: Note: Different from 'Network' and 'Story' in that 'President' is not preceded by Upper
LanguageModel: vp - title. not a real topic (by itself), but always upper.
LanguageModel: Similar to president above.
LanguageModel: Good Topics that are currently correct, but problematic (close to boundary - model parameter sensitivity implies problems with the model)
LanguageModel:"DONE LM
LanguageModel: Kickoff presentation
LanguageModel:Yehonathan reports.js
LanguageModel: Change --prod
LanguageModel: Use topics collection
LanguageModel: Refactor: Move /5000_most_frequent_english_words_lemmas.csv --> isNonPerson (email-util)
LanguageModel: Cleanup at End: disconnect from DB:
LanguageModel: Change every extractor/update/enricher to be an object (not a function) containing .extract/.update/.enrich + .cleanup
LanguageModel: Iterate modules keys (extractors) --> concat all arrays of modules --> call cleanup (Promise.all style)
LanguageModel: at each cleanup:"await langModelStorage.disconnect();
LanguageModel: lmConversations - is updated even without --save (because otherwise it will extract from duplicate conversation subjects)
LanguageModel: Ops: When deleting languagemodels collection need also to delete lmConversations
LanguageModel:: -t '<text>' - Add extractors and enrichers to output
LanguageModel:    Postpone: Only LM requires pure-text. The others require metadata. -u is a good workaround
LanguageModel: Problem: PER extractor requires rcpts --> extractPersonStats({ artifact }) --> but artifact is undefined
LanguageModel: <text> --> tokens --> passed to personExtractor
LanguageModel: --include/--exclude --> will also affect -t
LanguageModel: Tests and Q/A:
LanguageModel: First run and debug of undefined stuff (mongoUrl, extractor is not a function ...)
LanguageModel: Collection: Updater will corrupt original languageModels collection (because GenStorage is not yet impl) --> Copy_languageModels
LanguageModel: Compare results with current langModels collection (even several manually + some aggregate stats) --> ensure 0 diff
LanguageModel: Found { gName: /missing followed sites/} - new total : 110, old total : 80
LanguageModel: { gName: 'support request' } - total old: 3 new:5
LanguageModel: mongo_diff - change query
LanguageModel:diff = await diff2Collections({ ...options, queryNew : { total : { $gte : 2} } });
LanguageModel: New has fewer recs: 318,052 { total : { $gte : 2} }, old: 819,336 (after garbageCollect removed total < 2)
LanguageModel: Problem: Same 1 day query: Still 4721 (new - { total : { $gte : 2} }) vs. 4705 old
LanguageModel: Create diff in single update -u <> (in mongo term, then revert)
LanguageModel: maybeSave is passed logStr by value --> need byRef so it can += append to log inside
LanguageModel: Bug: Super Slow updateRecChunks
LanguageModel: Seems to be fixed by Yehonathan commit that eliminate the memory leak (results.push<everything>)
LanguageModel: Index ? Is it a function of langmodels collection size ?
LanguageModel: Test: Try new extractTerms - with langmodels already filled.
LanguageModel:    Memory ? Why does the extractTerms process take 2 GB when it is paging + online ?
LanguageModel: Test: --noPer --noLM --save: Does it still take so much memory ?
LanguageModel: If LM - try without bulk-update (without --save)
LanguageModel: If updateRecChunks --> try append instead (a different write DB api)
LanguageModel:      Mongoose taking all memory ? Replace it with another lib ?
LanguageModel: Test: Try --max_old_space_size=3500000
LanguageModel: A: Very fast: most updateRecChunks takes < 25 ms but sometimes it pauses a little (updateRecsChunk durationMs: 5189)
LanguageModel: ConversationId: genStorage = require --> new GenericStorage
LanguageModel: Q/A: tokens are mutated by lm countStats --> normText --> written to update.token.allLower, startsUpper ... - bug
LanguageModel: .map all tokens to tokenX at the beginning of countStats --> then use tokenX array instead of tokens array.
LanguageModel: termsDiff: Remove console.log(`******* updated
LanguageModel: Change it to deep (but efficient)
LanguageModel: diff JSON.stringify of 2 tokens / terms --> meaning very exact diff (Even order of keys matters !!!)
LanguageModel: Note: Term now includes nested occur - array of objects
LanguageModel: Bug: Why 'Terms Review' newItem doesn't have occurs[0].insideBrackets ?
LanguageModel: Create topics collection and use it in Mongo-storage-worker
LanguageModel: --noPER --noLM
LanguageModel: Motivation: Test in isolation only the part you work on (saves time and doesn't change DBs). Also turn off temp-broken code.
LanguageModel: Default: Terms + All algos (we change it if something breaks and we need to disable)
LanguageModel: Q/A: --save: If not --save --> do not write to mongo. If --save: need to delete the whole myContacts collection
LanguageModel: mongoDB config (options ?)
LanguageModel: Change in Enricher and Updater
LanguageModel: --save is broken.
LanguageModel: fixed maybeSave(options.save, ... ) --> maybeSave(options, ... )
LanguageModel: Fix broken unitTest of termsExtractor
LanguageModel: Extractor: countStats
LanguageModel: Predictor: getGramsLMScore
LanguageModel:    A: Incremental mongodb stat updates $inc --> no need --> increased memory to 3.5GB for now.
LanguageModel: Test with few records query
LanguageModel:Debug:Restore pageSize --> 5000
LanguageModel: updateRecs - $inc
LanguageModel: garbageCollectLowStatsFromDb
LanguageModel: Bug: automatic: total 130 in new (.tokens) and total 96 in prev langModel - how ?
LanguageModel:A: Counted all 'tokens.originalText' : /^automatic$/i --> exactly 130, while if counting old bodyTokens + subjectTokens /^automatic$/i --> 97 (~96)
LanguageModel: Use in reports.js
LanguageModel: adjustTopicScore --> nlp-helpers: Lookup normText(topicText) --> If num bads > num goods (Upper...) -->
LanguageModel:    Add to Top Topics Summary: dup,sig,lm-bad/good/lowstat/nf
LanguageModel: 20000 mailUpdates --> new LM.
LanguageModel:    A: Incremental mongodb stat updates $inc --> no need --> increased memory to 3.5GB for now.
LanguageModel: Change test3 barrier to 8 months --> delete tokens + refresh --> import more mailUpdates for LM.
LanguageModel:1003BFFDA43EFAAA
LanguageModel: Convert to array
LanguageModel: Mongoose LangModel (collection)
LanguageModel:    Possessive - can we trust it to kill a Topic ? Maybe as uncertain factor only (ML) ?
LanguageModel: My MS Graph subscription, Accelerate your GDPR compliance
LanguageModel: Seems that it cannot disqualify 'General' (startsUpper >> allLower)
LanguageModel: Fix bugs - generate report Good/Bad on 8500
LanguageModel: Count stats
LanguageModel: unigram: lower / upper n-gram: all upper / first word upper
LanguageModel: upper following possessive
LanguageModel: upper following number
LanguageModel: Too large model
LanguageModel: Memory should suffice for 15000-30000 emails --> at the end, do not write to DB the bigram,unigram that only have 1 occurrence
LanguageModel:      Say 15000 emails with Avg 200 tokens each --> 100000 unigram (with most stat), 200 bi-gram per email --> 15000 * 200 = 3M --> 6M tokens + 3M trigrams --> additional 9M tokens
LanguageModel: Since we only want to query the LM with existing Terms - Let's only build it for uni/bi/tri-gram matching terms (case-insensitive)
LanguageModel: Problem: This will work for reports, but not for terms-processing, where it finds new terms every second --> LM don't yet have them.
LanguageModel:
LanguageModel: Lazy LM: FullText search (or regexp search for /(porche|qa|sharepoint|story|select)/) --> tokenize --> loop every result subject+body searching for the hit (can do with same regex)
LanguageModel:    bi-gram terms --> split to 2 single --> tokens regex --> check result that the 2 query tokens are consecutive.
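LanguageModel: The consecutive-token check for bi-gram terms could look roughly like this (countBigram is an illustrative name; the real code would run on the regex/full-text hits):

```javascript
// Sketch: after a regex/full-text hit, verify the two tokens of a
// bi-gram term are actually consecutive in the tokenized text
// (case-insensitive, matching the term lookup).
function countBigram(tokens, bigramTerm) {
  const [w1, w2] = bigramTerm.toLowerCase().split(' ');
  let count = 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    if (tokens[i].toLowerCase() === w1 && tokens[i + 1].toLowerCase() === w2) count++;
  }
  return count;
}
```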
LanguageModel: Merge dict-topics with topTopics1 to new-branch topicsExpr and do not delete update.tokens
LanguageModel: Neural LM ? How does it work ?
LanguageModel: LM Toolkits?
LanguageModel: http://www.speech.sri.com/projects/srilm/papers/icslp2002-srilm.pdf
LanguageModel::Bug: 'Accepted' is a Topic according to LM: 80 allLower, 194 startsUpper, 194 startsUpperFirstInSentence
LanguageModel: many template emails with subject 'Accepted: Sprints and Stories Review'
LanguageModel:    Note: Not a duplicate subject (the whole subject is 'Accepted: <different text>') --> left nbr is empty and right nbr is not duplicated
LanguageModel: Detect duplicated patterns at subject prefix.
LanguageModel:Register:
LanguageModel: Report
LanguageModel: Add Good Topics + Refactor expected
LanguageModel: Add Accuracy + Accuracy in Bad + Accuracy in Good
LanguageModel: afterPos (without su), total
LanguageModel: Add POS histogram. / PROPN vs. Non-PROPN
LanguageModel: Cleanup Bad Topics (Account Executive, office online, online services, microsoft way ...)
JVMJavaScriptEngineforPorting: Mvn build + Deploy --> embed as build resource in topics-job.jar
JVMJavaScriptEngineforPorting: Add npm run build-report --> webpack + babel on report --> generates terms_processing/report/dist/topTopicsEs5.js --> commit --> subtree split to collage_stable
JVMJavaScriptEngineforPorting: topics-job fetch and load .js
JVMJavaScriptEngineforPorting: Download .js
JVMJavaScriptEngineforPorting: Sparse checkout mvn plugin: https://github.com/gastaldi/git-checkout-plugin
JVMJavaScriptEngineforPorting: To have mvn add the .js file to the .jar, copy .js into src/main/resources.
JVMJavaScriptEngineforPorting: See https://maven.apache.org/guides/getting-started/index.html#How_do_I_add_resources_to_my_JAR
JVMJavaScriptEngineforPorting:    mvn lingo: part of the generate-resources phase in the build lifecycle. A phase is a list of goals. Specifying a phase also executes all phases preceding it
JVMJavaScriptEngineforPorting: Idea Pre Launcher UI --> Add above maven goal
JVMJavaScriptEngineforPorting: Load stream of .js file from .Jar
JVMJavaScriptEngineforPorting: Keep ScriptEngine in per-worker global variable (ScriptEngine is not serializable and we do not need it to transfer between processes)
JVMJavaScriptEngineforPorting: If not --> x2-x6 slowdown.
JVMJavaScriptEngineforPorting: Refactor Report to be called both from research and Spark
JVMJavaScriptEngineforPorting: Special mode, that is different from research and from prod
JVMJavaScriptEngineforPorting: Sig: If Graph input (as opposed to mongo Input) --> options.dontRemoveSignature = true --> as it was already removed in terms-processing
JVMJavaScriptEngineforPorting: Duplicate: Refactor to use terms nbr (will be avail in both Graph and mongo)
JVMJavaScriptEngineforPorting: isPerson, isAutomated -
JVMJavaScriptEngineforPorting: Input: array of join-lines per-user
JVMJavaScriptEngineforPorting: Research wrapper will query mongo, create joined-artifacts
JVMJavaScriptEngineforPorting: Person, badLMTopic and isAutomated are provided in Input (not loaded from mongo)
JVMJavaScriptEngineforPorting: Output: per-user ranked topics + factors
JVMJavaScriptEngineforPorting: Port to Nashorn / GraalVM
JVMJavaScriptEngineforPorting: Motivation: We need a single source for Ranker and Algorithms --> so we need it either in JS (as we have today) --> Nashorn + later GraalVM, or rewrite all Report + Algos in Java.
JVMJavaScriptEngineforPorting: Same test but with GraalVM: https://amarszalek.net/blog/2018/06/08/evaluating-javascript-in-java-graalvm/
JVMJavaScriptEngineforPorting: Pass array of strings + array of arrays in InvokeFunction
JVMJavaScriptEngineforPorting: See Array.asList(1,2,3,4) + Foo extends AbstractJSObject - https://stackoverflow.com/questions/30571711/seamlessly-pass-arrays-and-lists-to-and-from-nashorn
JVMJavaScriptEngineforPorting: Babel report.js (or some subset of topTopics.js, without mongo or fs calls) --> examine if portable to Nashorn.
JVMJavaScriptEngineforPorting:Problem: Nashorn is deprecated https://openjdk.java.net/jeps/335
JVMJavaScriptEngineforPorting: ? jjs tool is not in JDK 11 at all ? We will upgrade Spark to newer libs requiring newer JDK soon ...
JVMJavaScriptEngineforPorting: Nashorn successor is GraalVM - an advanced Oracle VM combining JVM with many other Programming languages (JS, Python, R ....). It has --nashorn-compat flag
JVMJavaScriptEngineforPorting: Problem: Its community edition is free, but its Enterprise edition costs money (call us for pricing ...)
JVMJavaScriptEngineforPorting: Problem: Not sure our version of Spark (compat with JDK 8, but not with JDK 11) will run on GraalVM.
JVMJavaScriptEngineforPorting: findTopicInText --> change special regex '(?<!\\w)' + escapedTopicText + '(?!\\w)' --> lookbehind not supported on Nashorn.
JVMJavaScriptEngineforPorting: escapedTopicText = '(?:^|\\W)' + escapedTopicText + '(?!\\w)';
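The rewrite above can be sketched as follows. Since '(?:^|\W)' consumes the boundary character (unlike the unsupported lookbehind), the reported match offset has to be adjusted; this is an illustrative sketch, not the exact findTopicInText implementation:

```javascript
// Sketch: whole-word topic matching without lookbehind (lookbehind is
// not supported by Nashorn's regex engine). '(?:^|\W)' consumes the
// boundary char, so the real topic offset is the match offset plus the
// position of the captured topic inside the full match.
function findTopicOffset(topicText, text) {
  const escaped = topicText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp('(?:^|\\W)(' + escaped + ')(?!\\w)');
  const m = re.exec(text);
  if (!m) return -1;
  return m.index + m[0].indexOf(m[1]); // offset of the topic itself
}
```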
JVMJavaScriptEngineforPorting: webpack + babel
JVMJavaScriptEngineforPorting: Refactor nlp-helpers to move fs-extra functions to another util file
Maven:set JAVA_HOME=D:\Program Files\Java\jdk1.8.0_191
Maven:mvn install
Maven: mvn deploy
Maven:Deploy to local repo after build.
Maven: mvn -B archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.harmonie.topics -DartifactId=duplicate-detector
Maven: Add local repo (commit .jar in git)
Maven: https://maven.apache.org/plugins/maven-deploy-plugin/usage.html
Maven:<project>
Maven:...
Maven:<distributionManagement>
Maven:<repository>
Maven:<id>internal.repo</id>
Maven:<name>Java Algorithms Internal Repository</name>
Maven:</repository>
Maven:</distributionManagement>
Maven:"Milestone 2"
Milestone2:? checkout feature/topics_spark_integration and cherry-pick 2 commits from topics_sql_integration
Milestone2:: Revert Aug only change +
Milestone2: Merge Collage.Topics: develop --> master
Milestone2:: Eliyahu: Schedule topicStats Job every 3 days + AdvancedTopicsProcessor instead of basic.
Milestone2:: SQL integration
Milestone2: Merge from topics_spark_integration + develop --> diffs --> pull request
PostMilestone: nonEuropean - merge / complete test
PostMilestone: Signature fixes (sent from my + Get Outlook for XXX ...) --> update package.json of terms-processing --> consume new version of signature from github.
PostMilestone: YS: signature token enricher
PostMilestone: Yair: dict-topic boost
PostMilestone: Jul:
PostMilestone:node reports --user davidl -d jul --prod --userDataDbURL mongodb://localhost:27099/july
PostMilestone:: Merge from develop BEFORE PostgreSQL: 0f0db960a361d441f95d3fa4396489281d749e1d (New Scaling)
PostMilestone: node mongo_diff --collectionOld topics --dbUrlOld mongodb://localhost:27099/july --collectionNew topics --dbUrlNew mongodb://localhost:27017/collage --projection "{\"_id\": false, \"id\": 1}" > output\topics_july_rearch_vs_prod_diff.txt
PostMilestone:: Why is there a large diff between july_master.topics.count = 3748 and july.topics.count = 2381?
PostMilestone: Ex: master 'Invoice INV-0998' vs. develop split to term1 Invoice and term2: INV-0998
PostMilestone: { updateId: '<DB6PR0601MB232618066EAD0367499C335AAC4C0@DB6PR0601MB2326.eurprd06.prod.outlook.com>'}
PostMilestone: -t --> both master and develop extract 'Invoice INV-0998' --> doesn't repro
PostMilestone:node ..\batchExtractor\extractTerms.js -t "RE: Invoice INV-0998 from Fifty Five and Five Ltd for harmon.ie"
PostMilestone: Small database july (Hodaya July.json):
PostMilestone: token filter - deleted encoded R&D !== R%26D --> split vp R&D (research) to vp (prod)
PostMilestone: <DB5PR06MB156054DD1CEA4D62644AA606AF330@DB5PR06MB1560.eurprd06.prod.outlook.com> - has vp topic in Graph
PostMilestone:<DB5PR06MB1560ABA165F6CF271BE65DE6AF330@DB5PR06MB1560.eurprd06.prod.outlook.com>
PostMilestone: Mongo - terms 'VP R&D'
PostMilestone:"Postmortem"
Postmortem: Diff to prev Milestone - keep in devfiles a Milestone2 folder with results of 2 Spark Jobs, Mongo prod + research collections exports
Postmortem: Add createdAt in addition to updatedAt (Ex: Lavi is created and updated at a very short interval)
Postmortem: lmEnricher - write in topics collection debuginfo to understand why badLMTopic : true/false - langModel[gName].total, allLower, startsUpper, ...
Postmortem: Env: Everybody should have all env (maybe except special production env)
Postmortem: Start early Q/A / build --> deployment to Azure --> streaming datasets --> live Data
Postmortem: Do not wait for Algorithms Dev-complete.
Postmortem: Timestamp batchExtractor
Postmortem: Saved Queries - SQL Query Tool (complex queries with lots of joins)
Postmortem: One change at a time - easier to explain diffs
Postmortem:Diff between Mongo prod (terms-processing) and Mongo Research (batchExtractor)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Serializable (1 worker) in Production --> Slower but eliminates Concurrency diffs
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Signature is not removed in research --> topics.count in research is larger
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):1) Details: The whole longer compound terms appear only in the sig, but Microsoft appears both in the sig and in the body outside the sig, so the token filter (only in prod terms-processing) filters out all sig tokens except Microsoft.
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): topics collection - sig topics only in research (~1500 in collage_new)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): languageModels collection - sig tokens only in research
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): findTopicInText has bugs: Subject: RCC_<something> --> doesn't find topic RCC --> incorrectly assumes sig += 1
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Duplicate: Doesn't care about Sig --> so more Duplicates inside Sig text in research
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Token filter - May create diffs in dups (affect rightNbr of office365 term in <AM2PR06MB0612B07237D56F14B98D3597DD4F0@AM2PR06MB0612.eurprd06.prod.outlook.com>)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Concurrency in writes to mongo - only in Prod are there multiple readers/writers
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): isAutomated total = 2.
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Parent - Children containedTopicsTopicKeys (Concurrency)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Same topicKey: 'spexpo': 'SP Expo' has containedTopicsTopicKeys 'SP' vs. 'SPExpo' (has Zero containedTopicsTopicKeys)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Graph has a single global topic node for 'spexpo'. It creates containedTopicsTopicKeys in the topic node the first time (depending on whether it got it from SPExpo or SP Expo)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Concurrency: Depends on which worker created the topic node - the one with 'SP Expo' or the one with 'SPExpo'
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Duplicate topic in Subject Conversations bug
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Research Report removes duplicates (same email in several inboxes), but doesn't necessarily take the duplicate with the current report user conversationId --> subject topics
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):can be duplicate / non-dup
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): isMarketing / autoEmail: JS - if isAutomated --> continue --> never reaches autoEmail --> meaning the first filter to catch the <artifact,topicId> hides the others
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):Spark: isMarketing is counted separately from isAutomated --> meaning isMarketing: 1, isAutomated: 1 --> maybe the same filtered artifact
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Date range: Prod / Spark uses 1/3 month ago --> today --> need to fix that to the same dateRange as JS report
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierFrom = 1533070800000L; //AUG 01-Aug-2018 00:00:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1535749140000L; //AUG 31-Aug-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierFrom = 1527800400000L; //JUN 01-Jun-2018 00:00:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1530478740000L; //JUN 31-Jun-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1533070740000L; //JUL 31-Jul-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):How to run
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):0) Prepare mailUpdates: node extractTerms.js --save --userDataDbURL mongodb://localhost:27099/july
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):1) Delete Graph / SQL DB
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):3) node convertFormats.js --userDataDbURL mongodb://localhost:27099/july --tokens
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Working with extractTerms when k8s is up (and listens to port 9000)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):set NLPServiceURL=http://collage-dev-nlp.westeurope.azurecontainer.io:9000
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):"Duplicate"
Duplicate: office365 dup diff:
Duplicate:davidl Jun (hodaya sent files in chat)
Duplicate:dup1 in research vs 0 in prod
Duplicate: vp subject topic: count + dup diff: yc aug --prod
Duplicate: Latest change to not allow subject topics newLine (retry if failed dup) --> threshold 0.35 (easier) --> resulted in 12 dups, which are now gone.
Duplicate:Research new (my machine): vp rank: 35.6 count: 18 fromMe: 5 sig: 3 childRank: 15.6 childCount: 11
Duplicate: totalSentByAutomatedActor": 0,
Duplicate:totalIsMarketing: 0,
Duplicate:rank: 34.2,
Duplicate:childCount: 10,
Duplicate:dup: 4,
Duplicate:countChildRanks: 9,
Duplicate:totalArtifacts: 17
Duplicate: ESOP - dup 3 in research and dup : 0 in prod
Duplicate: Q: How come 'Noam .' is 11 chars (MIN_LEN_COMPARED = 10 for body)
Duplicate: Research: esop rank: 4.3 count: 7 dup: 3
Duplicate:"totalSentByAutomatedActor": 0,
Duplicate:"general": false,
Duplicate:"topicId": "esop",
Duplicate:"totalIsMarketing": 0,
Duplicate:"rank": 7.0,
Duplicate:"childRank": 0.0,
Duplicate:"childCount": 0,
Duplicate:"totalSentByMe": 0,
Duplicate:"dup": 0,
Duplicate:"countChildRanks": 0,
Duplicate:"totalArtifacts": 7
Duplicate: Duplicate bug: topicId trit yc jul is dup : 1 in JS and dup : 0 in Java
Duplicate: text-duplicate-detector --> commit to github + update package.lock of report.js
Duplicate:
Duplicate:1) leftNbr includes part of 'Accepted' --> ted: Harmon.ie and
Duplicate:2) inSubject: First calcNbrDist duplicate = false --> calcNbrDist with DUPLICATE_THRESHOLD_NL = 0.35 (incorrect - shouldn't be called for Subject)
Duplicate:- - - - - - - - -
Duplicate: Phase 3
Duplicate: Port getDuplicateScore + new code --> Java
Duplicate: Port duplicate.js --> Java and prepare call from topics-job
Duplicate: Spark infra calls EmailsDuplicate.java (which is )
Duplicate: Spark cutoff at 100 topTopics --> prepare Row objects --> call EmailsDuplicate.java --> handles Conversations + Artifact related --> call DuplicateDetector algo --> rerank)
Duplicate: Problem: reCalcTopicRank --> adjustTopicScore (which we do not have in java package)
Duplicate: Solution Alt: implement reCalcTopicRank outside package (but we anyway need adjustTopicScore)
Duplicate: node reports --user ramt -d jul --prod --userDataDbURL mongodb://localhost:27099/collage_test
Duplicate: g.V().hasLabel('user').as('user').out('owns').has('email','[email protected]').select('user')
Duplicate: ram puser = 026da82f-3dcc-41cf-b13b-1d3582024ef5
Duplicate:davidl puser = e4c91e9b-4a9c-4173-b268-9dc0c8e73c89
Duplicate:yaacovc puser = 468a410e-7bb7-416e-8bd5-e6a037c6b5f7
Duplicate: Build new small Graph (dekel-dev)
Duplicate: Change COMPUTERNAME in storage-worker index.js
Duplicate: extractTermsInner: Change query to -d jul
Duplicate: node extractTerms.js --save --userDataDbURL mongodb://localhost:27099/collage_test
Duplicate: Change query in convertFormats.js
Duplicate: node convertFormats.js --userDataDbURL mongodb://localhost:27099/collage_test --tokens
Duplicate: Change Spark dateBarrierFrom / To
Duplicate: Robustness fix in duplicate.js (+ java) - may not have nbrs due to some physical
Duplicate: Graal regression tests for Algo change (+ new unit tests)
Duplicate: Graal interop from report --> replace duplicate.js with EmailDuplicate.java
Duplicate: Topic.artifacts --> populate with a list of JsonNode + ensure artifact.filtered is marked.
Duplicate: Commit: "Production:
Duplicate: Do not filter out tokens between endOfSubject and bodyStart indexes
Duplicate: Integration Test: Generate Tiny Graph with nbrs of few topics --> replace Mock
Duplicate: Phase 1 -> as in JS --> pass unitTests
Duplicate: Complete numDplicates / getDuplicatesScore
Duplicate: Add Wrapper to text-duplicate-detector::index.js (committed in github and requires original JS) --> to call Java.isDuplicate instead of JS isDuplicate / numDuplicates
Duplicate: reports.js GraalVM integration with JS to replace github/text-duplicate-detector
Duplicate: Test Chrome debugger - can it also debug Java?
Duplicate: Mocha: Add --inspect-brk after node_modules/mocha/bin/mocha (and not after $GRAALVM_HOME/bin/node !!!)
Duplicate: Commit Port to mocha - package.json devDependencies + port commented-out jest-specific test() calls
Duplicate:export GRAALVM_HOME=~/graalvm-ce-1.0.0-rc9/
Duplicate:$GRAALVM_HOME/bin/node --jvm --polyglot --jvm.cp=$dupPath node_modules/mocha/bin/mocha __tests__/duplicate.test.js
Duplicate: Note: jest seems to hang - port tests to mocha
Duplicate: Problem: TypeError: Access to host class Main.DuplicateDetector is not allowed or does not exist.
Duplicate: Rebuild .jar without errors - it is not currently built.
Duplicate:Bug: Pattern.quote(topicText) --> produces \QDekel Cohen\E --> the middle 2-space RE in \QDekel( {1,2}|%20 )Cohen is also escaped --> replace with JS escaping code
Duplicate: Q/A: [ and all other specials
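A sketch of the JS-side per-character escaping that replaces Pattern.quote (covering '[' and the other specials from the Q/A note); the flexible-whitespace joiner between words follows the ` {1,2}|%20 ` pattern from the bug note, but the exact production pattern may differ:

```javascript
// Sketch of the fix: escape regex metacharacters per character instead
// of Java's Pattern.quote (\Q...\E), so a flexible-whitespace pattern
// can still be spliced between the words afterwards.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
function topicPattern(topicText) {
  // allow 1-2 spaces or an encoded %20 between words (per the note)
  return topicText.split(' ').map(escapeRegExp).join('(?: {1,2}|%20)');
}
```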
Duplicate: Changes Java vs. JS
Duplicate: Proto now explicitly has topic1 and topic2 for 2 different topic texts (the JS topic can be a string or an array of 2)
Duplicate: GraalVM integration with JS to replace github/text-duplicate-detector
Duplicate: Ex:
Duplicate: $GRAALVM_HOME/bin/node --polyglot --jvm server.js
Duplicate: cd ~/graalvm-ce-1.0.0-rc9/graalvm-demos/polyglot-javascript-java-r
Duplicate: Phase 2
Duplicate: Conversation pair datastruct for inSubject topics
Duplicate: Bug: Us Submission: conversationPair causes only part of nbrOccur to get duplicated - why only 12 ?
Duplicate: Invitation: dup: 3->undefined - Why? It appears in subjects of several conversations
Duplicate: Q/A: nbrDistShortInSubject (Us Submission)
Duplicate: THRESHOLD_SCORE_DUP_IN_SUBJECT --> 3 ?
Duplicate: Q/A:
Duplicate: Compare to reports before changes: old version: 9ef7cbdfdade36930c2c3d9dca003e2954775dc4
Duplicate:Bug: YC Jul - missing duplicates in diff
Duplicate: azuredatafactory rank: 2.4->4.2 count: 5 fromMe: 1 dup: 4->2
Duplicate: Run reports and commit in branch (duplicate_occur) --use Conversation Ids
Duplicate: occurs: findTopicInText (first in body only and if appear in both subject and body, take the subject) -->
Duplicate: New: all occur of topic are considered
Duplicate: No dups between .occur of same artifact
Duplicate: New: Subject topics are compared only against other subject topics
Duplicate: conversationId vs. normalized subject
Duplicate: Take max dupScore of a single artifact.occur[<any>] --> If above threshold (3) --> all the artifact is discounted (for this topic)
Duplicate: Update topicRank
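The per-artifact rule above (take the max dupScore over an artifact's .occur; above the threshold of 3, discount the whole artifact for this topic) could look roughly like this. scoreOccur is a stand-in for the real pairwise duplicate scorer, and the field names are assumptions:

```javascript
// Sketch of the rule above: an artifact is discounted for a topic if
// the max duplicate score over any of its occurrences crosses the
// threshold (3 per the note). scoreOccur stands in for the real scorer.
const THRESHOLD_SCORE_DUP = 3;
function filterDuplicatedArtifacts(artifacts, scoreOccur) {
  for (const art of artifacts) {
    const maxScore = Math.max(0, ...art.occur.map(scoreOccur));
    art.filtered = maxScore > THRESHOLD_SCORE_DUP; // drop whole artifact
  }
  return artifacts.filter(a => !a.filtered);
}
```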
Duplicate: getNewlineNbr: retry with newline nbrs when the diff of normal nbrs --> no duplicate
Duplicate: use left,right nbr from .occur array (no findTopicInText) --> inTitle
Duplicate: { nbrLeft, nbrRight } --> { left, right }
Duplicate: artsReShaped = topic.artifacts.map --> artifact.about.occur -->
Duplicate:[artifact, { inSubject, leftNbr, rightNbr }]
Duplicate: keepAndConvertRelevantArtifacts --> !isSameArtifact + !isSameConversation (already exist)
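The artsReShaped mapping above, as a minimal sketch: flatten each artifact's about.occur array into [artifact, { inSubject, leftNbr, rightNbr }] pairs (the same-artifact / same-conversation filtering happens downstream in keepAndConvertRelevantArtifacts). Field names beyond those in the notes are assumptions:

```javascript
// Sketch of the reshaping above: one pair per topic occurrence, so
// duplicate detection can compare nbrs from .occur directly (no
// findTopicInText re-scan).
function reshapeArtifacts(artifacts) {
  const pairs = [];
  for (const artifact of artifacts) {
    for (const occ of artifact.about.occur) {
      pairs.push([artifact, {
        inSubject: !!occ.inSubject,
        leftNbr: occ.left,
        rightNbr: occ.right
      }]);
    }
  }
  return pairs;
}
```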
Duplicate: Diff Algo
Duplicate: https://github.com/java-diff-utils/java-diff-utils
Duplicate: https://github.com/google/diff-match-patch
Duplicate: https://github.com/google/diff-match-patch/wiki/Language:-Java
Duplicate: Productization:
Duplicate: Test Perf: Does query to about edges of topicId=='Google Calendar', filtered by last 100 (ordered by timestamp) --> Expensive scan (use explain) or indexed fast query ?
Duplicate: Alt A: Redis stores List key=topicId last 100 (see LTRIM)
Duplicate: Q: How much storage memory required ?
Duplicate:A: 250MB. Assume 20000 distinct Terms --> 5000 after filtering out the long tail (count <= 3) --> each of the 5000 terms has 50 occurs on avg --> each Term-occur requires 1KB with nbr
Duplicate: Online: Terms Processing - Detect Duplicate of new Term occur against the last 100 occur of this Term.
Duplicate: Keep JS code
Duplicate: More Complex
Duplicate: Less Context Sensitive: If a user duplicates-same-nbr a Term (Google) and it also occur in many other user's emails (but not same-nbr) --> last 100 may not be enough
Duplicate:--> other users nbrs push out the duplicated nbrs
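An in-memory sketch of Alt A above (a Redis list per topicId, trimmed to the last 100 occurrences, i.e. LPUSH + LTRIM 0 99); plain arrays stand in for Redis here, and the occur payload shape is an assumption:

```javascript
// In-memory analogue of the Redis Alt A: keep only the last 100
// occurrences per topicId. A real implementation would LPUSH the new
// occur and LTRIM the list, keyed by topicId.
const MAX_OCCURS = 100;
const occursByTopic = new Map();
function pushOccur(topicId, occur) {
  let list = occursByTopic.get(topicId);
  if (!list) { list = []; occursByTopic.set(topicId, list); }
  list.unshift(occur);                                    // LPUSH
  if (list.length > MAX_OCCURS) list.length = MAX_OCCURS; // LTRIM 0 99
  return list;
}
```

This illustrates the caveat in the note: with a global per-topic cap, other users' occurrences of the same term can push a single user's duplicated nbrs out of the window.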
Duplicate: Offline: TopTopics: rank += 1/5 or filterOut for each of the Dup edges
Duplicate: Simpler - similar to today's logic
Duplicate: More Context Aware: Per User / Per Affinity
Duplicate: Dup --> port to Spark Java
Duplicate: sent from my Samsung Galaxy smartphone
Duplicate: Problem: Not enough support from left side (diff 70 chars on left and only 12 are identical)
Duplicate: If diffRatio is near the threshold, but not low enough (0.3) --> return a duplicate probability score (in addition to duplicate=false)
Duplicate:--> require higher count
Duplicate: Problem: left nbr match is very small (12 chars)
Duplicate: In addition to diff score, return also the matched text chunks ('sent from my') from func that compare 2 to func that compares array of N
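A sketch of the soft-threshold idea above: below the hard 0.3 diffRatio threshold it is a duplicate; in a band just above it, return a fractional score so callers can require a higher supporting count. The band width (0.1) and the 0.5 score are assumptions for illustration:

```javascript
// Sketch: soft duplicate decision. HARD_THRESHOLD (0.3) is from the
// note; SOFT_BAND and the 0.5 maybe-duplicate score are assumptions.
const HARD_THRESHOLD = 0.3;
const SOFT_BAND = 0.1;
function duplicateScore(diffRatio) {
  if (diffRatio <= HARD_THRESHOLD) return { duplicate: true, score: 1 };
  if (diffRatio <= HARD_THRESHOLD + SOFT_BAND) {
    return { duplicate: false, score: 0.5 }; // maybe-duplicate: require more count
  }
  return { duplicate: false, score: 0 };
}
```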
Duplicate:Bug: Should not count Duplicates in SP Urls
Duplicate: Note: SP Urls shouldn't occur in generated paragraphs --> so low risk of missing a duplicate.
Duplicate: Ex: projectvenice dist: {"left":0,"right":1,"duplicate":true} inSubject:true sub: RE: Harmon.ie/Project Venice ("Euclid") sync oSub: RE: Harmon.ie/Project Venice sync
Duplicate:Q: Increase inSubject min dupCount to ~ 5 ? --> we want only spammers
Duplicate: maybe should increase threshold > 2 ?
Duplicate: Still catches noise such as 'Industry News'
Duplicate: Do not kill important Topics that repeats in 3,4 emails
Duplicate:: Why moving tpStat.artifacts.push(jArt); changes dup of outlook ?
Duplicate:: Are all topics in sigs bad ?
Duplicate: Detect dups after removing signature --> otherwise 'Product Strategy' (david sig) --> detected as dup
Duplicate:"DONE Duplicate
Duplicate:--------------------
Duplicate: ussubmission (Subject:Contact Us Submission) --> ussubmission in many different threads (Contact is blacklist)
Duplicate:--> but rightNbr is empty (end of subject) --> duplicate = false
Duplicate: Count too short matches as maybe Duplicate (0.5) --> require 6 matches instead of 6.
Duplicate: 9511 Extract neighborhood for each term
Duplicate: Problem: tokens indices are in original tokens array (not in subjectTokens and bodyTokens)
Duplicate: Problem: getNormalizedSubject - how to (re) implement using tokens only ?
Duplicate: return the result of getBody
Duplicate: Pass it to duplicateEnricher
Duplicate: Problem: It contains RE: (need to normalize ) + contains subjectEndToken (need to remove or to stop)
Duplicate: body topics: minOffset
Duplicate: tokens[idxTokenStartBody].characterOffsetBegin
Duplicate: subject topics: maxOffset
Duplicate: tokens[idxSubjectEndToken].characterOffsetEnd
Duplicate: Trim leftNbr in subject using getNormalizedSubject
Duplicate: Problem: tokens cached mode - where to get body from ?
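A possible shape for getNormalizedSubject per the notes above (strip RE:/FW: prefixes, stop at the subject-end marker); '__END_OF_SUBJECT__' is a hypothetical stand-in for the real subjectEndToken:

```javascript
// Sketch of getNormalizedSubject: remove reply/forward prefixes and
// truncate at a subject-end marker. The endToken default is a
// hypothetical placeholder, not the real subjectEndToken value.
function getNormalizedSubject(subject, endToken = '__END_OF_SUBJECT__') {
  let s = subject;
  const endIdx = s.indexOf(endToken);
  if (endIdx >= 0) s = s.slice(0, endIdx);
  // repeatedly strip RE: / FW: / FWD: prefixes, case-insensitive
  while (/^\s*(re|fw|fwd)\s*:\s*/i.test(s)) {
    s = s.replace(/^\s*(re|fw|fwd)\s*:\s*/i, '');
  }
  return s.trim().toLowerCase();
}
```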
Duplicate:
Duplicate: Q/A: Bug: left,right are incorrect --> tokens
Duplicate: token.before sometimes missing (Crash) ?
Duplicate: Ex: Privacy Statement <DB5PR06MB156054DD1CEA4D62644AA606AF330@DB5PR06MB1560.eurprd06.prod.outlook.com>
Duplicate: Drafts: node --inspect-brk extractTerms.js --userDataDbURL mongodb://localhost:27099/collage_new -u "<AM5PR0601MB24347F1674046465748CD223C52C0@AM5PR0601MB2434.eurprd06.prod.outlook.com>" --noLM --noPer
Duplicate:tokens[i].originalText + "---" + body.substr(tokens[i].characterOffsetBegin, 70)
Duplicate:for (let i = 0; i < 184; ++i) { // 184 = tokens.length in this repro; verify offsets match originalText
Duplicate:if (body.substr(tokens[i].characterOffsetBegin, tokens[i].originalText.length) !== tokens[i].originalText) { console.log(i); }
Duplicate:}
Duplicate: writes to terms.occur (array instead of cell level)
Duplicate: Note: Not a blocker, if can mock an array of topic nbrs as an input to getDuplicates in Spark
Duplicate: Duplicate-Subject: Why don't we use ConversationId ?
Duplicate: VIP access:
Duplicate: Bryan Oct-Nov has 67 mails with subject 'VIP access', of which 27 isAutomated and 40 were forwarded or replied
Duplicate: Problem: Sender is not Automated, but Subject was created by automated systems --> therefore very common
Duplicate: Remaining 40 in 12 conversations
Duplicate: Move Reports/java --> Reports/terms_processing/java
Duplicate: npm run collage_stable
Duplicate: Remove branch + remote branch collage_stable_all
Duplicate: run npm_install_all
Duplicate: commit new package.json + .lock file
Duplicate: Add the /path/to/sparse-checkout/repo to <repositories> - see https://devcenter.heroku.com/articles/local-maven-dependencies
Duplicate: Git sparseCheckout the repo from Collage repo topics-job