notes_train.txt
HighLevelTaskRecommendUpdatesforSelectedEmail: Neural Learning-to-Rank (high-level Python lib) - http://docs.deeppavlov.ai/en/latest/components/neural_ranking.html
HighLevelTaskRecommendUpdatesforSelectedEmail: RankLib: Java impl of 8 LTR Algos: MART, RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART, ListNet, Random Forests
HighLevelTaskRecommendUpdatesforSelectedEmail: Elastic Search Plugin (with Tutorial) to integrate RankLib - https://medium.com/@purbon/learning-to-rank-101-5755f2797a3a
HighLevelTaskRecommendUpdatesforSelectedEmail: https://algorithmia.com Play with different algorithms for free
HighLevelTaskRecommendUpdatesforSelectedEmail:shanirevlon
HighLevelTaskRecommendUpdatesforSelectedEmail:dudu460
HighLevelTaskRecommendUpdatesforSelectedEmail:Given Email Body + Rcpts + Metadata (DateTime ...) --> return a ranked list of updates from Collage that are filtered by the extracted topics from selected email.
HighLevelTaskRecommendUpdatesforSelectedEmail:"Gremlin"
Gremlin: findByEmail - Get the collageUserId of user with a specific email
Gremlin:g.V().has('user', 'email', '[email protected]').values('userId')
Gremlin: Top Topics for Yehonathan.
Gremlin: Get vertices with label: 'actor' (not user) that also has a property 'name': 'Yehonathan Sharvit' (can be also 'email')
Gremlin: See has(label, key, value) in http://tinkerpop.apache.org/docs/current/reference/#has-step
Gremlin: filter only mail actors (actors can participate also in SP)
Gremlin:g.V()
Gremlin:.has('system', 'mail')
Gremlin:.outE('performed')
Gremlin:.inV()
Gremlin:.out('about')
Gremlin:.not(has('bad', true))
Gremlin:.as('t')
Gremlin:.groupCount().by(select('t').values('topicId'))
Gremlin:.unfold()
Gremlin:.order().by(values, decr)
Gremlin:.limit(10)
Gremlin: Does property exist ?
Gremlin:g.V().hasLabel('artifact').properties().hasKey('textForTermsExtraction')
Gremlin: within - or clause for has (or query)
Gremlin: choose - conditional ($cond, if)
Gremlin:g.V().hasLabel('person').
Gremlin:choose(<predicate traversal>,
Gremlin:__.in(),
Gremlin:__.out()).values('name')
Gremlin: optional - g.V('vadas').optional(out('knows'))
Gremlin:returns the result of the specified traversal (if vadas has outgoing edge knows - return the target node, else return vadas)
Gremlin: "D:\Collage\gremlin-console\bin\gremlin.bat"
Gremlin: Note: Had to place full path to my java, since it is not installed properly - not related to gremlin console
Gremlin: Connect to localhost:default-port gremlin server
Gremlin::remote connect tinkerpop.server conf/remote.yaml session
Gremlin:
Gremlin: __.out('contains') vs. .out('contains') - the __.out can be used as a static func - without object context
Gremlin: Ex: .coalesce(
Gremlin:__.out('about').has('topic', 'topicId', topicId),
Gremlin:....
Gremlin: gremlin.js example
Gremlin:node graph\gremlin.js --database desktop-eih0hb2 --query "g.V().has('topicId', 'docusign').in('artAbout').not(has('isMarketing', true)).as('art').inE('performed').has('date', gte(1533070800000)).has('date', lte(1535749140000)).outV().not(has('isAutomated', true)).select('art').dedup().values('artifactId')"
Gremlin:node ..\..\graph\gremlin.js --database desktop-eih0hb2 --query "g.V().has('topicId', 'docusign').in('artAbout').not(has('isMarketing', true)).as('art').inE('performed').has('date', gte(1533070800000)).has('date', lte(1535749140000)).outV().not(has('isAutomated', true)).has('email', '[email protected]').select('art').dedup().count()"
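The two gremlin.js invocations above differ only in the date range, an optional sender-email filter, and the terminal step. A small helper can assemble these query strings; `buildTopicUpdatesQuery` is a hypothetical name (not part of the repo), sketched to mirror the examples above:

```javascript
// Hypothetical helper that builds the query strings passed to
// graph/gremlin.js, following the shape of the two examples above.
function buildTopicUpdatesQuery({ topicId, fromMs, toMs, email, terminal }) {
  let q = `g.V().has('topicId', '${topicId}')` +
          `.in('artAbout').not(has('isMarketing', true)).as('art')` +
          `.inE('performed')` +
          `.has('date', gte(${fromMs})).has('date', lte(${toMs}))` +
          `.outV().not(has('isAutomated', true))`;
  if (email) q += `.has('email', '${email}')`;     // restrict to one actor
  return q + `.select('art').dedup().${terminal}`; // e.g. values('artifactId') or count()
}
```

Usage: pass `terminal: "values('artifactId')"` for the artifact list, or `terminal: 'count()'` with an `email` for the per-sender count.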
Gremlin:"Graph"
Gremlin:--------------------------------
Gremlin: artifact is a single email message with in-edges from conversation
Gremlin: Every email thread has a view for each user: not all users see the same artifacts in a single conversation, so every user has their own conversation with outgoing 'contains' edges to artifacts
Gremlin: artifact has out edges to 'topic'
Gremlin: Each artifact
Gremlin:"Email Topics" "Drilldown"
EmailTopicsDrilldown: About edge.isPerson = true for this Email --> Tal topic node.
TopTopicsSprintStories: Roadmap:
TopTopicsSprintStories: Data
TopTopicsSprintStories:: Free edition: How we get data from different orgs / users for analysis and testing ?
TopTopicsSprintStories: Free edition
TopTopicsSprintStories: Privacy: collecting emails.
TopTopicsSprintStories: Reflect potential customers (from another Org). Risk: dealing with a small number of harmon.ie management staff emails may not reflect many types of customers.
TopTopicsSprintStories: Testsets: Construct Expected Topics, Expected Signatures, Expected Duplicates - so we can improve our algorithms and control changes.
TopTopicsSprintStories: Research:
TopTopicsSprintStories: Stream of New Reports and Analytics to explore new research questions, insights and intuitions
TopTopicsSprintStories: Machine Learning - to tune weights of formulas.
TopTopicsSprintStories: We do not want to accumulate large backlog of many months
TopTopicsSprintStories: Product Topic Logic should be factored so that Topic experiments are quick (no docker container restarts, debugger available, no re-index with new update-fetching when logic changes, quick data import from json and mongo)
TopTopicsSprintStories: UI support for Topics (2 months Algo + 2 month backend + UI)
TopTopicsSprintStories: General: We have 3 Main goals in this section
TopTopicsSprintStories: Assist the user in easily and quickly workaround limitations of current Topics collection and ranking (Add / Remove from Top-Topics)
TopTopicsSprintStories: Collect User Feedback to use with other users/orgs --> make Collage more intelligent and accurate.
TopTopicsSprintStories: Group Co-occurring Topics in Algo+UI:
TopTopicsSprintStories: Ex: Ram email thread: subject:RE: [S201804020253445112] looking for these APIs: File, CompressedFile , AddinCommands for Excel. from:[email protected]
TopTopicsSprintStories: This is an important thread for ramt --> good for Top Topics
TopTopicsSprintStories: Ex: Subject:RE: RSPB as a Case Study for Harmon.ie --> both RSPB and Case Study are top10 (davidl) --> but they are not grouped together
TopTopicsSprintStories: Ex: Managed access to Microsoft Graph in Microsoft Azure Preview --> extract 'Managed access' and 'Microsoft Azure Preview'
TopTopicsSprintStories:Setup Managed access for Microsoft Graph in Microsoft Azure Preview step --> 'Setup Managed access' --> 'Managed access'.count -= 1
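The `'Managed access'.count -= 1` adjustment above can be sketched as follows; `adjustNestedCounts` is a hypothetical name, and using substring containment as the nesting test is a simplifying assumption:

```javascript
// Sketch: when a longer extracted term contains a shorter one
// ('Setup Managed access' contains 'Managed access'), subtract the
// longer term's occurrences from the shorter term's count so the same
// text span is not counted twice.
function adjustNestedCounts(counts) {
  const out = { ...counts };
  for (const longer of Object.keys(counts))
    for (const shorter of Object.keys(counts))
      if (longer !== shorter && longer.includes(shorter))
        out[shorter] -= counts[longer];
  return out;
}
```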
TopTopicsSprintStories: Ex: ADF, Group of users, PAM approval, Ravenwood CopyActivity RunId (Noams report Aug 1 - 31) ranked 2, 7, 8, 9
TopTopicsSprintStories: excel, compressedfile, addincommands, apis --> top of Ram Mar-Apr
TopTopicsSprintStories: Duplicated subject (above)
TopTopicsSprintStories: It will take 3 out of 5 places in Top Topics.
TopTopicsSprintStories: Display top topic: Excel without related compressedfile and addincommands, apis doesn't help the user
TopTopicsSprintStories: Calculate Co-Occur (as part of related topics) and Display them in a group in UI.
TopTopicsSprintStories: This gives the user context for Excel
TopTopicsSprintStories: Group Parent (Collage) with its Top Children (Collage Demo, Collage Design)
TopTopicsSprintStories: Display Topic Context
TopTopicsSprintStories: Problem: terms-processing 'Excel authentication' --> 'Excel', because 'authentication' is not a topic --> Excel becomes a too-general Topic that occurs in 2-3 contexts
TopTopicsSprintStories: Store +-3 words around each extracted Term to display in a Hover + DrillDown.
TopTopicsSprintStories: More relevant than per-email, because user has a small number of Top Topics (+ it is hard to rank them)
TopTopicsSprintStories: Note: 'Display Topic Context' feature above - helps user decide quickly if / how bad is this topic
TopTopicsSprintStories: Only Remove from Top Topics (keep as a Topic)
TopTopicsSprintStories: Remove from-sender from TopTopics (ex: [email protected])
TopTopicsSprintStories: Display several other Topics contributed by this sender (First to Top Topics) to help decide if want to remove it.
TopTopicsSprintStories: Suggest/Auto remove certain from-senders -->
TopTopicsSprintStories: Based on other user's feedback that removed this sender.
TopTopicsSprintStories: if they contribute too many top-topics which have a high count but are penalized by our rank (ex: common words, ignored topics, duplicated topics)
TopTopicsSprintStories: Note: They are NOT isMarketingEmail ([email protected]) but we can detect certain factors (mailing-list, repeating duplicate patterns, dictionary of 'support','sales','marketing' ... in their title/email) that make them suspicious
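A minimal sketch of the suspicious-sender factors listed above (mailing-list flag, role words in the title/email); the field names, keyword list, and the or-rule are illustrative assumptions, not the product logic:

```javascript
// Heuristic sketch: flag a sender as suspicious if role words appear in
// the address or title, or if it looks like a mailing list.
const ROLE_WORDS = /\b(support|sales|marketing|noreply|no-reply)\b/i;

function isSuspiciousSender(sender) {
  return ROLE_WORDS.test(sender.email) ||
         ROLE_WORDS.test(sender.title || '') ||
         sender.isMailingList === true;
}
```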
TopTopicsSprintStories: Motivation: Since we will not be 100% in 5-10 Top-Topics rank --> but there will be more real Top-Topics in rank 10-20. Allow user to select / add new Topics to their Top Topics.
TopTopicsSprintStories: Explore / Browse the ranked Top-Topics from rank 5-20
TopTopicsSprintStories: These rank 5-20 topics may be displayed in a 'Tag Cloud' at the bottom (more discoverability, but also noise), or displayed only in the Add-Topic UI
TopTopicsSprintStories: Search Topics (similar to Old Collage), using auto-complete + ranking --> from all extracted Topics
TopTopicsSprintStories: Add New Topic (personal Dict)
TopTopicsSprintStories: Ranking (3-4 months):
TopTopicsSprintStories: One of the short Term Goals: Filter out 'General Terms' (sharepoint, outlook)
TopTopicsSprintStories: Solutions: TFIDF, NLP-Compound, 'Terms Processing' - see 'Average specificity'
TopTopicsSprintStories: TFIDF and stat based methods (2 months)
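The TFIDF idea above, sketched minimally: a term that appears in (nearly) every document org-wide gets an idf near 0, demoting general terms like 'sharepoint' or 'outlook' regardless of their per-user counts. The +1 smoothing is an illustrative choice:

```javascript
// Sketch (not the product code): TF-IDF score for a topic in one
// user's corpus, using org-wide document counts for idf.
function tfidf(termCountInUser, totalTermsInUser, docsWithTerm, totalDocs) {
  const tf = termCountInUser / totalTermsInUser;
  const idf = Math.log((1 + totalDocs) / (1 + docsWithTerm));
  return tf * idf;
}
```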
TopTopicsSprintStories: User Feedback (1 month for algo only - there is also backend-collection and UI)
TopTopicsSprintStories:Problem: There are few filter/search-by-Topic UserSelections (10, 25)
TopTopicsSprintStories: Bootstrap: Nobody uses Collage --> No feedback
TopTopicsSprintStories: Conc: Cannot rely on feedback in Collage UI to help ranking for new Orgs (at least until usage is high enough)
TopTopicsSprintStories: Conc: Still must let user Ignore/Remove from Top-Topics and log Topic Click/Search for later stats - when we can aggregate Affinity users
TopTopicsSprintStories: Ignored Topics
TopTopicsSprintStories: There are ~150 ignoredTopics per management user - could it be because of Demos ?
TopTopicsSprintStories: Add to Graph
TopTopicsSprintStories: Note: Already used in mongo with isTopic = true, but only for this user ignoredTopics - not for other Affinity users
TopTopicsSprintStories: Topic Click/Search - Topic Filter counting + timestamp
TopTopicsSprintStories: Affinity
TopTopicsSprintStories: Add to Graph
TopTopicsSprintStories: Structural - Subject + First part of email
TopTopicsSprintStories: Score inSubject (boolean) + function of closer to start + agg across all updates with this Topic
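The structural score above might look like this: subject occurrences get a fixed boost, body occurrences decay with distance from the start. The constants 2 and 500 are illustrative assumptions:

```javascript
// Sketch: score one topic occurrence by where it appears.
// Per-topic aggregation across updates would sum these scores.
function structuralScore(occ) {
  if (occ.inSubject) return 2;       // subject boost (illustrative)
  return 1 / (1 + occ.offset / 500); // decays with character offset
}
```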
TopTopicsSprintStories: Dictionary Topics - rank higher SP / CRM / Dict (1 week)
TopTopicsSprintStories: Noise Reduction / Cleanup (3 months so far - at least 4.5 months in total)
TopTopicsSprintStories: myContacts stats are per email address --> meaning emailStats are split between all person email addresses.
TopTopicsSprintStories: 'Urls, Files and SP Urls'
TopTopicsSprintStories: Q: Are Titles Top Topics / Drilldown Topics ? Many of the Topics are Director of/VP Marketing/Chief Economist of Wells Fargo/Chairman of Supervisory Board/Head of XXX
TopTopicsSprintStories: Investigate: Count How many times titles appear outside Signature (that we filter out anyway)
TopTopicsSprintStories: Regular expression (after removeSignature) in terms for VP/Director/Senior/...
TopTopicsSprintStories: Note: If we know how to detect Titles --> use it also in signature
TopTopicsSprintStories: PER invalidate Topics (2 weeks - advanced prototype)
TopTopicsSprintStories: Duplicate (1 week)
TopTopicsSprintStories: See 'Remove Duplicates'
TopTopicsSprintStories: Dedup subject, but instead add a boost for subject topics in general + boost if replied a lot of times to emails with this subject.
TopTopicsSprintStories: isMarketingEmail (2 weeks - advanced prototype)
TopTopicsSprintStories: Run without/partial manual filter
TopTopicsSprintStories: Signatures (or non-Topics regex if difficult)
TopTopicsSprintStories: Improve (1 week)
TopTopicsSprintStories: Admin non-Topics dictionary (3 days)
TopTopicsSprintStories: harmon.ie can clean up a new early adopter's Noise and too-General topics (if we decide)
TopTopicsSprintStories: isFocused = true
TopTopicsSprintStories: LOC ?
TopTopicsSprintStories: Terms Processing NLP - PROPN (2 months)
TopTopicsSprintStories: Compound - 'City of Brampton', 'Migration to SharePoint', 'SharePoint authentication' (1 week for fixes, 1.5 months for Terms Processing change)
TopTopicsSprintStories: Bug: Doesn't join bi-gram PROPN+NOUN (even if they are clearly a Topic using Mutual-Information measures + have other occur as PROPN+PROPN)
TopTopicsSprintStories: See ATE / ATR methods below
TopTopicsSprintStories: Currently terms-processing doesn't join those --> only X and Y (X's Y) --> sometimes extracts 'City' --> which is a non-topic
TopTopicsSprintStories: ATE / ATR methods (AutoPhrase)
TopTopicsSprintStories: Filter out non-words better
TopTopicsSprintStories: s201804020253445112
TopTopicsSprintStories: ORG > LOC: We currently filter LOC (SNER: Bradford is LOC but 'City of Brampton' is an ORG - SF). Currently Dict (uses tokens) > LOC filter
TopTopicsSprintStories: Google NER, Investigated other NER systems.
TopTopicsSprintStories: Count
TopTopicsSprintStories: For Top-Topics: Count each update if contained in conversation that has a reply from last 14 days (even if latest reply doesn't mention the topic)
TopTopicsSprintStories: For Drilldown (Top Topics already have high counts)
TopTopicsSprintStories:: Depends on Similarity
TopTopicsSprintStories: Normalization: and <--> &, ltd, co.
TopTopicsSprintStories: Relax exact matching rules:
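A sketch of the normalization above ('and' <--> '&', trailing 'ltd'/'co.'): map spelling variants to one canonical key so their counts can be unified. The rules shown are illustrative, not exhaustive:

```javascript
// Sketch: canonicalize a topic name for count unification.
function normalizeTopic(name) {
  return name
    .toLowerCase()
    .replace(/&/g, 'and')                 // 'A & B' and 'A and B' unify
    .replace(/\b(ltd|co|inc)\.?\s*$/, '') // strip trailing company suffix
    .replace(/\s+/g, ' ')
    .trim();
}
```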
TopTopicsSprintStories: Structured Topics Sources (SP/SF): (2 months)
TopTopicsSprintStories: Dict (or certain Dict type) --> rank higher
TopTopicsSprintStories: SalesForce (SF) - rank higher
TopTopicsSprintStories: SharePoint
TopTopicsSprintStories: /used feed
TopTopicsSprintStories: Problem: Current Follow gets Yaacov all document updates in Sales/Accounts - even if he doesn't care about them --> much less targeted (relevant) than Email in Inbox
TopTopicsSprintStories: Admin UI: Choose Termsets to consider (RSPB bird society) on newUpdates
TopTopicsSprintStories: Admin UI: Import Termset
TopTopicsSprintStories: CRM connector
TopTopicsSprintStories: ADF Linked Service (Copy Activity) - supports SF,Dynamics, ZenDesk ?
TopTopicsSprintStories: Domain mapping - add quality topic we do not have today
TopTopicsSprintStories: P2: Top Topics: SF API - Accounts (topics) user changed recently
TopTopicsSprintStories: Could bypass Stanford PROPN mistakes on company names (many?) - current impl doesn't detect non-PROPN terms
TopTopicsSprintStories: People will not write in email the full long-form of a multi-word topic.
TopTopicsSprintStories: Unify counts of 'ms office' and 'microsoft office' --> better top topics, related topics, LM counts
TopTopicsSprintStories:"Q/A Report - Top Topics"
Q/AReport-TopTopics: Measure Progress at a high level (Top Topics No. 6/10)
Q/AReport-TopTopics: Next: CI
Q/AReport-TopTopics: Diffs, Traceability
Q/AReport-TopTopics: Next: Distribute Reports to Users to collect feedback on expected: top, good and bad topics
Q/AReport-TopTopics:Requirements
Q/AReport-TopTopics:- - - - - - -
Q/AReport-TopTopics: Several users (select)
Q/AReport-TopTopics: Several Date-Ranges (select)
Q/AReport-TopTopics: Summary:
Q/AReport-TopTopics: Preserve analysis comments between reports (copy from prev report)
Q/AReport-TopTopics: Drilldown from summary to details
Q/AReport-TopTopics: Today: Search in editor
Q/AReport-TopTopics: Details include individual algorithms output (ex: duplicate pair)
Q/AReport-TopTopics: Do not rely on console.log at the middle of algorithms, only collection of diag json (duplicate pair) at report.js
Q/AReport-TopTopics: Keep reports history
Q/AReport-TopTopics: Where? git or mongo ?
Q/AReport-TopTopics: Diff of Ranking, factors, individual algorithms output (ex: duplicate changes)
Q/AReport-TopTopics: Summary Scores for Ranking (Average Ranking Measures)
Q/AReport-TopTopics: Expected Good vs. Bad Topic --> Per Org
Q/AReport-TopTopics: Trace changes back to algo code.
Q/AReport-TopTopics: Problem: Dev run report (for regression) on uncommitted changes --> no commit marker
Q/AReport-TopTopics:Reports Impl
Q/AReport-TopTopics: Save Algo reports to calc paths (user+date-range)
Q/AReport-TopTopics: console.log dup,sig --> replace with output collectors
Q/AReport-TopTopics: updates.json (structured + readable where possible)
Q/AReport-TopTopics: Problem: [mongo-storage] - how to capture the query + connecting console.log (start of report)
Q/AReport-TopTopics: Output Summary.txt + all other files
Q/AReport-TopTopics: Serialize line-simple-json (sort by key name - stable diff)
Q/AReport-TopTopics: Per Topic: We already have that in arrTopTopics (except comment): count, rank, factors, comment, jArt - updates array
Q/AReport-TopTopics: Validation: updates.json must include every updateId occurring in the report
Q/AReport-TopTopics: expected: top/good/bad --> Add to summary
Q/AReport-TopTopics: report --commit
Q/AReport-TopTopics: Note: Used from CI (even if there are no changes --> to mark that a certain report status is linked to latest build commit)
Q/AReport-TopTopics:: Deserialize Summary.txt - getFactorsFromText: lm:badTopic --> factors.lm.score < 0 ?
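The 'line-simple-json' serialization above might look like this: each topic on a single JSON line with keys sorted alphabetically, so textual diffs between report versions stay stable even when key insertion order changes:

```javascript
// Sketch: serialize one topic record as a stable single-line JSON.
function toLineJson(obj) {
  const sorted = {};
  for (const k of Object.keys(obj).sort()) sorted[k] = obj[k];
  return JSON.stringify(sorted);
}
```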
Q/AReport-TopTopics:- - - - - - - -
Q/AReport-TopTopics: Bug: Summary report (or topics ranking) doesn't use a stable sort, so a group of topics with the same rank can change positions between runs
Q/AReport-TopTopics: Ex: (davidl, vegas rank : 11) --> change pos without changing rank --> moved : 15 but no diff in report.
Q/AReport-TopTopics: Implement 'Diff Flow' algo to not display moved for topics that were pushed down by a single topic that is now ranked 1st.
Q/AReport-TopTopics: Added moved to metaData
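One way to fix the stable-sort bug above is a deterministic tiebreaker, so equal-rank topics (like the vegas example) cannot swap positions between otherwise-identical reports; breaking ties by topicId is an illustrative choice:

```javascript
// Sketch: order topics by rank ascending, breaking rank ties by topicId
// so the summary ordering is fully deterministic.
function sortTopics(topics) {
  return [...topics].sort((a, b) =>
    a.rank - b.rank || a.topicId.localeCompare(b.topicId));
}
```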
Q/AReport-TopTopics: Expected Backend:
Q/AReport-TopTopics: Central file/db for Expected:
Q/AReport-TopTopics: Problem: If a topic is deleted --> its Expected is also deleted
Q/AReport-TopTopics: Note: When a new report is added --> It is good to have Expected defaults, based on central db.
Q/AReport-TopTopics: Expected UI: 6/10 in Header (Summary + Summary Diff)
Q/AReport-TopTopics: Green/Yellow/Red small markers for Expected that are not in their proper rank
Q/AReport-TopTopics: badTopic in rank 2 --> red
Q/AReport-TopTopics: topTopic rank > 10 --> yellow
Q/AReport-TopTopics: goodTopic (but not topTopic) rank <= 10 --> yellow
Q/AReport-TopTopics: PreProcessing Algo output
Q/AReport-TopTopics: Problem: terms-processing code changed (ex: compound 'City of Brampton')--> 'City' deleted and 'City of Brampton' added in yc top report
Q/AReport-TopTopics:Q: How to trace back to the terms-processing code change ?
Q/AReport-TopTopics: When running new terms-processing --> generate the newDifferentTerms report --> commit it with the code changes
Q/AReport-TopTopics: UI: Each Topic is a link + Each factor is a link
Q/AReport-TopTopics: Q: Drilldown to Factors algorithms output: How to locate duplicates output of a specific topic ?
Q/AReport-TopTopics: Same for sigs
Q/AReport-TopTopics: Details:
Q/AReport-TopTopics: Details for Deleted --> if topic.deleted --> use old SummaryReportId in the call to /detailed
Q/AReport-TopTopics:Highlight finish:
Q/AReport-TopTopics: Bug: Text Search for a string enron, the Terms, plus a lot of emails @enron ...
Q/AReport-TopTopics: Details of Topic + Children (checkbox) - if checked fetches all updates of topic (linkedin) + all its children.
Q/AReport-TopTopics: Highlight should work
Q/AReport-TopTopics: Add From: and optionally more metadata. (click to expand)
Q/AReport-TopTopics:: Bug: Need back twice to return from Details to Summary
Q/AReport-TopTopics: Q: How to render factors ? Simple text for now, but ...
Q/AReport-TopTopics: Q: How outlook.count is 19 but it has 15 detailed artifacts ?
Q/AReport-TopTopics:A: Because sig/per/automated prevent topic.artifacts.push
Q/AReport-TopTopics: Attribution/Traceability of Summary/SummaryDiff to Code changes
Q/AReport-TopTopics: Q: How to insert a comment into Summary that describes the changes ?
Q/AReport-TopTopics: Problem: Changes may be spread across many commits in many repos (sig)
Q/AReport-TopTopics: Cont 1: Problem: Few other fixes that are considered minor --> meaning no new /report:/ comment --> Summary report changes (regression) but still with *same* comment
Q/AReport-TopTopics: Displays same report comment + number of commits since --> if hover comment --> ToolTip with all commit messages since the last /report:/ - inclusive.
Q/AReport-TopTopics: Summary Diff:
Q/AReport-TopTopics: Diff of the 2 commits of reports (Possibly with integration to Git webviewer)
Q/AReport-TopTopics: react-diff-view - renders WebUI for git diff output: https://www.npmjs.com/package/react-diff-view
Q/AReport-TopTopics: Datetime generated, date-range, org + user, change comment (ex: fix sig detector + sha1)
Q/AReport-TopTopics: Aggregate all attrs in a totals line (diff on sum of all counts, sum of dups)
Q/AReport-TopTopics: Only displays the number of deleted/added (at the top) and not which ones were added/deleted - which is important.
Q/AReport-TopTopics: Which diff is it (3 last path components ramt/Apr-18../Summary.txt or commit comments+hash )
Q/AReport-TopTopics: Return in json and display
Q/AReport-TopTopics: Summary View (not Diff): Provide Title in presentation .json
Q/AReport-TopTopics: Stats: At the Top Summary: 2 moved, 3 added, 1 deleted
Q/AReport-TopTopics: metaData : { topicId : { ownCol : true }, rank: .. }
Q/AReport-TopTopics: Diff selection UI:
Q/AReport-TopTopics: Reports menu: Select a user+dateRange --> Display git log
Q/AReport-TopTopics: Mark important commits - by comments (report: or major:)
Q/AReport-TopTopics: Future: Allow Drop only 1 and select the other from reports menu (D&D from reports list ?)
Q/AReport-TopTopics: Dashboard: Aggregate stats from all / group of users
Q/AReport-TopTopics: History Performance Chart
Q/AReport-TopTopics: Q: How to manage the data collected from all Summary stats + expected K/10 ?
Q/AReport-TopTopics: A: Collect it to a history.json file with sha1 + timestamp + all summary stats / expected collected + copy from expected.json the topics and their values
Q/AReport-TopTopics: Problem: Mongo vs. Git: If we saved all summaries in mongo (Key: user+dateRange+sha1 of commit), wouldn't it be a simple query (avoiding duplication in history.json) ?
Q/AReport-TopTopics: Q: What if expected.json is changed ?
Q/AReport-TopTopics: Ex: Add expected topTopic to a term prev had no expected.
Q/AReport-TopTopics: Delete current history.json, replacing it (mutable) with new stats or keep old data for reference (to track changes) ?
Q/AReport-TopTopics: edit expected.json --> commit it --> new sha1 --> new history.json version with special comment (rebuild following expected changes) -->
Q/AReport-TopTopics: Q: What if base data is changed ? Adding emails, change dateRange, add/remove users (Ex: enron)
Q/AReport-TopTopics: Note: Bizportal stock sites do not show history charts beyond a major change (Investment Policy changes ...)
Q/AReport-TopTopics: Multiple Aggregations (queries in mongo stats)--> Multiple History Charts
Q/AReport-TopTopics: First: Name the dataset (harmonie management) and chart it
Q/AReport-TopTopics: When adding enron: Start a new config (Name: harmonie + enron) and start charting it in a new graph.
Q/AReport-TopTopics: Chart comments - If something is changed and we need to continue with this Chart --> Allow rendering comments (maybe from report: commit comments)
Q/AReport-TopTopics: UI:
Q/AReport-TopTopics: Checkbox for each Users + Groups + All Selection (server return data from config.users + userGroups)
Q/AReport-TopTopics: Checkbox for each dateRange
Q/AReport-TopTopics: Impl:
Q/AReport-TopTopics: Save stats at the bottom of each Summary (below the ----- separator, along with topicMetaData)
Q/AReport-TopTopics: ReportsStorage.findStats(user/grp/all, dateRange) - iterates all users and dateRanges (may include several ranges) summary
Q/AReport-TopTopics: Study WebUI
Q/AReport-TopTopics: Table Component with Cell Editing, Filtering, Customizations ...
Q/AReport-TopTopics: Example React+Express App rendering tables and search interface
Q/AReport-TopTopics:https://github.com/fullstackreact/food-lookup-demo/blob/master/client/src/FoodSearch.js
Q/AReport-TopTopics: ReactRouter with back button + params: https://reacttraining.com/react-router/web/example/url-params
Q/AReport-TopTopics: Q: Share Topic Attributes Metadata to allow generic rendering ?
Q/AReport-TopTopics: https://caolan.org/posts/writing_for_node_and_the_browser.html
Q/AReport-TopTopics:"Summary Diff"
Q/AReport-TopTopics:- - - - - - - - -
Q/AReport-TopTopics: Display new Summary + Diff annotations per topic (pos: 72 -> 59)
Q/AReport-TopTopics: Ex: sharepoint rank: 28 count: 19 factors: { fromMe: 11 automated: 2 (+1) reFilt: 2} comment: ranking(tf-idf)
Q/AReport-TopTopics: Problem: Compact json format --> sig is missing if 0 --> cannot differentiate between new-factor and current-zero-factor (sig)
Q/AReport-TopTopics: Keep old topic fields after summary separator (update code to stop parsing for topics at ------------- )
Q/AReport-TopTopics: Impl below moved (diff in position) --> annotate 'moved 14 --> 1'
Q/AReport-TopTopics: Metadata field added/removed
Q/AReport-TopTopics: Added new factor or topic.comments --> ignore in diff if doesn't appear in both (but if rank is different - mention it)
Q/AReport-TopTopics: Impl: Store currentMetadata at the end of summary as a json island. When diff--> read it as oldTopicsMeta --> now we know which
Q/AReport-TopTopics:- - - - - - - - - - - - - - - - - - - - - - - - - -
Q/AReport-TopTopics:- - - - - - - -
Q/AReport-TopTopics:old
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 sharepoint
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:3 office365
Q/AReport-TopTopics:new
Q/AReport-TopTopics:0 url
Q/AReport-TopTopics:1 sharepoint
Q/AReport-TopTopics:2 owa
Q/AReport-TopTopics:0 != 0 sharepoint != url : url 2->0
Q/AReport-TopTopics:0 == 1 sharepoint == sharepoint :
Q/AReport-TopTopics:Complicated 2 moves
Q/AReport-TopTopics:- - - - - - - - - - - -
Q/AReport-TopTopics:old
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 sharepoint
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:3 office365
Q/AReport-TopTopics:new
Q/AReport-TopTopics:- - -
Q/AReport-TopTopics:0 url
Q/AReport-TopTopics:1 owa
Q/AReport-TopTopics:2 sharepoint
Q/AReport-TopTopics:Legend: <old idx> !=/== <new idx>
Q/AReport-TopTopics:0 != 0 sharepoint != url --> edit is not an option --> insert (or moved) url in new --> lookup url in oldTopics --> moved: 2->0
Q/AReport-TopTopics: Note: if not found in oldTopics --> added
Q/AReport-TopTopics:0 != 1 sharepoint != owa --> insert (or moved) owa in new --> lookup owa in oldTopics --> didn't move (1-->1) --> no pos annotation for owa
Q/AReport-TopTopics:0 == 2 sharepoint == sharepoint --> moved 0->2
Q/AReport-TopTopics:owa,url - remaining suffix in old --> lookup in newTopics --> not found --> deleted, else do nothing (added and moved already handled above).
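The walk above, as a sketch (names hypothetical): each topic in the new ranking is looked up in the old one to classify it as added or moved; old topics absent from the new list are deleted; topics at the same index in both lists get no annotation:

```javascript
// Sketch of the ranking diff walk described in the notes above.
function diffRanking(oldTopics, newTopics) {
  const oldIdx = new Map(oldTopics.map((t, i) => [t, i]));
  const newSet = new Set(newTopics);
  const ann = {};
  newTopics.forEach((t, i) => {
    if (!oldIdx.has(t)) ann[t] = 'added';
    else if (oldIdx.get(t) !== i) ann[t] = `moved ${oldIdx.get(t)}->${i}`;
    // same index in both lists -> no annotation
  });
  for (const t of oldTopics) if (!newSet.has(t)) ann[t] = 'deleted';
  return ann;
}
```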
Q/AReport-TopTopics:"DONE Q/A Report"
Q/AReport-TopTopics: Convert childTopics to topicId only, before writing to Detailed.json
Q/AReport-TopTopics: Reports menu
Q/AReport-TopTopics: On click report --> React Router /summary with reportId (same as for /details) --> summary parses the router params --> jsonFetch
Q/AReport-TopTopics: Click to Diff-Prev
Q/AReport-TopTopics: /api/reports --> Recurse fs under 'ReportsRoot' to discover all reports ramt/<DateRange>/Summary.txt
Q/AReport-TopTopics:Bug: David --dontUseDuplicateSubject --> changed dup but doesn't render dup n1-->n2
Q/AReport-TopTopics:Bug: Highlighter doesn't do 'FSCP 2018' --> FSCP%202018
Q/AReport-TopTopics: Copy expected from prevSummary
Q/AReport-TopTopics: highlight diffed attrs (mark in json to highlight a property)
Q/AReport-TopTopics: pos info
Q/AReport-TopTopics: Highlight topic in detailed email (findTopicInText)
Q/AReport-TopTopics: Save jArt.about.text in Detailed.json - used by findTopicInText
Q/AReport-TopTopics: respond with details update.json read from mongo based on joined updatesIds
Q/AReport-TopTopics: use collageUserId (from config) + updateId (otherwise duplicated updateId for several users)
Q/AReport-TopTopics: Convert text concat inside <td> to an array of <span or <a>
Q/AReport-TopTopics: onClick (only <a>):
Q/AReport-TopTopics:<a href="#link" onClick={(e) => this.handleSort(e, 'myParam')}>
Q/AReport-TopTopics:handleSort = (e, param) => {
Q/AReport-TopTopics:  e.preventDefault();
Q/AReport-TopTopics:  console.log('Sorting by: ' + param);
Q/AReport-TopTopics:}
Q/AReport-TopTopics: Q: Dynamically create a react onClick event to a local function (with the reportPartId, topic,key as parameters)
Q/AReport-TopTopics: Nav (Router) to a DetailedComponent --> fetch emails from server updates.json (Params: reportPartId, topic[key])
Q/AReport-TopTopics: Round float to 2 digits
Q/AReport-TopTopics:A: Only rank is a float
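Q/AReport-TopTopics: A minimal way to round the rank (the only float) to 2 digits while keeping it numeric - a sketch, since toFixed would return a string:

```javascript
// Round a float to 2 digits, keeping a Number (toFixed returns a string).
const round2 = (x) => Math.round(x * 100) / 100;
```

e.g. round2(3.14159) gives 3.14.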
Q/AReport-TopTopics: Remove comments (from diff UI) if null (delete it in getSummaryView if null)
Q/AReport-TopTopics: A: webviewer
Q/AReport-TopTopics: Note: Textual diff with git default diff util (we can also customize it in the future)
Q/AReport-TopTopics: Cmdline (shortcut icon + explorer shell menu)
Q/AReport-TopTopics: calls git log --> display selection list to specify 2 reports versions (opaque: sha1) --> output txt or csv (file name - containing the 2 reports user+dates+change_comment)--> editor refresh
Q/AReport-TopTopics: If passed 2 commit hashes - no need to display
Q/AReport-TopTopics:    Specify a single report identifier --> compare latest report of this user to this report (unless the latest report was incorrectly specified)
Q/AReport-TopTopics: summaryDiffRoute returns presentation JSON (with metadata)
Q/AReport-TopTopics: Problems of textual Diff
Q/AReport-TopTopics: Solution Alt: Textual Report on which have moved/added/deleted + their factors diff
Q/AReport-TopTopics:    Problem: New topic reaches no. 1 --> actual diff is small (ex: bod moved from 15->1) --> all others are affected -1 in rankpos --> large noisy diff report ?
Q/AReport-TopTopics: Diff Flow algo below
Q/AReport-TopTopics: Added/Deleted
Q/AReport-TopTopics: Problem: It doesn't help analysis if a topic is deleted and we cannot see its new (low ranking) factors. Same for added: We want to review its prev factors
Q/AReport-TopTopics: topTopics cutoff at 100 --> 200
Q/AReport-TopTopics: Moved Ranking Measure - for all topics in a summary.
Q/AReport-TopTopics:      Q: Do we need Expected good/bad for this to work ?
Q/AReport-TopTopics: Q: If a good topic (ex: bod) moved from 15->1 --> Inc goodness ranking measure -->
Q/AReport-TopTopics: Problem: The current 10 toptopics (say all are good) were all moved down a pos --> is it a penalty to ranking measure ?
Q/AReport-TopTopics: Analysis Comments
Q/AReport-TopTopics: A: Simple: Edit Summary.txt and commit
Q/AReport-TopTopics: A: During report generation - reports.js calls git package to extract the prev comments and add
Q/AReport-TopTopics: Deserialize Summary.txt
Q/AReport-TopTopics:    Goal: Parse an array of nested json objects from the middle of a text file. Each obj is one line
Q/AReport-TopTopics: Bug: addFldsCommas: incorrectly adds a comma before first fld in the nested factors : { ,fromMe: 1, }
Q/AReport-TopTopics: pass regex to start line (or start after line ------- ) --> lib parses Summary.txt and decides where to Start
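Q/AReport-TopTopics: A rough sketch of the deserializer described above (parseSummary and the start-marker regex are illustrative names, not the real lib; it also assumes each line is already valid JSON, i.e. after the addFldsCommas normalization):

```javascript
// Sketch: parse an array of one-per-line JSON objects from the middle
// of a text file, starting after the first line matching startRe
// (assumes the marker line exists in the file).
function parseSummary(text, startRe) {
  const lines = text.split('\n');
  const startIdx = lines.findIndex((line) => startRe.test(line));
  const objs = [];
  for (const line of lines.slice(startIdx + 1)) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('{')) break; // end of the object block
    objs.push(JSON.parse(trimmed));
  }
  return objs;
}
```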
Q/AReport-TopTopics: Config -
Q/AReport-TopTopics: Ex: Run Mar-Apr on david,ram
Q/AReport-TopTopics: Several Date-Ranges (select named periods mar-apr, )
Q/AReport-TopTopics: Database to get updates and models from (future: updates from Graph?)
Q/AReport-TopTopics: dontUseXXX - ablation tests: compare reports turning off some of the algorithms
Q/AReport-TopTopics: Validation for config/cmd-line selections
Q/AReport-TopTopics: Write reports (ReportsStorage) based on selection user+date-range
Q/AReport-TopTopics: Problem: [mongo-storage] - how to capture the query + connecting console.log (start of report) ?
Q/AReport-TopTopics: The report is not piped > file.txt anymore --> console.log will output to screen (not collected)
Q/AReport-TopTopics: report.js will explicitly collect diag return values (also from mongoStorage getUpdates)
Q/AReport-TopTopics: Q: Where do we keep history ?
Q/AReport-TopTopics: Detailed Report - first is the Summary
Q/AReport-TopTopics: Q: Diff - Machine Readable ?
Q/AReport-TopTopics: Q: How to find prev report for latest Diff ?
Q/AReport-TopTopics: A: Timestamp
Q/AReport-TopTopics: Q: Build CI machine ?
Q/AReport-TopTopics: Problem: Q: 20000 files repo each one 7MB ?
Q/AReport-TopTopics: Summary + details pointers updateIds + their factors
Q/AReport-TopTopics: updates.json for this report (formatted?)
Q/AReport-TopTopics:Problem: Terms change (Compound) --> updates.json changes
Q/AReport-TopTopics: terms.json separate file (treat terms as algo output)
Q/AReport-TopTopics:"Language Model"
LanguageModel:Productization
LanguageModel: If total < MIN_STAT (as we do in PER) --> return score=0.5 --> cannot disqualify it.
LanguageModel: mongo_diff
LanguageModel:cd D:\views\Collage.Topics\Reports\helpers
LanguageModel:node --max_old_space_size=3500000 mongo_diff.js --collectionOld languagemodel --dbUrlOld mongodb://localhost:27099/collage_new --collectionNew languagemodel --dbUrlNew mongodb://localhost:27017/collage --key gName > output\lm_collage_new_vs_prod.txt
LanguageModel: Report diffs:
LanguageModel:    Q: How come microsoftteams was changed from m: badLMTopic->undefined but its rank did not change ?
LanguageModel: Q: How come microsoftteams was badLMTopic, when it is a bigram ?
LanguageModel: Move expectedTopicsTest + main --> unitTest in __tests__ --> create the output output/expectedTopicsTest.json --> file commit to git (to see track changes when changed)
LanguageModel: gName: 'london ae candidate' - total : 3 (new) and 17 (old - Copy_of_languagemodels)
LanguageModel: A: Seems new code is correct.
LanguageModel: extractTerms node process doesn't exit --> --noLM doesn't repro --> Which promise in LM / framework ?
LanguageModel: lmConversations bug ?:
LanguageModel: There are few added and many
LanguageModel: 'connecting software' bigram --> old total: 2 (correct), new total: 4
LanguageModel: Correct behavior ()
LanguageModel: lmConversations count: 10103 (same as inmem-norm-subject set)
LanguageModel: Re
LanguageModel:      On second run, the diff is much smaller - all diff tokens are common subject tokens (re, fw, re :)
LanguageModel: see Diff D:\views\Collage.Topics\Reports\batchExtractor\outputs\diff_new_old_lm_samecount_798497_gte_2.txt
LanguageModel: Ex: the new returned the same number of gNames as old (798497_gte_2) + 'connecting software' bigram --> total Now changed 4-->2 (correct)
LanguageModel: eslint config for all Collage.Topics/Reports - as was done for terms-processing
LanguageModel:gName --> id (filter in updateRecs)
LanguageModel: getGramsLMScore - adjust to new schema if needed.
LanguageModel: deleteMany - copy to GenericStorage
LanguageModel: basic connect and incremental update stats.
LanguageModel: Q: Keep ? and !
LanguageModel: Delete total : 1 && updateAt < 2 weeks ago.
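LanguageModel: The delete rule above could map to a deleteMany filter along these lines (buildGcFilter and the updatedAt field name are assumptions from these notes):

```javascript
// Sketch of the garbage-collection filter: total of 1 and not
// updated in the last two weeks. Field names are assumed.
function buildGcFilter(now = Date.now()) {
  const twoWeeksMs = 14 * 24 * 60 * 60 * 1000;
  return { total: 1, updatedAt: { $lt: new Date(now - twoWeeksMs) } };
}
// e.g. await langModelStorage.deleteMany(buildGcFilter());
```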
LanguageModel: extractTerms report-log --> include toggled lmBadTopic (at the updateId when they change) - same info for enriched.
LanguageModel: No point including all tokens that have their stat change (too much log output)
LanguageModel: Redis-Lua
LanguageModel: After Terms-Processing --> send alpha-numeric tokens (not Terms) to Redis-Lua + idxTokenStartBody
LanguageModel: Lookup ConversationId (see below) --> ignore Subject tokens if found
LanguageModel: Update each token stats
LanguageModel: Problem: Online: Do not count again in the same duplicate subject (same email-thread)
LanguageModel: Store a Set of seen ConversationIds in Redis (separate from LangModel)
LanguageModel: If Race --> doesn't matter --> another worker has updated a gram from the subject of same Conversation AFTER current worker started processing it -->
LanguageModel: Impact: gram stat is +1 or +2 from correct count
LanguageModel: Clear ConversationId Set when extractTerms starts: require('langModel') --> langModel.init, unless -u <updateId> flag (incremental update)
LanguageModel: Note: only Clear at start and Update if --save and No flag --dontUseLangModel
LanguageModel: Opt: Conversation Affinity - all updates from same conv are routed to same Online worker.
LanguageModel: Lua to incr stats
LanguageModel: Do not filter out Duplicate
LanguageModel: We cannot calc Duplicate for every unigram (all tokens - not only Terms)
LanguageModel: Do not update grams inside Url --> since they are not a real eng language
LanguageModel: Depends on other features - not estimated here
LanguageModel: Opt: single word (filter out non-word tokens): No need for bi-grams as they are not currently used.
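LanguageModel: An in-memory sketch of the online counting with the seen-ConversationId guard described above (the real version keeps both structures in Redis and does the increments in a Lua script; all names here are illustrative):

```javascript
// In-memory stand-in for the Redis structures described above.
const seenConversations = new Set(); // ConversationIds already counted
const gramStats = new Map();         // token -> count

function countUpdate({ conversationId, subjectTokens, bodyTokens }) {
  // Ignore Subject tokens if this conversation was already seen
  // (do not count the same duplicated subject twice per email-thread).
  const dupSubject = seenConversations.has(conversationId);
  seenConversations.add(conversationId);
  const tokens = dupSubject ? bodyTokens : subjectTokens.concat(bodyTokens);
  for (const t of tokens) gramStats.set(t, (gramStats.get(t) || 0) + 1);
}
```

Under a worker race this guard can over-count a subject gram by +1 or +2, matching the tolerated impact noted above.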
LanguageModel:: Daily:
LanguageModel: Reads from Redis all dirty grams (stats changed but not yet processed)
LanguageModel: Calc bad/good : langModel.getGramsLMScore(<array of dirty grams>)
LanguageModel: If toggle (ex: good->bad) --> Update Topic node Graph. lmBadTopic = true.
LanguageModel: TopTopics: uses the lmBadTopic from Graph to filter out / discount.
LanguageModel: Cleanup sources:
LanguageModel:    Subject - should we use it ?
LanguageModel: If Dup, Urls/Emails/Paths , Sig (lali - keep) --> do not use it for language model, because it is not an english sentence.
LanguageModel: Lali: Test very short sentence (Thanx,) ?
LanguageModel: Do not disqualify bi-gram using lower/upper (currently only can score < 0 unigrams):
LanguageModel: A/AN Bug: a Sharepoint file/guy/conference/migration project --> high count of 'a' before unigram 'SharePoint' --> when we need to count 'a' before 'SharePoint File'
LanguageModel: Possible unigram - if we can determine it is a unigram and not part of bi-grams (a sharepoint <something>) --> maybe the a/an rule does work for those cases
LanguageModel: Rerun the report with a/an on (some threshold) --> examine diff - for unigrams (that we manually know are unigrams in expectedTopics).
LanguageModel:      Note: Not even sure we like to rule-out 'SharePoint Migration' as a Topic - it is not a strict proper noun
LanguageModel: Problem: 'management buyout', 'proxy configuration' in upper and lower (it occurs almost only in lower) --> same Topic !
LanguageModel: If 'management' appears as a prefix of several good (bad) compound topics --> score higher (lower) compound topics starting with 'management'.
LanguageModel: good / bad will be computed from other source - such as ignored topics
LanguageModel: Con (of not using lm for bi-grams): Bigrams that could be disqualified based on lowerRatio : We do not disqualify now: 'web page', 'user name', 'proxy configuration', 'online meetings', 'digital marketing', 'direct line' (not confident),
LanguageModel: How to disqualify bi-grams in the future ?
LanguageModel:Q: Only disqualify Topics which are common language phrase (e.g 'web page', 'user name', 'on board') - can use Google NGrams + Email stat
LanguageModel:Q: In bi-gram
LanguageModel:    Conc: Cannot disqualify bi-gram based only on high web freq, because 'Board of Directors' has high freq (maybe it is not a real PROPN Topic - general term)
LanguageModel: badTopicsCorrect bi-grams: web page, user name, practical guide, online meetings, digital marketing, direct line
LanguageModel: * management buyout - goodTopicsIncorrect
LanguageModel: total: 23,allLower: 20,startsUpper: 3,startsUpperFirstInSentence: 2,afterMidUpperSU: 1, afterPos: 0,afterPosSU: 0, afterAn: 5,
LanguageModel: Difficult Topic, since in body, it appears almost always lowercase (Subject in uppercase, but current LM remove duplicate subjects)
LanguageModel: Subject -
LanguageModel: LM remove duplicate subjects --> count of Upper from the thread is 1, but still count the lowercase from the bodies of these emails.
LanguageModel: We need to weight subject
LanguageModel: When lowerRatio shouldn't be used ?
LanguageModel: Problem: same ngram is used lower and upper interchangeably in the same context (wordvectors) or lower and upper co-occur.
LanguageModel: WordVectors: Both lower and upper forms occur around the same words context --> they are the same
LanguageModel: Consider large window (not 5 words, but the whole email)
LanguageModel: board, teams - goodTopicsIncorrect. lowerCaseRatio: 0.90476
LanguageModel: After Lali change lowerCaseRatio >= 0.8 --> certain badTopic --> we lost Board
LanguageModel: Many of the lowercase are actually the Topic 'Board' - (Stanford: only Upper are NNP, lower - NN) --> surrounded by same words as the upper.
LanguageModel: Run board.context also for lower - to prove that.
LanguageModel: Does board (lowercase) appears in same context as 'Board' (Ex: board of directors, ) --> some board will go to Compound (board meeting)
LanguageModel:and some will be in same context as Board --> meaning they are the same and board is a topic.
LanguageModel:decrease board unigram count.
LanguageModel: Without afterMidUpperSU - board and teams are correct (good)
LanguageModel: Short-form (Similarity): If 'Board' co-occur 'Management Board'
LanguageModel:      Batch getLMScore --> fast !
LanguageModel:      Print standalone 'Teams' occurrences - is it enough we have several dozens of these (maybe from diverse sources/authors) to declare it as a Topic - without counting ratio ?
LanguageModel: Problem: some badTopics will pass:
LanguageModel: 'view' - startsUpper: 3188 - startsUpperFirstInSentence : 2042 - afterMidUpperSU : 224 > 800 legit-startsUpper --> will make it a good topic
LanguageModel: 'register' - 250 legit-startsUpper
LanguageModel: 'below'
LanguageModel: Use afterMidUpperSU in score.
LanguageModel:      Q: Does Dups resolve that ?
LanguageModel: allLower: 93, startsUpper: 60 (only 3 of the 60 in sentenceStart)
LanguageModel: many counts within other Topics: 'User Story', 'Science of a Story'
LanguageModel: Report:
LanguageModel: Add Counter Inbox vs sentItems, isFocused vs non-Focused (to report only)
LanguageModel: Add score -0.5
LanguageModel: Add measure for confidence - correct-confident incorrect-confident
LanguageModel:    Note: When adding new factors (afterMidUpperSU) we want to fix incorrects, but also have the classifier less sensitive --> less 0.5,-0.5 and more -1,1
LanguageModel: Corrl: We want correlation between confidence and correctness (also divided to good and bad)
LanguageModel: Export LM and mailUpdates --> Yheonathan - New Train/Test Set of many Good and Bad Topics
LanguageModel: ML scikit-learn model to separate Good vs. Bad Topic based on LangModel features
LanguageModel: 5000-freq words - tie breaker if 1.2 > lowerCaseRatio > 0.8
LanguageModel: Buy the 100,000 list with n-grams
LanguageModel:: Duplicates: We want to count unique Linguistic contexts (sentences)
LanguageModel: Dups:
LanguageModel: Extension: 701 - General\r\nGrasshopper #: (Voice Mail generated )
LanguageModel: Titles: Assistant General Counsel, General Partner/Manager
LanguageModel: Simple (Exact match duplicate) unigram cur token --> If tri-gram from prev to next token already exists in model --> this unigram token is duplicated
LanguageModel: Q: If tri-gram 'User Story' is a topic and occur 5 times - should we count its unigram 'Story' as duplicated ?
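LanguageModel: The exact-match duplicate test above can be sketched like this (seenTrigrams and the in-memory Set are assumptions; occurrences at the text edges are ignored here):

```javascript
// Sketch: a unigram occurrence at position i counts as duplicated if
// the (prev, cur, next) trigram around it was already seen in the model.
const seenTrigrams = new Set();
function isDuplicatedOccurrence(tokens, i) {
  // assumes 0 < i < tokens.length - 1
  const tri = `${tokens[i - 1]} ${tokens[i]} ${tokens[i + 1]}`;
  if (seenTrigrams.has(tri)) return true;
  seenTrigrams.add(tri);
  return false;
}
```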
LanguageModel: Q/A:
LanguageModel: Ram + Yaacov
LanguageModel: Unit Tests
LanguageModel: Expected - *Good* and Bad Topics
LanguageModel:: Q: Should we count words we want to eliminate from topics
LanguageModel: duplicate-subjects
LanguageModel: const helpers = require('./nlp-helpers');
LanguageModel: Problem: How to store the nbrs for each token ? There are many such nbrs
LanguageModel: Terms will have their nbrs tracked (future terms-processing)
LanguageModel: signature
LanguageModel: marketing emails
LanguageModel: Compare lowercase/upper case with simple common-words (5000 freq-words) lookup
LanguageModel:: Paging - maybe will not be able to read large amounts of text into memory
LanguageModel:LM conclusions and problematic Topics
LanguageModel:- - - - - - - - - - - - - - - - - - -
LanguageModel: Disqualify bi-grams ? (Currently only disqualify unigrams)
LanguageModel: Problem: 'management buyout', 'proxy configuration' in upper and lower (it occurs almost only in lower) --> same Topic
LanguageModel: Problem: common words (build,word) which are (when NNP capitalized) - are topics in harmon.ie email context.
LanguageModel:    Org Dictionary should take precedence over LanguageModel - Word, Workplace - Topics which are also common words in LM
LanguageModel: contexual (surrounding terms in email) Wikipedia popular Entities can also serve as generic Dictionary --> in context (similar to opencalais Social Tags)
LanguageModel: When lowerRatio shouldn't be used ? (see above)
LanguageModel: Problem: same ngram is used lower and upper interchangeably in the same context (wordvectors) or lower and upper co-occur.
LanguageModel: allLower / (startsUpper - startsUpperFirstInSentence - afterMidUpperSU)
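LanguageModel: The ratio above, sketched with the 'management buyout' counts from these notes (clamping the denominator to 1 is an assumption to avoid division by zero; the 0.8 cutoff is the lowerCaseRatio threshold mentioned below):

```javascript
// lowerRatio = allLower / (startsUpper - startsUpperFirstInSentence - afterMidUpperSU)
// Denominator clamped to 1 (assumption) so the ratio stays finite.
function lowerRatio(s) {
  const legitStartsUpper =
    s.startsUpper - s.startsUpperFirstInSentence - (s.afterMidUpperSU || 0);
  return s.allLower / Math.max(legitStartsUpper, 1);
}
const isCertainBadTopic = (s) => lowerRatio(s) >= 0.8;

// 'management buyout' (goodTopicsIncorrect): almost always lowercase in bodies
const mb = { total: 23, allLower: 20, startsUpper: 3,
             startsUpperFirstInSentence: 2, afterMidUpperSU: 1 };
```

lowerRatio(mb) is 20/1 = 20, far above 0.8, so this good topic gets flagged bad - exactly the 'management buyout' failure discussed above.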
LanguageModel:total: 346,
LanguageModel:notFound: 71,
LanguageModel:lowStats: 32,
LanguageModel:correct: 187 --> 191,
LanguageModel:badTopicsCorrect: 61 --> 67,
LanguageModel:badTopicsIncorrect: 47 --> 41,
LanguageModel:goodTopicsCorrect: 126 --> 124,
LanguageModel:goodTopicsIncorrect: 9 --> 11
LanguageModel:    6 badTopics were gained (badTopicsCorrect) + confidence of lowerRatio is much stronger.
LanguageModel: 3 good topics were lost
LanguageModel: board - Management Board, Advisory Board, Endgame Board (company name), Job Board.
LanguageModel: teams - Microsoft Teams
LanguageModel: Conc: While 'Teams' is a short-form for Microsoft Teams, it is not so evident from lowerRatio stats, as 'Teams' appears ~ 823 - 9 - 582 as upper and 573 as lower.
LanguageModel: explorer - Internet Explorer, IBM File Explorer
LanguageModel: notFound: 107/349 - do not appear in 8500 mails --> 71/349 do not appear in 20000 emails (4 users 9/17-5/18)
LanguageModel: We can say that if a token doesn't occur in email at all, we do not need LM to decide if it is a bad/good topic.
LanguageModel: A: A problem with our test-topics (not in LM)
LanguageModel: Ex: ticket id, disa tem .... - most of them are junk - we should cleanup.
LanguageModel: We can increase the number of detected good topics (maybe contribute to ranking)
LanguageModel: Right now, we can only disqualify badTopics --> so it doesn't help much to inc the number of good topics.
LanguageModel: Filter out isFocused : true --> notFound += 71 --> 91 most of them Good Topics
LanguageModel: Stats went down (106 GoodCorrect / 9 GoodIncorrect ), but in reality, it still identifies Good and Bad Topics the same.
LanguageModel: Conc: Since most isFocused : false emails are marketing emails --> filtering better isMarketingEmail from LM (as opposed to TopTopics report) --> will not help much.
LanguageModel:    Only if 1.1 < lowerRatio < 1.5 --> lookup freqWords list
LanguageModel: Only fixed 2 incorrectBadTopics - ok and notice.
LanguageModel: Consider if we want to risk using it if it helps so little (2 fixed/348)
LanguageModel: general - Dups, isMarketingEmail/generated email (see below)
LanguageModel: view - isMarketingEmail
LanguageModel: Ex: Business Intelligence,Artificial Intelligence, 2nd Intelligence Analytics Summit
LanguageModel: Correctly detected Bad Topics, which are borderline (sensitive)
LanguageModel: network - sub-topic - appears a lot in upper inside 'C-Suite Networks' - similar to 'User Story'
LanguageModel: Remove Sub-Topics (cx network - 50, c-suite network - 36)
LanguageModel: Good Topics incorrectly (detected as Bad)
LanguageModel: 'klipse' - more lower than upper (9 total)
LanguageModel: Filter out urls (http://blog.klipse.tech/assets/yehonathan_profile.jpeg)
LanguageModel: 'word' - 'Word' as a shortcut for MS Word vs. word (high lowerRatio)
LanguageModel:
LanguageModel: Org Dictionary + contextual Wikipedia (Social Tags)
LanguageModel: 'build' - the Build Conference vs. the verb to-build
LanguageModel: Is a common english word (position 409/5000)
LanguageModel: 1300 (allLower) / 622 (startsUpper - startsUpperFirstInSentence)
LanguageModel:      Conc: build is context sensitive, mostly not a topic but sometimes (Conference) is.
LanguageModel: 'groups' - 321/130
LanguageModel: Note: Stanford thinks all occurrences are NNS (PROPN) ! even those in lower case ('Meetup groups').
LanguageModel:    'harmon.ie mobile' - appears 32 times, but doesn't start with Upper (hence not an LM Topic ) - harmon.ie Mobile/harmon.ie mobile
LanguageModel: 'ios' - total: 463, allLower:13, startsUpper: 7, startsUpperFirstInSentence: 3
LanguageModel:      'iOS' - not all lower, but doesn't start with upper either.
LanguageModel: 'machine learning' - lower 164 / upper 110. It is used a lot in a middle of a sentence with lower.
LanguageModel:    'deep work' - 34/(32-3)=1.17 lowerRatio total: 66, allLower : 34, startsUpper : 32, startsUpperFirstInSentence: 3,
LanguageModel:      Starts with upper (SU) > 2 times after a, an or possessive --> makes it badTopic
LanguageModel: Non LM issues:
LanguageModel: english - the lang-name (English) is a real PROPN
LanguageModel: Ex: excellent English skills, English message follows, a typo in English in
LanguageModel: president - 'total': 260, 'allLower': 35,'startsUpper': 225,'startsUpperFirstInSentence': 21,
LanguageModel: President (by itself) may not be a topic (may be in some particular context), but it does appear almost always in Upper (part of a title)
LanguageModel:      Q: Can we count grams that usually do not stand alone (in Upper) --> conclude President or VP are too general ?
LanguageModel: Note: Different from 'Network' and 'Story' in that 'President' is not preceded by Upper
LanguageModel: vp - title. not a real topic (by itself), but always upper.
LanguageModel: Similar to president above.
LanguageModel: Good Topics that are currently correct, but problematic (close to boundary - model parameter sensitivity implies problems with the model)
LanguageModel:"DONE LM
LanguageModel: Kickoff presentation
LanguageModel:Yehonathan reports.js
LanguageModel: Change --prod
LanguageModel: Use topics collection
LanguageModel: Refactor: Move /5000_most_frequent_english_words_lemmas.csv --> isNonPerson (email-util)
LanguageModel: Cleanup at End: disconnect from DB:
LanguageModel: Change every extractor/update/enricher to be an object (not a function) containing .extract/.update/.enrich + .cleanup
LanguageModel: Iterate modules keys (extractors) --> concat all arrays of modules --> call cleanup (Promise.all style)
LanguageModel: at each cleanup:"await langModelStorage.disconnect();
LanguageModel: lmConversations - is updated even without --save (because otherwise it will extract from duplicate conversation subjects)
LanguageModel: Ops: When deleting languagemodels collection need also to delete lmConversations
LanguageModel:: -t '<text>' - Add extractors and enrichers to output
LanguageModel:    Postpone: Only LM requires pure-text. The others require metadata. -u is a good workaround
LanguageModel: Problem: PER extractor requires rcpts --> extractPersonStats({ artifact }) --> but artifact is undefined
LanguageModel: <text> --> tokens --> passed to personExtractor
LanguageModel: --include/--exclude --> will also affect -t
LanguageModel: Tests and Q/A:
LanguageModel: First run and debug of undefined stuff (mongoUrl, extractor is not a function ...)
LanguageModel: Collection: Updater will corrupt original languageModels collection (because GenStorage is not yet impl) --> Copy_languageModels
LanguageModel: Compare results with current langModels collection (even several manually + some aggregate stats) --> ensure 0 diff
LanguageModel: Found { gName: /missing followed sites/} - new total : 110, old total : 80
LanguageModel: { gName: 'support request' } - total old: 3 new:5
LanguageModel: mongo_diff - change query
LanguageModel:diff = await diff2Collections({ ...options, queryNew : { total : { $gte : 2} } });
LanguageModel: New has fewer recs: 318,052 { total : { $gte : 2} }, old: 819,336 (after garbageCollect removed total < 2)
LanguageModel: Problem: Same 1 day query: Still 4721 (new - { total : { $gte : 2} }) vs. 4705 old
LanguageModel: Create diff in single update -u <> (in mongo term, then revert)
LanguageModel: maybeSave is passed logStr by value --> need byRef so it can += append to log inside
LanguageModel: Bug: Super Slow updateRecChunks
LanguageModel: Seems to be fixed by Yehonathan commit that eliminate the memory leak (results.push<everything>)
LanguageModel: Index ? Is it a function of langmodels collection size ?
LanguageModel: Test: Try new extractTerms - with langmodels already filled.
LanguageModel:    Memory ? Why does the extractTerms process take 2 GB when it is paging + online ?
LanguageModel: Test: --noPer --noLM --save: Does it still take so much memory ?
LanguageModel: If LM - try without bulk-update (without --save)
LanguageModel: If updateRecChunks --> try append instead (a different write DB api)
LanguageModel:      Mongoose taking all memory ? Replace it with another lib ?
LanguageModel: Test: Try --max_old_space_size=3500000
LanguageModel: A: Very fast: most updateRecChunks takes < 25 ms but sometimes it pauses a little (updateRecsChunk durationMs: 5189)
LanguageModel: ConversationId: genStorage = require --> new GenericStorage
LanguageModel: Q/A: tokens are mutated by lm countStats --> normText --> written to update.token.allLower, startsUpper ... - bug
LanguageModel: .map all tokens to tokenX at the beginning of countStats --> then use tokenX array instead of tokens array.
LanguageModel: termsDiff: Remove console.log(`******* updated
LanguageModel: Change it to deep (but efficient)
LanguageModel: diff JSON.stringify of 2 tokens / terms --> meaning very exact diff (Even order of keys matters !!!)
LanguageModel: Note: Term now includes nested occur - array of objects
LanguageModel: Bug: Why 'Terms Review' newItem doesn't have occurs[0].insideBrackets ?
LanguageModel: Create topics collection and use it in Mongo-storage-worker
LanguageModel: --noPER --noLM
LanguageModel: Motivation: Test in isolation only the part you work on (saves time and doesn't change DBs). Also turn off temp-broken code.
LanguageModel: Default: Terms + All algos (we change it if something breaks and we need to disable)
LanguageModel: Q/A: --save: If not --save --> do not write to mongo. If --save: need to delete the whole myContacts collection
LanguageModel: mongoDB config (options ?)
LanguageModel: Change in Enricher and Updater
LanguageModel: --save is broken.
LanguageModel: fixed maybeSave(options.save, ... ) --> maybeSave(options, ... )
LanguageModel: Fix broken unitTest of termsExtractor
LanguageModel: Extractor: countStats
LanguageModel: Predictor: getGramsLMScore
LanguageModel:    A: Incremental mongodb stat updates $inc --> no need --> increased memory to 3.5GB for now.
LanguageModel: Test with few records query
LanguageModel:Debug:Restore pageSize --> 5000
LanguageModel: updateRecs - $inc
LanguageModel: garbageCollectLowStatsFromDb
LanguageModel: Bug: automatic: total 130 in new (.tokens) and total 96 in prev langModel - how ?
LanguageModel:A: Counted all 'tokens.originalText' : /^automatic$/i --> exactly 130, while if counting old bodyTokens + subjectTokens /^automatic$/i --> 97 (~96)
LanguageModel: Use in reports.js
LanguageModel: adjustTopicScore --> nlp-helpers: Lookup normText(topicText) --> If num bads > num goods (Upper...) -->
LanguageModel:    Add to Top Topics Summary: dup,sig,lm-bad/good/lowstat/nf
LanguageModel: 20000 mailUpdates --> new LM.
LanguageModel:    A: Incremental mongodb stat updates $inc --> no need --> increased memory to 3.5GB for now.
LanguageModel: Change test3 barrier to 8 months --> delete tokens + refresh --> import more mailUpdates for LM.
LanguageModel:1003BFFDA43EFAAA
LanguageModel: Convert to array
LanguageModel: Mongoose LangModel (collection)
LanguageModel:    Possessive - can we trust it to kill a Topic ? Maybe as uncertain factor only (ML) ?
LanguageModel: My MS Graph subscription, Accelerate your GDPR compliance
LanguageModel: Seems that it cannot disqualify 'General' (startsUpper >> allLower)
LanguageModel: Fix bugs - generate report Good/Bad on 8500
LanguageModel: Count stats
LanguageModel: unigram: lower / upper n-gram: all upper / first word upper
LanguageModel: upper following possessive
LanguageModel: upper following number
LanguageModel: Too large model
LanguageModel: Memory should suffice for 15000-30000 emails --> at the end, do not write to DB the bigram,unigram that only have 1 occurrence
LanguageModel:      Say 15000 emails with Avg 200 tokens each --> 100000 unigram (with most stat), 200 bi-gram per email --> 15000 * 200 = 3M --> 6M tokens + 3M trigrams --> additional 9M tokens
LanguageModel: Since we only want to query the LM with existing Terms - Let's only build it for uni/bi/tri-gram matching terms (case-insensitive)
LanguageModel: Problem: This will work for reports, but not for terms-processing, where it finds new terms every second --> LM don't yet have them.
LanguageModel:
LanguageModel: Lazy LM: FullText search (or regexp search for /(porche|qa|sharepoint|story|select)/) --> tokenize --> loop every result subject+body searching for the hit (can do with same regex)
LanguageModel:    bi-gram terms --> split to 2 single --> tokens regex --> check result that the 2 query tokens are consecutive.
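LanguageModel: The consecutive-token check for bi-gram terms could look roughly like this (countBigram is an illustrative name; the real code would run on the regex/full-text hits):

```javascript
// Sketch: after a regex/full-text hit, verify the two tokens of a
// bi-gram term are actually consecutive in the tokenized text
// (case-insensitive, matching the term lookup).
function countBigram(tokens, bigramTerm) {
  const [w1, w2] = bigramTerm.toLowerCase().split(' ');
  let count = 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    if (tokens[i].toLowerCase() === w1 && tokens[i + 1].toLowerCase() === w2) count++;
  }
  return count;
}
```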
LanguageModel: Merge dict-topics with topTopics1 to new-branch topicsExpr and do not delete update.tokens
LanguageModel: Neural LM ? How does it work ?
LanguageModel: LM Toolkits?
LanguageModel: http://www.speech.sri.com/projects/srilm/papers/icslp2002-srilm.pdf
LanguageModel::Bug: 'Accepted' is a Topic according to LM: 80 allLower, 194 startsUpper, 194 startsUpperFirstInSentence
LanguageModel: many template emails with subject 'Accepted: Sprints and Stories Review'
LanguageModel:    Note: Not a duplicate subject (the whole subject is 'Accepted: <different text>') --> left nbr is empty and right nbr is not duplicated
LanguageModel: Detect duplicated patterns at subject prefix.
LanguageModel:Register:
LanguageModel: Report
LanguageModel: Add Good Topics + Refactor expected
LanguageModel: Add Accuracy + Accuracy in Bad + Accuracy in Good
LanguageModel: afterPos (without su), total
LanguageModel: Add POS histogram. / PROPN vs. Non-PROPN
LanguageModel: Cleanup Bad Topics (Account Executive, office online, online services, microsoft way ...)
JVMJavaScriptEngineforPorting: Mvn build + Deploy --> embed as build resource in topics-job.jar
JVMJavaScriptEngineforPorting: Add npm run build-report --> webpack + babel on report --> generates terms_processing/report/dist/topTopicsEs5.js --> commit --> subtree split to collage_stable
JVMJavaScriptEngineforPorting: topics-job fetch and load .js
JVMJavaScriptEngineforPorting: Download .js
JVMJavaScriptEngineforPorting: Sparse checkout mvn plugin: https://github.com/gastaldi/git-checkout-plugin
JVMJavaScriptEngineforPorting: To have mvn add the .js file to the .jar, copy .js into src/main/resources.
JVMJavaScriptEngineforPorting: See https://maven.apache.org/guides/getting-started/index.html#How_do_I_add_resources_to_my_JAR
JVMJavaScriptEngineforPorting:    mvn lingo: part of the generate-resources phase in the build lifecycle. A phase is a list of goals. Specifying a phase also executes all phases preceding it
JVMJavaScriptEngineforPorting: Idea Pre Launcher UI --> Add above maven goal
JVMJavaScriptEngineforPorting: Load stream of .js file from .Jar
JVMJavaScriptEngineforPorting: Keep ScriptEngine in per-worker global variable (ScriptEngine is not serializable and we do not need it to transfer between processes)
JVMJavaScriptEngineforPorting: If not --> x2-x6 slowdown.
JVMJavaScriptEngineforPorting: Refactor Report to be called both from research and Spark
JVMJavaScriptEngineforPorting: Special mode, that is different from research and from prod
JVMJavaScriptEngineforPorting: Sig: If Graph input (as opposed to mongo Input) --> options.dontRemoveSignature = true --> as it was already removed in terms-processing
JVMJavaScriptEngineforPorting: Duplicate: Refactor to use terms nbr (will be avail in both Graph and mongo)
JVMJavaScriptEngineforPorting: isPerson, isAutomated -
JVMJavaScriptEngineforPorting: Input: array of join-lines per-user
JVMJavaScriptEngineforPorting: Research wrapper will query mongo, create joined-artifacts
JVMJavaScriptEngineforPorting: Person, badLMTopic and isAutomated are provided in Input (not loaded from mongo)
JVMJavaScriptEngineforPorting: Output: per-user ranked topics + factors
JVMJavaScriptEngineforPorting: Port to Nashorn / GraalVM
JVMJavaScriptEngineforPorting: Motivation: We need a single source for Ranker and Algorithms --> so we need it either in JS (as we have today) --> Nashorn + later GraalVM, or rewrite all Report + Algos in Java.
JVMJavaScriptEngineforPorting: Same test but with GraalVM: https://amarszalek.net/blog/2018/06/08/evaluating-javascript-in-java-graalvm/
JVMJavaScriptEngineforPorting: Pass array of strings + array of arrays in InvokeFunction
JVMJavaScriptEngineforPorting: See Array.asList(1,2,3,4) + Foo extends AbstractJSObject - https://stackoverflow.com/questions/30571711/seamlessly-pass-arrays-and-lists-to-and-from-nashorn
JVMJavaScriptEngineforPorting: Babel report.js (or some subset of topTopics.js, without mongo or fs calls) --> examine if portable to Nashorn.
JVMJavaScriptEngineforPorting:Problem: Nashorn is deprecated https://openjdk.java.net/jeps/335
JVMJavaScriptEngineforPorting: ? jjs tool is not in JDK 11 at all ? We will upgrade Spark to newer libs requiring newer JDK soon ...
JVMJavaScriptEngineforPorting: Nashorn successor is GraalVM - an advanced Oracle VM combining JVM with many other Programming languages (JS, Python, R ....). It has --nashorn-compat flag
JVMJavaScriptEngineforPorting: Problem: Its community edition is free, but its Enterprise edition costs money (call us for pricing ...)
JVMJavaScriptEngineforPorting: Problem: Not sure our version of Spark (compat with JDK 8, but not with JDK 11) will run on GraalVM.
JVMJavaScriptEngineforPorting: findTopicInText --> change special regex '(?<!\\w)' + escapedTopicText + '(?!\\w)' --> lookbehind not supported on Nashorn.
JVMJavaScriptEngineforPorting: escapedTopicText = '(?:^|\\W)' + escapedTopicText + '(?!\\w)';
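The rewrite above can be sketched as follows. Since '(?:^|\W)' consumes the boundary character (unlike the unsupported lookbehind), the reported match offset has to be adjusted; this is an illustrative sketch, not the exact findTopicInText implementation:

```javascript
// Sketch: whole-word topic matching without lookbehind (lookbehind is
// not supported by Nashorn's regex engine). '(?:^|\W)' consumes the
// boundary char, so the real topic offset is the match offset plus the
// position of the captured topic inside the full match.
function findTopicOffset(topicText, text) {
  const escaped = topicText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp('(?:^|\\W)(' + escaped + ')(?!\\w)');
  const m = re.exec(text);
  if (!m) return -1;
  return m.index + m[0].indexOf(m[1]); // offset of the topic itself
}
```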
JVMJavaScriptEngineforPorting: webpack + babel
JVMJavaScriptEngineforPorting: Refactor nlp-helpers to move fs-extra functions to another util file
Maven:set JAVA_HOME=D:\Program Files\Java\jdk1.8.0_191
Maven:mvn install
Maven: mvn deploy
Maven:Deploy to local repo after build.
Maven: mvn -B archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.harmonie.topics -DartifactId=duplicate-detector
Maven: Add local repo (commit .jar in git)
Maven: https://maven.apache.org/plugins/maven-deploy-plugin/usage.html
Maven:<project>
Maven:...
Maven:<distributionManagement>
Maven:<repository>
Maven:<id>internal.repo</id>
Maven:<name>Java Algorithms Internal Repository</name>
Maven:</repository>
Maven:</distributionManagement>
Maven:"Milestone 2"
Milestone2:? checkout feature/topics_spark_integration and cherry-pick 2 commits from topics_sql_integration
Milestone2:: Revert Aug only change +
Milestone2: Merge Collage.Topics: develop --> master
Milestone2:: Eliyahu: Schedule topicStats Job every 3 days + AdvancedTopicsProcessor instead of basic.
Milestone2:: SQL integration
Milestone2: Merge from topics_spark_integration + develop --> diffs --> pull request
PostMilestone: nonEuropean - merge / complete test
PostMilestone: Signature fixes (sent from my + Get Outlook for XXX ...) --> update package.json of terms-processing --> consume new version of signature from github.
PostMilestone: YS: signature token enricher
PostMilestone: Yair: dict-topic boost
PostMilestone: Jul:
PostMilestone:node reports --user davidl -d jul --prod --userDataDbURL mongodb://localhost:27099/july
PostMilestone:: Merge from develop BEFORE PostgreSQL: 0f0db960a361d441f95d3fa4396489281d749e1d (New Scaling)
PostMilestone: node mongo_diff --collectionOld topics --dbUrlOld mongodb://localhost:27099/july --collectionNew topics --dbUrlNew mongodb://localhost:27017/collage --projection "{\"_id\": false, \"id\": 1}" > output\topics_july_rearch_vs_prod_diff.txt
PostMilestone:: Why is there a large diff between july_master.topics.count = 3748 and july.topics.count = 2381?
PostMilestone: Ex: master 'Invoice INV-0998' vs. develop split to term1 Invoice and term2: INV-0998
PostMilestone: { updateId: '<DB6PR0601MB232618066EAD0367499C335AAC4C0@DB6PR0601MB2326.eurprd06.prod.outlook.com>'}
PostMilestone: -t --> both master and develop extract 'Invoice INV-0998' --> doesn't repro
PostMilestone:node ..\batchExtractor\extractTerms.js -t "RE: Invoice INV-0998 from Fifty Five and Five Ltd for harmon.ie"
PostMilestone: Small database july (Hodaya July.json):
PostMilestone: token filter - deleted encoded R&D !== R%26D --> split vp R&D (research) to vp (prod)
PostMilestone: <DB5PR06MB156054DD1CEA4D62644AA606AF330@DB5PR06MB1560.eurprd06.prod.outlook.com> - has vp topic in Graph
PostMilestone:<DB5PR06MB1560ABA165F6CF271BE65DE6AF330@DB5PR06MB1560.eurprd06.prod.outlook.com>
PostMilestone: Mongo - terms 'VP R&D'
PostMilestone:"Postmortem"
Postmortem: Diff to prev Milestone - keep in devfiles a Milestone2 folder with results of 2 Spark Jobs, Mongo prod + research collections exports
Postmortem: Add createdAt in addition to updatedAt (Ex: Lavi is created and updated at a very short interval)
Postmortem: lmEnricher - write in topics collection debuginfo to understand why badLMTopic : true/false - langModel[gName].total, allLower, startsUpper, ...
Postmortem: Env: Everybody should have all env (maybe except special production env)
Postmortem: Start early Q/A / build --> deployment to Azure --> streaming datasets --> live Data
Postmortem: Do not wait for Algorithms Dev-complete.
Postmortem: Timestamp batchExtractor
Postmortem: Saved Queries - SQL Query Tool (complex queries with lots of joins)
Postmortem: One change at a time - easier to explain diffs
Postmortem:Diff between Mongo prod (terms-processing) and Mongo Research (batchExtractor)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Serializable (1 worker) in Production --> Slower but eliminates Concurrency diffs
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Signature is not removed in research --> topics.count in research is larger
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):1) Details: The whole longer compound terms appear only in the sig, but Microsoft appears both in the sig and in the body outside the sig, so the token filter (only in prod terms-processing) filters out all sig tokens except Microsoft.
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): topics collection - sig topics only in research (~1500 in collage_new)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): languageModels collection - sig tokens only in research
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): findTopicInText has bugs: Subject: RCC_<something> --> doesn't find topic RCC --> incorrectly assumes sig += 1
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Duplicate: Doesn't care about Sig --> so more Duplicates inside Sig text in research
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Token filter - May create diffs in dups (affect rightNbr of office365 term in <AM2PR06MB0612B07237D56F14B98D3597DD4F0@AM2PR06MB0612.eurprd06.prod.outlook.com>)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Concurrency in writes to mongo - only in Prod are there multiple readers/writers
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): isAutomated total = 2.
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Parent - Children containedTopicsTopicKeys (Concurrency)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Same topicKey: 'spexpo': 'SP Expo' has containedTopicsTopicKeys 'SP' vs. 'SPExpo' (has Zero containedTopicsTopicKeys)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Graph has a single global topic node for 'spexpo'. It creates containedTopicsTopicKeys in the topic node the first time (depending on whether it got it from SPExpo or SP Expo)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Concurrency: Depends on which worker created the topic node - the one with 'SP Expo' or the one with 'SPExpo'
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Duplicate topic in Subject Conversations bug
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Research Report removes duplicates (same email in several inboxes), but doesn't necessarily take the duplicate with the current report user conversationId --> subject topics
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):can be duplicate / non-dup
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): isMarketing / autoEmail: JS - if isAutomated --> continue --> never reaches autoEmail --> meaning the first filter to catch the <artifact,topicId> hides the others
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):Spark: isMarketing is counted separately from isAutomated --> meaning isMarketing: 1, isAutomated: 1 --> maybe the same filtered artifact
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Date range: Prod / Spark uses 1/3 month ago --> today --> need to fix that to the same dateRange as JS report
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierFrom = 1533070800000L; //AUG 01-Aug-2018 00:00:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1535749140000L; //AUG 31-Aug-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierFrom = 1527800400000L; //JUN 01-Jun-2018 00:00:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1530478740000L; //JUN 31-Jun-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor)://long dateBarrierTo = 1533070740000L; //JUL 31-Jul-2018 23:59:00
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):How to run
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):0) Prepare mailUpdates: node extractTerms.js --save --userDataDbURL mongodb://localhost:27099/july
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):1) Delete Graph / SQL DB
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):3) node convertFormats.js --userDataDbURL mongodb://localhost:27099/july --tokens
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor): Working with extractTerms when k8s is up (and listens to port 9000)
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):set NLPServiceURL=http://collage-dev-nlp.westeurope.azurecontainer.io:9000
DiffbetweenMongoprod(terms-processing)andMongoResearch(batchExtractor):"Duplicate"
Duplicate: office365 dup diff:
Duplicate:davidl Jun (hodaya sent files in chat)
Duplicate:dup1 in research vs 0 in prod
Duplicate: vp subject topic: count + dup diff: yc aug --prod
Duplicate: Latest change to not allow subject topics newLine (retry if failed dup) --> threshold 0.35 (easier) --> resulted in 12 dups, which are now gone.
Duplicate:Research new (my machine): vp rank: 35.6 count: 18 fromMe: 5 sig: 3 childRank: 15.6 childCount: 11
Duplicate: totalSentByAutomatedActor": 0,
Duplicate:totalIsMarketing: 0,
Duplicate:rank: 34.2,
Duplicate:childCount: 10,
Duplicate:dup: 4,
Duplicate:countChildRanks: 9,
Duplicate:totalArtifacts: 17
Duplicate: ESOP - dup 3 in research and dup : 0 in prod
Duplicate: Q: How come 'Noam .' is 11 chars (MIN_LEN_COMPARED = 10 for body)
Duplicate: Research: esop rank: 4.3 count: 7 dup: 3
Duplicate:"totalSentByAutomatedActor": 0,
Duplicate:"general": false,
Duplicate:"topicId": "esop",
Duplicate:"totalIsMarketing": 0,
Duplicate:"rank": 7.0,
Duplicate:"childRank": 0.0,
Duplicate:"childCount": 0,
Duplicate:"totalSentByMe": 0,
Duplicate:"dup": 0,
Duplicate:"countChildRanks": 0,
Duplicate:"totalArtifacts": 7
Duplicate: Duplicate bug: topicId trit yc jul is dup : 1 in JS and dup : 0 in Java
Duplicate: text-duplicate-detector --> commit to github + update package.lock of report.js
Duplicate:
Duplicate:1) leftNbr includes part of 'Accepted' --> ted: Harmon.ie and
Duplicate:2) inSubject: First calcNbrDist duplicate = false --> calcNbrDist with DUPLICATE_THRESHOLD_NL = 0.35 (incorrect - shouldn't be called for Subject)
Duplicate:- - - - - - - - -
Duplicate: Phase 3
Duplicate: Port getDuplicateScore + new code --> Java
Duplicate: Port duplicate.js --> Java and prepare call from topics-job
Duplicate: Spark infra calls EmailsDuplicate.java (which is )
Duplicate: Spark cutoff at 100 topTopics --> prepare Row objects --> call EmailsDuplicate.java --> handles Conversations + Artifact related --> call DuplicateDetector algo --> rerank)
Duplicate: Problem: reCalcTopicRank --> adjustTopicScore (which we do not have in java package)
Duplicate: Solution Alt: implement reCalcTopicRank outside package (but we anyway need adjustTopicScore)
Duplicate: node reports --user ramt -d jul --prod --userDataDbURL mongodb://localhost:27099/collage_test
Duplicate: g.V().hasLabel('user').as('user').out('owns').has('email','[email protected]').select('user')
Duplicate: ram puser = 026da82f-3dcc-41cf-b13b-1d3582024ef5
Duplicate:davidl puser = e4c91e9b-4a9c-4173-b268-9dc0c8e73c89
Duplicate:yaacovc puser = 468a410e-7bb7-416e-8bd5-e6a037c6b5f7
Duplicate: Build new small Graph (dekel-dev)
Duplicate: Change COMPUTERNAME in storage-worker index.js
Duplicate: extractTermsInner: Change query to -d jul
Duplicate: node extractTerms.js --save --userDataDbURL mongodb://localhost:27099/collage_test
Duplicate: Change query in convertFormats.js
Duplicate: node convertFormats.js --userDataDbURL mongodb://localhost:27099/collage_test --tokens
Duplicate: Change Spark dateBarrierFrom / To
Duplicate: Robustness fix in duplicate.js (+ java) - may not have nbrs due to some physical
Duplicate: Graal regression tests for Algo change (+ new unit tests)
Duplicate: Graal interop from report --> replace duplicate.js with EmailDuplicate.java
Duplicate: Topic.artifacts --> populate with a list of JsonNode + ensure artifact.filtered is marked.
Duplicate: Commit: "Production:
Duplicate: Do not filter out tokens between endOfSubject and bodyStart indexes
Duplicate: Integration Test: Generate Tiny Graph with nbrs of few topics --> replace Mock
Duplicate: Phase 1 -> as in JS --> pass unitTests
Duplicate: Complete numDplicates / getDuplicatesScore
Duplicate: Add Wrapper to text-duplicate-detector::index.js (committed in github and requires original JS) --> to call Java.isDuplicate instead of JS isDuplicate / numDuplicates
Duplicate: reports.js GraalVM integration with JS to replace github/text-duplicate-detector
Duplicate: Test Chrome debugger - can it also debug Java?
Duplicate: Mocha: Add --inspect-brk after node_modules/mocha/bin/mocha (and not after $GRAALVM_HOME/bin/node !!!)
Duplicate: Commit Port to mocha - package.json devDependencies + port commented-out jest-specific test() calls
Duplicate:export GRAALVM_HOME=~/graalvm-ce-1.0.0-rc9/
Duplicate:$GRAALVM_HOME/bin/node --jvm --polyglot --jvm.cp=$dupPath node_modules/mocha/bin/mocha __tests__/duplicate.test.js
Duplicate: Note: jest seems to hang - port tests to mocha
Duplicate: Problem: TypeError: Access to host class Main.DuplicateDetector is not allowed or does not exist.
Duplicate: Rebuild .jar without errors - it is not currently built.
Duplicate:Bug: Pattern.quote(topicText) --> produces \QDekel Cohen\E --> the middle 2-space RE in \QDekel( {1,2}|%20 )Cohen is also escaped --> replace with JS escaping code
Duplicate: Q/A: [ and all other specials
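A sketch of the JS-side per-character escaping that replaces Pattern.quote (covering '[' and the other specials from the Q/A note); the flexible-whitespace joiner between words follows the ` {1,2}|%20 ` pattern from the bug note, but the exact production pattern may differ:

```javascript
// Sketch of the fix: escape regex metacharacters per character instead
// of Java's Pattern.quote (\Q...\E), so a flexible-whitespace pattern
// can still be spliced between the words afterwards.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
function topicPattern(topicText) {
  // allow 1-2 spaces or an encoded %20 between words (per the note)
  return topicText.split(' ').map(escapeRegExp).join('(?: {1,2}|%20)');
}
```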
Duplicate: Changes Java vs. JS
Duplicate: Proto now explicitly has topic1 and topic2 for 2 different topic texts (the JS topic can be a string or an array of 2)
Duplicate: GraalVM integration with JS to replace github/text-duplicate-detector
Duplicate: Ex:
Duplicate: $GRAALVM_HOME/bin/node --polyglot --jvm server.js
Duplicate: cd ~/graalvm-ce-1.0.0-rc9/graalvm-demos/polyglot-javascript-java-r
Duplicate: Phase 2
Duplicate: Conversation pair datastruct for inSubject topics
Duplicate: Bug: Us Submission: conversationPair causes only part of nbrOccur to get duplicated - why only 12 ?
Duplicate: Invitation: dup: 3->undefined - Why? It appears in subjects of several conversations
Duplicate: Q/A: nbrDistShortInSubject (Us Submission)
Duplicate: THRESHOLD_SCORE_DUP_IN_SUBJECT --> 3 ?
Duplicate: Q/A:
Duplicate: Compare to reports before changes: old version: 9ef7cbdfdade36930c2c3d9dca003e2954775dc4
Duplicate:Bug: YC Jul - missing duplicates in diff
Duplicate: azuredatafactory rank: 2.4->4.2 count: 5 fromMe: 1 dup: 4->2
Duplicate: Run reports and commit in branch (duplicate_occur) --use Conversation Ids
Duplicate: occurs: findTopicInText (first in body only and if appear in both subject and body, take the subject) -->
Duplicate: New: all occur of topic are considered
Duplicate: No dups between .occur of same artifact
Duplicate: New: Subject topics are compared only against other subject topics
Duplicate: conversationId vs. normalized subject
Duplicate: Take max dupScore of a single artifact.occur[<any>] --> If above threshold (3) --> all the artifact is discounted (for this topic)
Duplicate: Update topicRank
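The per-artifact rule above (take the max dupScore over an artifact's .occur; above the threshold of 3, discount the whole artifact for this topic) could look roughly like this. scoreOccur is a stand-in for the real pairwise duplicate scorer, and the field names are assumptions:

```javascript
// Sketch of the rule above: an artifact is discounted for a topic if
// the max duplicate score over any of its occurrences crosses the
// threshold (3 per the note). scoreOccur stands in for the real scorer.
const THRESHOLD_SCORE_DUP = 3;
function filterDuplicatedArtifacts(artifacts, scoreOccur) {
  for (const art of artifacts) {
    const maxScore = Math.max(0, ...art.occur.map(scoreOccur));
    art.filtered = maxScore > THRESHOLD_SCORE_DUP; // drop whole artifact
  }
  return artifacts.filter(a => !a.filtered);
}
```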
Duplicate: getNewlineNbr: retry with newline nbrs when the diff of normal nbrs --> no duplicate
Duplicate: use left,right nbr from .occur array (no findTopicInText) --> inTitle
Duplicate: { nbrLeft, nbrRight } --> { left, right }
Duplicate: artsReShaped = topic.artifacts.map --> artifact.about.occur -->
Duplicate:[artifact, { inSubject, leftNbr, rightNbr }]
Duplicate: keepAndConvertRelevantArtifacts --> !isSameArtifact + !isSameConversation (already exist)
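The artsReShaped mapping above, as a minimal sketch: flatten each artifact's about.occur array into [artifact, { inSubject, leftNbr, rightNbr }] pairs (the same-artifact / same-conversation filtering happens downstream in keepAndConvertRelevantArtifacts). Field names beyond those in the notes are assumptions:

```javascript
// Sketch of the reshaping above: one pair per topic occurrence, so
// duplicate detection can compare nbrs from .occur directly (no
// findTopicInText re-scan).
function reshapeArtifacts(artifacts) {
  const pairs = [];
  for (const artifact of artifacts) {
    for (const occ of artifact.about.occur) {
      pairs.push([artifact, {
        inSubject: !!occ.inSubject,
        leftNbr: occ.left,
        rightNbr: occ.right
      }]);
    }
  }
  return pairs;
}
```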
Duplicate: Diff Algo
Duplicate: https://github.com/java-diff-utils/java-diff-utils
Duplicate: https://github.com/google/diff-match-patch
Duplicate: https://github.com/google/diff-match-patch/wiki/Language:-Java
Duplicate: Productization:
Duplicate: Test Perf: Does query to about edges of topicId=='Google Calendar', filtered by last 100 (ordered by timestamp) --> Expensive scan (use explain) or indexed fast query ?
Duplicate: Alt A: Redis stores List key=topicId last 100 (see LTRIM)
Duplicate: Q: How much storage memory required ?
Duplicate:A: 250MB. Assume 20000 distinct Terms --> 5000 after filtering out the long tail (count <= 3) --> each of the 5000 terms has 50 occurs on avg --> each Term-occur requires 1KB with nbr
Duplicate: Online: Terms Processing - Detect Duplicate of new Term occur against the last 100 occur of this Term.
Duplicate: Keep JS code
Duplicate: More Complex
Duplicate: Less Context Sensitive: If a user duplicates-same-nbr a Term (Google) and it also occur in many other user's emails (but not same-nbr) --> last 100 may not be enough
Duplicate:--> other users nbrs push out the duplicated nbrs
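An in-memory sketch of Alt A above (a Redis list per topicId, trimmed to the last 100 occurrences, i.e. LPUSH + LTRIM 0 99); plain arrays stand in for Redis here, and the occur payload shape is an assumption:

```javascript
// In-memory analogue of the Redis Alt A: keep only the last 100
// occurrences per topicId. A real implementation would LPUSH the new
// occur and LTRIM the list, keyed by topicId.
const MAX_OCCURS = 100;
const occursByTopic = new Map();
function pushOccur(topicId, occur) {
  let list = occursByTopic.get(topicId);
  if (!list) { list = []; occursByTopic.set(topicId, list); }
  list.unshift(occur);                                    // LPUSH
  if (list.length > MAX_OCCURS) list.length = MAX_OCCURS; // LTRIM 0 99
  return list;
}
```

This illustrates the caveat in the note: with a global per-topic cap, other users' occurrences of the same term can push a single user's duplicated nbrs out of the window.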
Duplicate: Offline: TopTopics: rank += 1/5 or filterOut for each of the Dup edges
Duplicate: Simpler - similar to today's logic
Duplicate: More Context Aware: Per User / Per Affinity
Duplicate: Dup --> port to Spark Java
Duplicate: sent from my Samsung Galaxy smartphone
Duplicate: Problem: Not enough support from left side (diff 70 chars on left and only 12 are identical)
Duplicate: If diffRatio is near the threshold, but not low enough (0.3) --> return a duplicate probability score (in addition to duplicate=false)
Duplicate:--> require higher count
Duplicate: Problem: left nbr match is very small (12 chars)
Duplicate: In addition to diff score, return also the matched text chunks ('sent from my') from func that compare 2 to func that compares array of N
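A sketch of the soft-threshold idea above: below the hard 0.3 diffRatio threshold it is a duplicate; in a band just above it, return a fractional score so callers can require a higher supporting count. The band width (0.1) and the 0.5 score are assumptions for illustration:

```javascript
// Sketch: soft duplicate decision. HARD_THRESHOLD (0.3) is from the
// note; SOFT_BAND and the 0.5 maybe-duplicate score are assumptions.
const HARD_THRESHOLD = 0.3;
const SOFT_BAND = 0.1;
function duplicateScore(diffRatio) {
  if (diffRatio <= HARD_THRESHOLD) return { duplicate: true, score: 1 };
  if (diffRatio <= HARD_THRESHOLD + SOFT_BAND) {
    return { duplicate: false, score: 0.5 }; // maybe-duplicate: require more count
  }
  return { duplicate: false, score: 0 };
}
```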
Duplicate:Bug: Should not count Duplicates in SP Urls
Duplicate: Note: SP Urls shouldn't occur in generated paragraphs --> so low risk of missing a duplicate.
Duplicate: Ex: projectvenice dist: {"left":0,"right":1,"duplicate":true} inSubject:true sub: RE: Harmon.ie/Project Venice ("Euclid") sync oSub: RE: Harmon.ie/Project Venice sync
Duplicate:Q: Increase inSubject min dupCount to ~ 5 ? --> we want only spammers
Duplicate: maybe should increase threshold > 2 ?
Duplicate: Still catches noise such as 'Industry News'
Duplicate: Do not kill important Topics that repeats in 3,4 emails
Duplicate:: Why moving tpStat.artifacts.push(jArt); changes dup of outlook ?
Duplicate:: Are all topics in sigs bad ?
Duplicate: Detect dups after removing signature --> otherwise 'Product Strategy' (david sig) --> detected as dup
Duplicate:"DONE Duplicate
Duplicate:--------------------
Duplicate: ussubmission (Subject:Contact Us Submission) --> ussubmission in many different threads (Contact is blacklist)
Duplicate:--> but rightNbr is empty (end of subject) --> duplicate = false
Duplicate: Count too short matches as maybe Duplicate (0.5) --> require 6 matches instead of 6.
Duplicate: 9511 Extract neighborhood for each term
Duplicate: Problem: tokens indices are in original tokens array (not in subjectTokens and bodyTokens)
Duplicate: Problem: getNormalizedSubject - how to (re) implement using tokens only ?
Duplicate: return the result of getBody
Duplicate: Pass it to duplicateEnricher
Duplicate: Problem: It contains RE: (need to normalize ) + contains subjectEndToken (need to remove or to stop)
Duplicate: body topics: minOffset
Duplicate: tokens[idxTokenStartBody].characterOffsetBegin
Duplicate: subject topics: maxOffset
Duplicate: tokens[idxSubjectEndToken].characterOffsetEnd
Duplicate: Trim leftNbr in subject using getNormalizedSubject
Duplicate: Problem: tokens cached mode - where to get body from ?
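A possible shape for getNormalizedSubject per the notes above (strip RE:/FW: prefixes, stop at the subject-end marker); '__END_OF_SUBJECT__' is a hypothetical stand-in for the real subjectEndToken:

```javascript
// Sketch of getNormalizedSubject: remove reply/forward prefixes and
// truncate at a subject-end marker. The endToken default is a
// hypothetical placeholder, not the real subjectEndToken value.
function getNormalizedSubject(subject, endToken = '__END_OF_SUBJECT__') {
  let s = subject;
  const endIdx = s.indexOf(endToken);
  if (endIdx >= 0) s = s.slice(0, endIdx);
  // repeatedly strip RE: / FW: / FWD: prefixes, case-insensitive
  while (/^\s*(re|fw|fwd)\s*:\s*/i.test(s)) {
    s = s.replace(/^\s*(re|fw|fwd)\s*:\s*/i, '');
  }
  return s.trim().toLowerCase();
}
```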
Duplicate:
Duplicate: Q/A: Bug: left,right are incorrect --> tokens
Duplicate: token.before sometimes missing (Crash) ?
Duplicate: Ex: Privacy Statement <DB5PR06MB156054DD1CEA4D62644AA606AF330@DB5PR06MB1560.eurprd06.prod.outlook.com>
Duplicate: Drafts: node --inspect-brk extractTerms.js --userDataDbURL mongodb://localhost:27099/collage_new -u "<AM5PR0601MB24347F1674046465748CD223C52C0@AM5PR0601MB2434.eurprd06.prod.outlook.com>" --noLM --noPer
Duplicate:tokens[i].originalText + "---" + body.substr(tokens[i].characterOffsetBegin, 70)
Duplicate:for (let i = 0; i < 184; ++i) { // 184 = tokens.length in this repro; verify offsets match originalText
Duplicate:if (body.substr(tokens[i].characterOffsetBegin, tokens[i].originalText.length) !== tokens[i].originalText) { console.log(i); }
Duplicate:}
Duplicate: writes to terms.occur (array instead of cell level)
Duplicate: Note: Not a blocker, if can mock an array of topic nbrs as an input to getDuplicates in Spark
Duplicate: Duplicate-Subject: Why don't we use ConversationId ?
Duplicate: VIP access:
Duplicate: Bryan Oct-Nov has 67 mails with subject 'VIP access', of which 27 isAutomated and 40 were forwarded or replied
Duplicate: Problem: Sender is not Automated, but Subject was created by automated systems --> therefore very common
Duplicate: Remaining 40 in 12 conversations
Duplicate: Move Reports/java --> Reports/terms_processing/java
Duplicate: npm run collage_stable
Duplicate: Remove branch + remote branch collage_stable_all
Duplicate: run npm_install_all
Duplicate: commit new package.json + .lock file
Duplicate: Add the /path/to/sparse-checkout/repo to <repositories> - see https://devcenter.heroku.com/articles/local-maven-dependencies
Duplicate: Git sparseCheckout the repo from Collage repo topics-job