- A powerful benchmark, paper, medium - normalizing datasets shows that many NLP algorithms have not actually advanced in terms of metrics.
- Vidhya on spacy vs ner - tutorial + code on how to use spacy for pos, dep, ner, compared to nltk/corenlp (sner etc). The results reflect a global score, not one specific to LOC for example.
- The spaCy course
- SPACY OPTIMIZATION - LP using CYTHON and SPACY.
- The big bad 600, medium
- Has all the known libraries
- Comparison between spacy, pytorch, allenlp - very basic info
- Comparison spacy,nltk core nlp
- Comparing Production grade nlp libs
- nltk vs spacy
- Fb’s laser
- Xlm, xlm-r
- Google universal embedding space.
- Synonyms, similar embedded words (w2v), back translation, contextualized word embeddings, text generation
- Yonatan hadar also has a medium post about this
TF-IDF - how important is a word to a document in a corpus
TF(t) = (number of times term t appears in a document) / (total number of terms in the document), i.e., the frequency of the word within the document, normalized because documents have different sizes.
IDF(t) = log_e(total number of documents / number of documents containing term t), which measures how important (rare) a term is across the corpus.
TF-IDF(t) = TF(t) * IDF(t).
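A minimal sketch (toy corpus, not from any of the linked articles) of computing TF-IDF with sklearn's TfidfVectorizer; note sklearn uses a smoothed IDF variant by default:

```python
# Minimal TF-IDF illustration with scikit-learn (toy corpus for demonstration).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"]

vectorizer = TfidfVectorizer()           # default idf is smoothed: ln((1+n)/(1+df)) + 1
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

# Inspect the weight of each term in the first document
for term, idx in vectorizer.vocabulary_.items():
    weight = tfidf[0, idx]
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```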
Data sets:
- mean(IDF(i) * w2v word vectors (i)) with or without reducing PC1 from the whole w2 average (amir pupko)
import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # mean of (optionally IDF-weighted) word vectors; model is a gensim KeyedVectors-like object
    if words:
        return np.mean(idf * model[words], axis=0)
    else:
        print('we have an empty list')
        return []

idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(lambda x: [idf_mapping[token] for token in splitter(x)])
logs_sequences_df['mean_weighted_idf_w2v'] = [
    mean_weighted_embedding(ft,
                            splitter(logs_sequences_df['message'].iloc[i]),
                            1 / np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1, 1))
    for i in range(logs_sequences_df.shape[0])
]
- Multiply by TFIDF
- Enriching using lstm-next word (char or word-wise)
- Using external wiktionary/pedia data for certain words, phrases
- Finding clusters of relevant data and figuring out if you can enrich based on the content of the clusters
- Applying deep nlp methods without big data, i.e., sparseness
- Benchmarking tokenizers for optimal processing speed
- Using nltk with gensim
- Multiclass text classification with svm/nb/mean w2v/d2v - tutorial with code and notebook.
- Basic pipeline for keyword extraction
- DL for text classification
- Logistic regression with word ngrams
- Logistic regression with character ngrams
- Logistic regression with word and character ngrams
- Recurrent neural network (bidirectional GRU) without pre-trained embeddings
- Recurrent neural network (bidirectional GRU) with GloVe pre-trained embeddings
- Multi channel Convolutional Neural Network
- RNN (Bidirectional GRU) + CNN model
- LexNLP - glorified regex extractor
- How to convert between verb/noun/adjective/adverb forms using Wordnet
- Complete guide for training your own Part-Of-Speech Tagger - using the Penn Treebank tagset. Using nltk or stanford pos taggers, creating features from the actual words (manual stemming, etc.), using the tags as labels for a random forest, thus creating a POS classifier of our own. Not entirely sure why we need to create a classifier from a “classifier”.
- Word net introduction - POS, lemmatize, synon, antonym, hypernym, hyponym
- Sentence similarity using wordnet - using synonyms cumsum for comparison. Today replaced with w2v mean sentence similarity.
- Stemmers vs lemmatizers - stemmers are faster, lemmatizers are POS / dictionary based, slower, converting to base form.
- Chunking - shallow parsing, compared to deep, similar to NER
- NER - using nltk chunking as a labeller for a classifier, training one of our own. Using IOB features as well as others to create a new NER classifier which should be better than the original by using additional features. Also uses a new English dataset, GMB.
- Building nlp pipelines, functions coroutines etc..
- Training ner using generators
- Metrics, tp/fp/recall/precision/micro/weighted/macro f1
- Tf-idf
- Nltk for beginners
- Nlp corpora corpuses
- bow/bigrams
- Textrank
- Word cloud
- Topic modelling using gensim, lsa, lsi, lda,hdp
- Spacy full tutorial
- POS using CRF
- Python Module to get Meanings, Synonyms and what not for a given word using vocabulary (also a comparison against word net) https://vocabulary.readthedocs.io/en/…
For a given word, using Vocabulary, you can get its
- Meaning
- Synonyms
- Antonyms
- Part of speech : whether the word is a noun, interjection or an adverb, etc.
- Translate : Translate a phrase from a source language to the desired language.
- Usage example : a quick example on how to use the word in a sentence
- Pronunciation
- Hyphenation : shows the particular stress points(if any)
Swiss army knife libraries
- textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spacy library. With the fundamentals — tokenization, part-of-speech tagging, dependency parsing, etc. — delegated to another library, textacy focuses on the tasks that come before and follow after.
Collocation
- What is collocation? - “the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance.” Medium tutorial, quite good, comparing freq/t-test/pmi/chi2 with github code
- A website dedicated to collocations, methods, references, metrics.
- Text analysis for sentiment, doing feature selection a tutorial with chi2(IG?), part 2 with bi-gram collocation in ntlk
- Text2vec in R - has ideas on how to use collocations for downstream tasks, LDA, W2V, etc. Also explains PMI and other metrics; note that the gensim metric is unsupervised and probabilistic.
- NLTK on collocations
- A blog post about keeping or removing stopwords for collocation, useful but no firm conclusion. IMO we should remove them beforehand.
- A blog post with code of using nltk-based collocation
- Small code for using nltk collocation
- Another code / score example for nltk collocation
- Jupyter notebook on manually finding collocation - not useful
- Paper: Ngram2Vec - Github We introduce ngrams into four representation methods. The experimental results demonstrate ngrams’ effectiveness for learning improved word representations. In addition, we find that the trained ngram embeddings are able to reflect their semantic meanings and syntactic patterns. To alleviate the costs brought by ngrams, we propose a novel way of building co-occurrence matrix, enabling the ngram-based models to run on cheap hardware
- Youtube on bigrams, collocation, mutual info and collocation
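A small sketch of scoring bigram collocations with NLTK's collocation finder (the Brown corpus here is just for illustration):

```python
# Sketch: scoring bigram collocations with NLTK (PMI vs. likelihood ratio).
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("brown", quiet=True)
words = [w.lower() for w in nltk.corpus.brown.words() if w.isalpha()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                 # drop bigrams seen fewer than 5 times

measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))       # top bigrams by pointwise mutual information
print(finder.nbest(measures.likelihood_ratio, 10))
```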
Language detection
- Using google lang detect - 55 languages af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Stemming
How to measure a stemmer?
Phrase modelling
- Phrase Modeling - using gensim and spacy
Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than we would expect by random chance. The formula our phrase models will use to determine whether two tokens A and B constitute a phrase is:
(count(A B) − count_min) / (count(A) × count(B)) × N > threshold
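A minimal sketch of the same idea with gensim's Phrases, whose default scorer implements the count-based formula above (min_count plays the role of count_min); the toy sentences and thresholds are illustrative only:

```python
# Sketch: learning bigram phrases with gensim's Phrases (count-based scoring as above).
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city", "never", "sleeps"],
]

phrases = Phrases(sentences, min_count=1, threshold=0.5)  # toy values for a toy corpus
bigram = Phraser(phrases)                                 # lighter, faster applier

# Tokens that score above the threshold get joined, e.g. 'new_york'
print(bigram[["i", "moved", "to", "new", "york"]])
```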
Document classification
Hebrew NLP tools
- HebMorph last update 7y ago
- Hebmorph elastic search, Hebmorph blog post, and other blog posts, youtube
- Awesome hebrew nlp git, git
- Hebrew-nlp service docs, the features (morphological analysis, normalization etc), git
- Apache solr stop words (dead)
- SO on hebrew analyzer/stemming, here too
- Neural sentiment benchmark using two algorithms, for character and word level lstm/gru - the paper
- Hebrew word embeddings
- Paper for rich morphological datasets for comparison - rivlin
Semantic roles:
- Snorkel - using weak supervision to create less noisy labelled datasets
- Snorkel metal weak supervision for multi-task learning. Conversation, git
- Yes, the Snorkel project has included work before on hierarchical labeling scenarios. The main papers detailing our results include the DEEM workshop paper you referenced (https://dl.acm.org/doi/abs/10.1145/3209889.3209898) and the more complete paper presented at AAAI (https://arxiv.org/abs/1810.02840). Before the Snorkel and Snorkel MeTaL projects were merged in Snorkel v0.9, the Snorkel MeTaL project included an interface for explicitly specifying hierarchies between tasks which was utilized by the label model and could be used to automatically compile a multi-task end model as well (demo here: https://github.com/HazyResearch/metal/blob/master/tutorials/Multitask.ipynb). That interface is not currently available in Snorkel v0.9 (no fundamental blockers; just hasn't been ported over yet).
- There are, however, still a number of ways to model such situations. One way is to treat each node in the hierarchy as a separate task and combine their probabilities post-hoc (e.g., P(credit-request) = P(billing) * P(credit-request | billing)). Another is to treat them as separate tasks and use a multi-task end model to implicitly learn how the predictions of some tasks should affect the predictions of others (e.g., the end model we use in the AAAI paper). A third option is to create a single task with all the leaf categories and modify the output space of the LFs you were considering for the higher nodes (the deeper your hierarchy is or the larger the number of classes, the less apppealing this is w/r/t to approaches 1 and 2).
- mechanical turk calculator
- Mturk alternatives
- Workforce / onespace
- Jobby
- Shorttask
- Samasource
- Figure 8 - pricing - definite guide
- Brat nlp annotation tool
- Prodigy by spacy,
- Doccano - prodigy open source alternative but with user management & statistics out of the box
- Medium Lighttag - has some cool annotation metrics\tests
- Loopr.ai - An AI powered semi-automated and automated annotation process for high quality data. Object detection, analytics, nlp, active learning.
- Medium Assessing annotator disagreement
- A great python package for measuring disagreement on GH
- Reliability is key, and not just mechanical turk
- 7 myths about annotation
- Annotating twitter sentiment using humans, 3 classes, 55% accuracy using SVMs. they talk about inter agreement etc. and their DS is partially publicly available.
- Exploiting disagreement
- Vader annotation
- They must pass an english exam
- They get control questions to establish their reliability
- They get a few sentences over and over again to establish inter disagreement
- Two or more people get overlapping sentences to establish disagreement
- 5 judges for each sentence (makes 4 useless)
- They dont know each other
- Simple rules to follow
- Random selection of sentences
- Even classes
- No experts
- Measuring reliability kappa/the other kappa.
- Label studio
Ideas:
- Active learning for a group (or single) of annotators, we have to wait for all annotations to finish each big batch in order to retrain the model.
- Annotate a small group, automatic labelling using knn
- Find the nearest neighbors for our optimal set of keywords per “category”
- For a group of keywords, find their knn neighbors in w2v-space, alternatively find k clusters in w2v space that has those keywords. For a new word/mean sentence vector in the ‘category’ find the minimal distance to the new cluster (either one of approaches) and this is new annotation.
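A hedged sketch of the kNN-propagation idea above; embed() is a placeholder for whatever sentence embedding you actually use (mean w2v, sentence vectors, etc.), and the texts/labels are made up:

```python
# Sketch of "annotate a small set, propagate labels with kNN in embedding space".
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

labeled_texts = ["refund my order", "app crashes on login", "great service"]
labels = ["billing", "bug", "praise"]
unlabeled_texts = ["i want my money back", "login screen freezes"]

def embed(texts):
    # placeholder: replace with mean w2v / sentence-embedding vectors of your choice
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 300))

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(embed(labeled_texts), labels)
print(knn.predict(embed(unlabeled_texts)))   # proposed (weak) annotations to review
```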
- Myth One: One Truth Most data collection efforts assume that there is one correct interpretation for every input example.
- Myth Two: Disagreement Is Bad To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
- Myth Three: Detailed Guidelines Help When specific cases continuously cause disagreement, more instructions are added to limit interpretations.
- Myth Four: One Is Enough Most annotated examples are evaluated by one person.
- Myth Five: Experts Are Better Human annotators with domain knowledge provide better annotated data.
- Myth Six: All Examples Are Created Equal The mathematics of using ground truth treats every example the same; either you match the correct result or not.
- Myth Seven: Once Done, Forever Valid Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
- Conclusions:
- Experts are the same as a crowd
- Costs a lot less $$$.
*** The best tutorial on agreements, cohen, david, kappa, krip etc.
- Cohens kappa (two people)
but you can use it to map a group by calculating agreement for each pair
- Why cohens kappa should be avoided as a performance measure in classification
- Why it should be used as a measure of classification
- Kappa in plain english
- Multilabel using kappa
- Kappa and the relation with accuracy (redundant, % above chance, should not be used due to other reasons researched here)
The Kappa statistic varies from 0 to 1, where.
- 0 = agreement equivalent to chance.
- 0.1 – 0.20 = slight agreement.
- 0.21 – 0.40 = fair agreement.
- 0.41 – 0.60 = moderate agreement.
- 0.61 – 0.80 = substantial agreement.
- 0.81 – 0.99 = near perfect agreement
- 1 = perfect agreement.
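A minimal example of Cohen's kappa for two raters with sklearn (labels are illustrative):

```python
# Cohen's kappa for two annotators over the same items.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "neg", "neg", "pos", "neu", "pos"]
rater_b = ["pos", "neg", "pos", "pos", "neu", "neg"]

print(cohen_kappa_score(rater_a, rater_b))   # 1.0 = perfect, 0 = chance-level agreement
```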
- Fleiss’ kappa, from 3 people and above.
Kappa ranges from 0 to 1, where:
- 0 is no agreement (or agreement that you would expect to find by chance),
- 1 is perfect agreement.
- Fleiss’s Kappa is an extension of Cohen’s kappa for three raters or more. In addition, the assumption with Cohen’s kappa is that your raters are deliberately chosen and fixed. With Fleiss’ kappa, the assumption is that your raters were chosen at random from a larger population.
- Kendall’s Tau is used when you have ranked data, like two people ordering 10 candidates from most preferred to least preferred.
- Krippendorff’s alpha is useful when you have multiple raters and multiple possible ratings.
- Krippendorfs alpha
- Ignores missing data entirely.
- Can handle various sample sizes, categories, and numbers of raters.
- Applies to any measurement level (i.e. (nominal, ordinal, interval, ratio).
- Values range from 0 to 1, where 0 is perfect disagreement and 1 is perfect agreement. Krippendorff suggests: “[I]t is customary to require α ≥ .800. Where tentative conclusions are still acceptable, α ≥ .667 is the lowest conceivable limit (2004, p. 241).”
- Supposedly multi label
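A sketch for three or more raters, assuming the statsmodels and krippendorff packages are installed (ratings are illustrative):

```python
# Fleiss' kappa (statsmodels) and Krippendorff's alpha (krippendorff package).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# rows = items, columns = raters, values = category ids
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])

table, _ = aggregate_raters(ratings)          # items x categories count table
print("Fleiss kappa:", fleiss_kappa(table))

# krippendorff expects raters x items (use np.nan for missing ratings)
print("Krippendorff alpha:",
      krippendorff.alpha(reliability_data=ratings.T, level_of_measurement="nominal"))
```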
- MACE - the new kid on the block. It learns in an unsupervised fashion to:
- a) identify which annotators are trustworthy and
- b) predict the correct underlying labels. We match performance of more complex state-of-the-art systems and perform well even under adversarial conditions
- MACE does exactly that. It tries to find out which annotators are more trustworthy and upweighs their answers.
- Git -
When evaluating redundant annotations (like those from Amazon's Mechanical Turk), we usually want to
- aggregate annotations to recover the most likely answer
- find out which annotators are trustworthy
- evaluate item and task difficulty
MACE solves all of these problems by learning competence estimates for each annotator and computing the most likely answer based on those competences.
Calculating agreement
- Compare against researcher-ground-truth
- Self-agreement
- Inter-agreement
- Medium
- Kappa cohen
- Multi annotator with kappa (which isn't), is this okay?
- Github: compute Fleiss Kappa 1
- Fleiss Kappa Example
- GWET AC1, paper: as an alternative to kappa, and why
- Website, Krippendorff vs Fleiss calculator
Machine Vision annotation
Troubling shooting agreement metrics
- Imbalanced data sets, i.e., why is reliability so low when the percentage of agreement is high?
- Interpretation of kappa values
- Interpreting agreement, Accuracy precision kappa
- Cnn for text - tal perry
- 1D CNN using KERAS
- Automatic creation of KG using spacy and networkx. Knowledge graphs can be constructed automatically from text using part-of-speech and dependency parsing. The extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library spaCy.
- Medium on Reconciling your data and the world of knowledge graphs
- Medium Series:
-
- Email summarization but with a great intro (see image above)
- With nltk - words assigned weighted frequency, summed up in sentences and then selected based on the top K scored sentences.
- Awesome-text-summarization on github
- Methodical review of abstractive summarization
- Medium on extractive and abstractive - overview with the abstractive code
- NAMAS - Neural Attention Model for Abstractive Sentence Summarization - summarizes single sentences quite well, github
- Abstractive vs extractive, blue intro
- Intro to text summarization
- Paper: survey on text summ, arxiv
- Very short intro
- Intro on encoder decoder
- Unsupervised methods using sentence embeddings (long and good) - using sent2vec, clustering, picking by rank
- Abstractive summarization using bert for sota
- Abstractive
- Git1: uses pytorch 0.7, fails to work no matter what i did
- Git2, keras code for headlines, missing dataset
- Encoder decoder in keras using rnn, claims cherry picked results, the majority is prob not as good
- A lot of Text summarization algos on git, using seq2seq, using many methods, glove, etc -
- Summarization with point generator networks on git
- Summarization based on gigaword claims SOTA
- Facebooks neural attention network NAMAS on git
- Medium on summarization with TensorFlow on news articles from cnn
- Keywords extraction
- The best text rank presentation
- Text rank by gensim on medium
- Text rank 2
- Text rank - custom code, extractive vs abstractive, how to use, some more theoretical info and page rank intuition.
- Text rank paper
- Improving textrank using adjectival and noun compound modifiers
- Unread - New similarity function paper for textrank
- Extractive summarization
- Text rank with glove vectors instead of tf-idf as in the paper (sam)
- Medium with code on extractive using word occurrence similarity + cosine, pick top based on rank
- Medium on methods, freq, LSA, linking words, sentences,bayesian, graph ranking, hmm, crf,
- Wiki on automatic summarization, abstractive vs extractive,
- Pyteaser, textteaser, lexrank, pytextrank summarization models & rouge-1/n and BLEU metrics to determine quality of summarization models. Bottom line is that textrank is competitive to sumy_lex
- Sumy
- Pyteaser
- Pytextrank
- Lexrank
- Gensim tutorial on textrank
- Email summarization
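A minimal TextRank-style extractive summarizer sketch (TF-IDF sentence similarity + PageRank via networkx); this illustrates the idea, it is not any specific library's implementation:

```python
# Minimal TextRank-style extractive summarization: build a sentence-similarity
# graph (TF-IDF + cosine) and rank sentences with PageRank.
import nltk
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def textrank_summary(text, k=2):
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)                 # sentence x sentence similarity
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]   # keep original sentence order

text = ("Text summarization is the task of producing a short version of a document. "
        "Extractive methods select existing sentences. "
        "Abstractive methods generate new sentences. "
        "TextRank is a popular unsupervised extractive method.")
print(textrank_summary(text, k=2))
```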
- Sentiment databases
- Movie reviews: IMDB reviews dataset on Kaggle
- Sentiwordnet – mapping wordnet senses to a polarity model: SentiWordnet Site
- Twitter airline sentiment on Kaggle
- First GOP Debate Twitter Sentiment
- Amazon fine foods reviews
- ** Many Sentiment tools,
- NTLK sentiment analyzer
- Vader (NTLK, standalone):
- Text BLob:
- Comparative opinion mining a review paper - has some info about unsupervised as well
- Another reference list, has some unsupervised.
- Sentiwordnet3.0 paper
- presentation
Reference papers:
- For sentiment In Vader -
- “Screening for English language reading comprehension – each rater had to individually score an 80% or higher on a standardized college-level reading comprehension test.
- Complete an online sentiment rating training and orientation session, and score 90% or higher for matching the known (prevalidated) mean sentiment rating of lexical items which included individual words, emoticons, acronyms, sentences, tweets, and text snippets (e.g., sentence segments, or phrases).
- Every batch of 25 features contained five “golden items” with a known (pre-validated) sentiment rating distribution. If a worker was more than one standard deviation away from the mean of this known distribution on three or more of the five golden items, we discarded all 25 ratings in the batch from this worker.
- Bonus to incentivize and reward the highest quality work. Asked workers to select the valence score that they thought “most other people” would choose for the given lexical feature (early/iterative pilot testing revealed that wording the instructions in this manner garnered a much tighter standard deviation without significantly affecting the mean sentiment rating, allowing us to achieve higher quality (generalized) results while being more economical).
- Compensated AMT workers $0.25 for each batch of 25 items they rated, with an additional $0.25 incentive bonus for all workers who successfully matched the group mean (within 1.5 standard deviations) on at least 20 of 25 responses in each batch. Using these four quality control methods, we achieved remarkable value in the data obtained from our AMT workers – we paid incentive bonuses for high quality to at least 90% of raters for most batches.
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
- 1.6 million tweets labelled
- 13 languages
- Evaluated 6 pretrained classification models
- 10 CFV
- SVM / NB
- Annotator agreements.
- about 15% were intentionally duplicated to be annotated twice,
- by the same annotator
- by two different annotators
- Self-agreement from multiple annotations of the same annotator
- Inter-agreement from multiple annotations by different annotators
- The confidence intervals for the agreements are estimated by bootstrapping [12].
- It turns out that the self-agreement is a good measure to identify low quality annotators,
- the inter-annotator agreement provides a good estimate of the objective difficulty of the task, unless it is too low.
Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows:
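For reference, the standard definition is α = 1 − D_o / D_e, where D_o is the observed disagreement among annotators and D_e is the disagreement expected by chance; α = 1 means perfect agreement and α = 0 means agreement no better than chance.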
Method cont here in a second paper
A very good article about LSA (TF-IDF X SVD), pLSA, LDA, and LDA2VEC. Including code and explanation about dirichlet probability. Lda2vec code
A descriptive comparison for LSA pLSA and LDA
A great summation about topic modeling, Pros and Cons! (LSA, pLSA, LDA)
Word cloud for topic modelling
Sklearn LDA and NMF for topic modelling
Topic modelling with sentiment per topic according to the data in the topic
LDA is already taken by the above algorithm!
Latent Dirichlet allocation (LDA) - This algorithm takes a group of documents (anything that is made of up text), and returns a number of topics (which are made up of a number of words) most relevant to these documents.
- LDA is an example of topic modelling
- ?- can this be used for any set of features, not just text?
Medium Article about LDA and NMF (Non-negative Matrix factorization)+ code
Medium article on LDA - a good one with pseudo algorithm and proof
In case LDA groups together two topics, we can influence the algorithm in a way that makes those two topics separable - this is called Semi Supervised Guided LDA
- LDA tutorials plus code, used this to build my own classes - using the gensim mallet wrapper, doesn't work with pyLDAvis, so use this to fix it
- Introduction to LDA topic modelling, really good, plus git code
- Sklearn examples using LDA and NMF
- Tutorial on lda/nmf on medium - using tfidf matrix as input!
- Gensim and sklearn LDA variants, comparison, python 3
- Medium article on lda/nmf with code
- One of the best explanation about Tf-idf vs bow for LDA/NMF - tf for lda, tfidf for nmf, but tfidf can be used for top k selection in lda + visualization, important paper
- LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.
- NMF is in its general definition the search for 2 matrices W and H such that W*H=V where V is an observed matrix. The only requirement for those matrices is that all their elements must be non negative.
- From the above definitions it is clear that in LDA only bag of words frequency counts can be used since a vector of reals makes no sense. Did we create a word 1.2 times? On the other hand we can use any non negative representation for NMF and in the example tf-idf is used.
- As far as choosing the number of iterations, for the NMF in scikit learn I don't know the stopping criterion although I believe it is the relative improvement of the loss function being smaller than a threshold so you 'll have to experiment. For LDA I suggest checking manually the improvement of the log likelihood in a held out validation set and stopping when it falls under a threshold.
- The rest of the parameters depend heavily on the data so I suggest, as suggested by @rpd, that you do a parameter search.
- So to sum up, LDA can only generate frequencies and NMF can generate any non negative matrix.
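A short sketch of the point above - raw counts feeding LDA and TF-IDF feeding NMF in sklearn (toy corpus):

```python
# LDA on raw term counts vs. NMF on TF-IDF, as discussed above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell today", "investors sold shares and stocks"]

counts = CountVectorizer().fit_transform(docs)       # integer counts -> LDA
tfidf = TfidfVectorizer().fit_transform(docs)        # non-negative reals -> NMF

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)

print(lda.transform(counts).shape)   # doc x topic distributions
print(nmf.transform(tfidf).shape)    # doc x topic weights (W); components_ is H
```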
Very important:
- How to measure the variance for LDA and NMF, against PCA. 1. Variance score the transformation and inverse transformation of data, test for 1,2,3,4 PCs/LDs/NMs.
- Matching lda mallet performance with gensim and sklearn lda via hyper parameters
- What is LDA?
- It is unsupervised natively; it uses a joint probability method to find topics (the user has to pass the number of topics to the LDA API). If “Doc X word” is the size of the input data to LDA, it transforms it into 2 matrices:
- Doc X topic
- Word X topic
- further if you want, you can feed “Doc X topic” matrix to supervised algorithm if labels were given.
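A hedged sketch of feeding the “Doc X topic” matrix into a supervised model, here via an sklearn pipeline on made-up labels:

```python
# Using the "Doc x topic" matrix from LDA as features for a supervised classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cats and dogs", "pet food and toys", "stocks fell sharply", "markets and shares"]
labels = ["pets", "pets", "finance", "finance"]

pipe = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),  # outputs Doc x topic
    LogisticRegression(),
)
pipe.fit(docs, labels)
print(pipe.predict(["share prices and markets"]))
```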
- Medium on LDA, explains the random probabilistic nature of LDA
- Machinelearningplus on LDA in sklearn - a great read, don't forget to read the mallet article.
- Medium on LSA pLSA, LDA LDA2vec, high level theoretical - not clear
- Medium on LSI vs LDA vs HDP, HDP wins..
- Medium on LDA, some historical reference and general high level how-to-use examples.
- Incredibly useful response on LDA grid search params and about LDA expectations. Must read.
- Lda vs pLSA, talks about the sampling from a distribution of distributions in LDA
- BLog post on topic modelling - has some text about overfitting - undiscussed in many places.
- Perplexity vs coherence on held-out unseen data - not okay and okay, respectively, because of how the metrics are computed (i.e., read the formulas). Also this and this
- LDA as dimensionality reduction
- LDA on alpha and beta to control density of topics
- Jupyter notebook on hacknews LDA topic modelling - missing code?
- Jupyter notebook for kmeans, lda, svd,nmf comparison - advice is to keep nmf or other as a baseline to measure against LDA.
- Gensim on LDA with code
- Medium on lda with sklearn
- Selecting the number of topics in LDA, blog 1, blog2, using perplexity, prep and aic bic, coherence, coherence2, coherence 3 with tutorial, unclear, unclear with analysis of stopword % inclusion, unread, paper: heuristic approach, elbow method, using cv, Paper: new stability metric + gh code,
- Selecting the top K words in LDA
- Presentation: best practices for LDA
- Medium on guidedLDA - switching from LDA to a variation of it that is guided by the researcher / data
- Medium on lda - another introductory, la times
- Topic modelling through time
- Mallet vs nltk, params, params
- Paper: improving feature models
- Lda vs w2v (doesn't make sense to compare), again here
- Adding lda features to w2v for classification
- Spacy and gensim on 20 news groups
- The best topic modelling explanation including Usages, insights, a great read, with code - shows how to find similar docs by topic in gensim, and shows how to transform unseen documents and do similarity using sklearn:
- Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
- Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.
- Uncovering Themes in Texts – Useful for detecting trends in online publications for example
- A Form of Tagging - If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
- Topic Modelling for Feature Selection - Sometimes LDA can also be used as feature selection technique. Take an example of text classification problem where the training data contain category wise documents. If LDA is running on sets of category wise documents. Followed by removing common topic terms across the results of different categories will give the best features for a category.
How to interpret topics using pyldaviz:
Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:
- Larger topics are more frequent in the corpus.
- Topics closer together are more similar, topics further apart are less similar.
- When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
- Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
pyLDAvis - what am I looking at? by spacy
There are a lot of moving parts in the visualization. Here's a brief summary:
- On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)
- The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.
- The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
- An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
- On the right, there is a bar chart showing top terms.
- When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
- When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.
- Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
- Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic — i.e., terms that occur only in this topic, and do not occur in other topics.
- Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
- Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.
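A minimal sketch of producing the pyLDAvis view for a gensim LDA model; note the submodule is pyLDAvis.gensim_models in recent versions (pyLDAvis.gensim in older ones), and the corpus here is a toy:

```python
# Sketch: train a small gensim LDA model and prepare the pyLDAvis visualization.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis   # pyLDAvis.gensim in older versions

texts = [["cat", "dog", "pet"], ["stock", "market", "shares"], ["dog", "food", "pet"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                     passes=10, random_state=0)

vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")    # or pyLDAvis.display(vis) in a notebook
```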
- Another great article about LDA, including algorithm, parameters!! And
Parameters of LDA
- Alpha and Beta Hyperparameters – alpha represents document-topic density and Beta represents topic-word density. Higher the value of alpha, documents are composed of more topics and lower the value of alpha, documents contain fewer topics. On the other hand, higher the beta, topics are composed of a large number of words in the corpus, and with the lower value of beta, they are composed of few words.
- Number of Topics – Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics by using Kullback Leibler Divergence Score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to this[1] original paper on the use of KL divergence.
- Number of Topic Terms – Number of terms composed in a single topic. It is generally decided according to the requirement. If the problem statement talks about extracting themes or concepts, it is recommended to choose a higher number, if problem statement talks about extracting features or terms, a low number is recommended.
- Number of Iterations / passes – Maximum number of iterations allowed to LDA algorithm for convergence.
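For reference, a sketch of how the parameters above map onto gensim's LdaModel arguments (toy corpus; the 'auto' priors are one reasonable choice, not a recommendation):

```python
# How the LDA parameters map to gensim's LdaModel arguments.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "pet"], ["stock", "market", "shares"], ["dog", "food", "pet"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,        # number of topics
    alpha="auto",        # document-topic density (learned asymmetric prior)
    eta="auto",          # topic-word density (called beta in many papers)
    passes=10,           # passes over the corpus
    iterations=100,      # max inference iterations per document chunk
    random_state=0,
)
print(lda.print_topics(num_words=3))   # top terms per topic
```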
Ways to improve LDA:
- Reduce dimensionality of the document-term matrix
- Frequency filter
- POS filter
- Batch wise LDA
- History of LDA - by the French guy
- Diff between lda and mallet - The inference algorithms in Mallet and Gensim are indeed different. Mallet uses Gibbs Sampling which is more precise than Gensim's faster online Variational Bayes. There is a way to get comparable performance by increasing the number of passes.
- Mallet in gensim blog post
- Alpha beta in mallet: contribution
- The default for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.
- With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.
- The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.
- Multilingual - alpha is divided by topic count, reaffirms 7
- Topic modelling with lda and nmf on medium - has a very good simple example with probabilities
- Code: great for top docs, terms, topics etc.
- Great article: Many ways of evaluating topics by running LDA
- Youtube on LDAvis explained
- Presentation: More visualization options including ldavis
- A pointer to the ldaviz fix -> fix, git code
- Difference between lda in gensim and sklearn a post on rare
- The best code article on LDA/MALLET, and using sklearn (using clustering for getting group of sentences in each topic)
- LDA in gensim, a tutorial by gensim
- Lda on medium
- What are the pros and cons of LDA and NMF in topic modeling? Under what situations should we choose LDA or NMF? Is there comparison of two techniques in topic modeling?
- What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
- Exploring Topic Coherence over many models and many topics lda nmf svd, using umass and uci coherence measures
- *** Practical topic findings for short sentence text code
- What's the difference between SVD/NMF and LDA as topic model algorithms essentially? Deterministic vs prob based
- What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
- What are the relationships among NMF, tensor factorization, deep learning, topic modeling, etc.?
- Code: lda nmf
- Unread a comparison of lda and nmf
- Presentation: lda sparse coding matrix factorization
- An experimental comparison between NMF and LDA for active cross-situational object-word learning
- Topic coherence in gensom with jupyter code
- Topic modelling dynamic presentation
- Paper: Topic modelling and event identification from twitter data, says LDA vs NMI (NMF?) and using coherence to analyze
- Just another medium article about TM (topic modeling)
- What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE) LDADE's tunings dramatically reduces topic instability.
- Talk about topic modelling
- Intro to topic modelling
- Detecting topics in twitter github code
- Another topic model tutorial
- (didnt read) NTM - neural topic modeling using embedded spaces with github code
- Another lda tutorial
- Comparing tweets using lda
- Lda and w2v as features for some classification task
- Improving TM (topic modeling) with embeddings
- w2v/doc2v for topic clustering - need to see the code to understand how they got clean topics, i assume a human rewrote it
Topic coherence (lda/nmf)
- What is?, Wiki on pmi
- Datacamp on coherence metrics, a comparison, read me.
- Paper: explains what is coherence
- Umass vs C_v, what are the diff?
- Paper: umass, uci, nmpi, cv, cp etv Exploring the Space of Topic Coherence Measures
- Paper: Automatic evaluation of topic coherence
- Paper: exploring the space of topic coherence methods
- Paper: Relation between mutual information / entropy and pmi
- Stackexchange: coherence / pmi how to calc
- Paper: Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality - perplexity needs unseen data, coherence doesn't
- Evaluation of topic modelling techniques for twitter lda lda-u btm w2vgmm
- Paper: Topic coherence measures
- topic modelling from different domains
- Paper: Optimizing Semantic Coherence in Topic Models
- Paper: L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization
- Paper: Content matching between TV shows and advertisements through Latent Dirichlet Allocation
- Paper: Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation
- Paper: Evaluating topic coherence - Abstract: Topic models extract representative word sets—called topics—from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable, therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.
Conclusion: The results of the first experiment show that if we are using the one-any, any-any and one-all coherences directly for optimization they are leading to meaningful word sets. The second experiment shows that these coherence measures are able to outperform the UCI coherence as well as the UMass coherence on these generated word sets. For evaluating LDA topics any-any and one-any coherences perform slightly better than the UCI coherence. The correlation of the UMass coherence and the human ratings is not as high as for the other coherences.
- Code: Evaluating topic coherence, using gensim umass or cv parameter - To conclude, there are many other approaches to evaluate topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. The topic coherence measure is a good way to compare different topic models based on their human-interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score (see the sketch after this list).
- Formulas: UCI vs UMASS
- Inferring the number of topics for gensim's LDA - perplexity, CM, AIC, and BIC
- Perplexity as a measure for LDA
- Finding number of topics using perplexity
- Coherence for tweets
- Presentation: Twitter LDA, tweet pooling improvements, hierarchical summarization of tweets, twitter LDA in java on github. Papers: TM of twitter timeline, twitter aggregation by conversation, twitter topics using LDA, empirical study
- Using regularization to improve PMI score and in turn coherence for LDA topics
- Improving model precision - coherence using turkers for LDA
- Gensim - paper about their algorithm and PMI/UCI etc.
- Advice for coherence, then Good vs bad model (50 vs 1 iterations) measuring u_mass coherence - 2nd code - “In your data we can see that there is a peak between 0-100 and a peak between 400-500. What I would think in this case is that "does ~480 topics make sense for the kind of data I have?" If not, you can just do an np.argmax for 0-100 topics and trade-off coherence score for simpler understanding. Otherwise just do an np.argmax on the full set.”
- Diff term weighting schemas for topic modelling, code plus paper
- Workaround for pyLDAvis using LDA-Mallet
- pyLDAvis paper
- Visualizing LDA topics results
- Visualizing trends, topics, sentiment, heat maps, entities - really good
- Topic stability Metric, a novel method, compared against jaccard, spearman, silhouette.: Measuring LDA Topic Stability from Clusters of Replicated Runs
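The coherence sketch referenced above - comparing u_mass and c_v with gensim's CoherenceModel on a toy corpus:

```python
# Comparing u_mass and c_v coherence for a gensim LDA model (toy corpus).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["cat", "dog", "pet", "food"], ["stock", "market", "shares", "trade"],
         ["dog", "food", "pet", "toys"], ["market", "trade", "shares", "price"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for measure in ("u_mass", "c_v"):
    cm = CoherenceModel(model=lda, texts=texts, corpus=corpus,
                        dictionary=dictionary, coherence=measure)
    print(measure, cm.get_coherence())
```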
Medium Article about LDA and NMF (Non-negative Matrix factorization)+ code
- “if you want to rework your own topic models that, say, jointly correlate an article’s topics with votes or predict topics over users then you might be interested in lda2vec.”
- Datacamp intro
- Original blog - I just learned about these papers which are quite similar: Gaussian LDA for Topic Word Embeddings and Nonparametric Spherical Topic Modeling with Word Embeddings.
- Moody’s Slide Share (excellent read)
- Docs
- Original Git + Excellent notebook example
- Tf implementation, another more recent one tf 1.5
- Another blog explaining about lda etc, post, post
- Lda2vec in tf, tf 1.5,
- Comparing lda2vec to lda
- Youtube: lda/doc2vec with pca examples
- Example on gh on jupyter
- Git, paper
- Topic modeling with DistilBERT on medium, BERTopic!, c-TF-IDF, UMAP, HDBSCAN, merging similar topics, visualization, BERTopic (same method as the above)
- Medium with the same general method
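A minimal BERTopic sketch, assuming the bertopic package (and its sentence-transformers backend) is installed; 20 newsgroups is used only as an easily available corpus:

```python
# Minimal BERTopic usage sketch.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # topic sizes and representative c-TF-IDF terms
print(topic_model.get_topic(0))              # top words for topic 0
```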
- State of the art LSTM architectures using NN
- Medium: Ner free datasets and bilstm implementation using glove embeddings
Easy to implement in keras! They are based on the following paper
- Medium: NLTK entities, polyglot entities, sner entities, finally an ensemble method wins all!
- Comparison between spacy and SNER - for terms.
- *** Unsupervised NER using Bert
- Custom NER using spacy
- Spacy Ner with custom data
- How to create a NER from scratch using kaggle data, using crf, and analysing crf weights using external package
- Another comparison between spacy and SNER - both are the same, for many classes.
- Vidhya on spacy vs ner - tutorial + code on how to use spacy for pos, dep, ner, compared to nltk/corenlp (sner etc). The results reflect a global score, not one specific to LOC for example.
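A minimal spaCy NER sketch, assuming the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

```python
# Sketch: running spaCy's pretrained NER.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Tel Aviv next year.")

for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. Apple ORG, Tel Aviv GPE, next year DATE
```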
Stanford NER (SNER)
- SNER presentation - combines HMM and MaxEnt features, distributional features; NER has many applications.
- How to train SNER, a FAQ with many other answers (read first before doing anything with SNER)
- SNER demo - capital letters matter, a minimum of one.
- State of the art NER benchmark
- Review paper, SNER, spacy, stanford wins
- Review paper SNER, others on biographical text, stanford wins
- Another NER DL paper, 90%+
Spacy & Others
- Spacy - using prodigy and spacy to train a NER classifier using active learning
- Ner using DL BLSTM, using glove embeddings, using CRF layer against another CRF.
- Another medium paper on the BLSTM CRF with Guillaume's code
- Guillaume blog post, detailed explanation
- For Italian
- Another 90+ proposed solution
- A promising python implementation based on one or two of the previous papers
- Quora advise, the first is cool, the second is questionable
- Off the shelf solutions benchmark
- Parallel api talk about bilstm and their 2mil tagged ner model (washington passes)
- Bert search engine, cosine between paragraphs and question.
- Semantic search, autom completion, filtering, augmentation, scoring. Problems: Token matching, contextualization, query misunderstanding, image search, metric. Solutions: synonym generation, query autocompletion, alternate query generation, word and doc embedding, contextualization, ranking, ensemble, multilingual search
- Keras blog - char-level, token-using embedding layer, teacher forcing
- Teacher forcing explained
- Same as keras but with token-level
- Medium on char, word, byte-level
- Mastery on enc-dec using the keras method, and on neural translation
- Machine translation git from eng to jap, another, and its medium
- Incorporating Copying Mechanism in Sequence-to-Sequence Learning - In this paper, we incorporate copying into neural network-based Seq2Seq learning and propose a new model called CopyNet with encoder-decoder structure. CopyNet can nicely integrate the regular way of word generation in the decoder with the new copying mechanism which can choose sub-sequences in the input sequence and put them at proper places in the output sequence.
- Pythia, qna for images on colab
- Building a Q&A system part 1
- Building a Q&A model
- Vidhya on Q&A
- Q&A system using milvus - An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
- Q&A system
- Using RNN
- Using language modeling
- Word based vs char based - Word-based LMs display higher accuracy and lower computational cost than char-based LMs. However, char-based RNN LMs better model languages with a rich morphology such as Finnish, Turkish, Russian etc. Using word-based RNN LMs to model such languages is difficult if possible at all and is not advised. Char-based RNN LMs can mimic grammatically correct sequences for a wide range of languages, require a bigger hidden layer and are computationally more expensive, while word-based RNN LMs train faster and generate more coherent texts, and yet even these generated texts are far from making actual sense.
- Medium on char-based with code, leads to better grammar
- Git, keras language models, char level word level and sentence using VAE
- A qualitative comparison of google, azure, amazon, ibm LD LI
- CLD2, CLD3, PYCLD2, POLYGLOT wraps CLD, alex ott cld stats, cld comparison vs tika langid
- Fast text LI, facebook post
- OPENNLP
- Google detect language, github code, v3beta
- Microsoft azure LD, 2
- Ibm watson, 2
- Amazon, 2
- Lingua - most accurate for java… doesn't seem like its accurate enough
- LD with infinity gram 99.1 on a lot of data a benchmark for this 2012 method, LD with infinity gram
- WiLI dataset for LD, comparison of CLD vs others
- Comparison of CLD vs FT vs OPEN NLP - beware based on 200 samples per language!!
Full results for every language that I tested are in table at the end of blog post & on Github. From them I can make following conclusions:
- all detectors are equally good on some languages, such as, Japanese, Chinese, Vietnamese, Greek, Arabic, Farsi, Georgian, etc. - for them the accuracy of detection is between 98 & 100%;
- CLD is much better in detection of "rare" languages, especially for languages, that are similar to more frequently used - Afrikaans vs Dutch, Azerbaijani vs. Turkish, Malay vs. Indonesian, Nepali vs. Hindi, Russian vs Bulgarian, etc. (it could be result of imbalance of training data - I need to check the source dataset);
- for "major" languages not mentioned above (English, French, German, Spanish, Portuguese, Dutch) the fastText results are much better than CLD's, and in many cases lingid.py's & OpenNLP's;
- for many languages results for "compressed" fastText model are slightly worse than results from "full" model (mostly only by 1-2%, but could be higher, like for Kazakh when difference is 33%), but there are languages where the situation is different - results for compressed are slight better than for full (for example, for German or Dutch);
OpenNLP has many misclassifications for Cyrillic languages - Russian/Ukrainian, ...
Rafael Oliveira posted on FB a simple diagram that shows what languages are detected better by CLD & what is better handled by fastText
Here are some additional notes about differences in behavior of detectors that I observe during analyzing results:
- fastText is more reliable than CLD on the short texts;
- fastText models & langid.py detect language as Hebrew instead of Jewish as in CLD. Similarly, CLD uses 'in' for Indonesian language instead of standard 'id' used by fastText & langid.py;
- fastText distinguish between Cyrillic- & Latin-based versions of Serbian;
- CLD tends to incorporate geographical & person's names into detection results - for example, blog post in German about travel to Iceland is detected as Icelandic, while fastText detects it as German;
- In extended detection mode CLD tends to select more rare language, like, Galician or Catalan over Spanish, Serbian instead of Russian, etc.
- OpenNLP isn't very good in detection for short texts.
The models released by fastText development team provides very good alternative to existing language detection tools, like, Google's CLD & langid.py - for most of "popular" languages, these models provides higher detection accuracy comparing to other tools, combined with high speed of detection (drawback of langid.py). Even using "compressed" model it's possible to reach good detection accuracy. Although for some less frequently used languages, CLD & langid.py may show better results.
Performance-wise, the langid.py is much slower than both CLD & fastText. On average, CLD requires 0.5-1 ms to perform language detection. For fastText & langid.py I don't have precise numbers yet, only approximates based on speed of execution of corresponding programs.
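A small sketch of language identification with langdetect and fastText's pretrained LID model, assuming both packages are installed and lid.176.bin has been downloaded from the fastText site:

```python
# Sketch: language identification with langdetect and fastText's LID model.
from langdetect import detect, DetectorFactory
import fasttext

DetectorFactory.seed = 0                      # make langdetect deterministic
print(detect("Ceci est une phrase en français."))          # -> 'fr'

ft_model = fasttext.load_model("lid.176.bin")               # pretrained LID model file
labels, probs = ft_model.predict("Dies ist ein deutscher Satz.", k=2)
print(labels, probs)                          # -> ('__label__de', ...) with confidences
```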
GIT:
Articles:
Papers:
- A comparison of lang ident approaches
- LI on code-switched social media
- Comparing LI methods, has 6 big languages
- Comparing LI techniques
- Radim rehurek LI on the web extending the dictionary
- Comparative study of LI methods, 2
- State of the art methods for neural machine translation - a review of papers
- LASER: Zero shot multi lang-translation by facebook, github
- How to use laser on medium
- Stanford coreNLP language POS/NER/DEP PARSE etc for 53 languages
- Using embedding spaces w2v by gensim
- The risk of using bleu
- Really good: BLEU - what is it, how is it calculated?
“[BLEU] looks at the presence or absence of particular words, as well as the ordering and the degree of distortion—how much they actually are separated in the output.”
BLEU’s evaluation system requires two inputs: (i) a numerical translation closeness metric, which is then assigned and measured against (ii) a corpus of human reference translations.
BLEU averages out various metrics using an n-gram method, a probabilistic language model often used in computational linguistics.
The result is typically measured on a 0 to 1 scale, with 1 as the hypothetical “perfect” translation. Since the human reference, against which MT is measured, is always made up of multiple translations, even a human translation would not score a 1, however. Sometimes the score is expressed as multiplied by 100 or, as in the case of Google mentioned above, by 10.
a BLEU score offers more of an intuitive rather than an absolute meaning and is best used for relative judgments: “If we get a BLEU score of 35 (out of 100), it seems okay, but it actually has no correlation to the quality of the output in any meaningful sense. If it’s less than 15, we can probably safely say it’s very bad. If it’s greater than 60, we probably have some mistake in our testing! So it will generally fall in there.”
“Typically, if you have multiple [human translation] references, the BLEU score tends to be higher. So if you hear a very large BLEU score—someone gives you a value that seems very high—you can ask them if there are multiple references being used; because, then, that is the reason that the score is actually higher.”
- General talk about FAMG (fb, ama, micro, goog) and research direction atm, including some info about BLEU scores and the comparison issues with reports of BLEU (boils down to different unmentioned parameters)
- One proposed solution is sacreBLEU, pip install sacrebleu
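A minimal sacreBLEU sketch (corpus-level BLEU on the 0-100 scale discussed above; hypotheses and references are made up):

```python
# Computing corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["the cat is on the mat", "there is a cat on the mat"]
# each inner list is one full set of references, aligned with the hypotheses
references = [["the cat is on the mat", "there is a cat on the mat"],
              ["a cat sits on the mat", "a cat is on the mat"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # reported on a 0-100 scale, as discussed above
```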
Named entity language transliteration
- Why re is slow
- Benchmark, comparisons, more, many more,
- Split on separator but keep the separator, in Python
- Semantic versioning
- Hyperscan - Hyperscan (paper) is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre library, but is a standalone library with its own C API.
- Re2 - python This is the source code repository for RE2, a regular expression library.
- Spacy’s Matcher & “regex”
- Flashtext- This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm.
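A small FlashText sketch, assuming the flashtext package is installed (keywords are illustrative):

```python
# Sketch: FlashText keyword extraction and replacement (pip install flashtext).
from flashtext import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword("New York", "NY")            # canonical form returned on match
kp.add_keyword("machine learning")

text = "I studied machine learning in New York."
print(kp.extract_keywords(text))            # ['machine learning', 'NY']
print(kp.replace_keywords(text))            # 'I studied machine learning in NY.'
```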