- A powerful benchmark, paper, medium - normalizing datasets shows that many NLP algorithms have not actually advanced in terms of metrics.
- Vidhya on spacy vs ner - tutorial + code on how to use spacy for pos, dep, ner, compared to nltk/corenlp (sner etc). The results reflect a global score, not one specific to LOC for example.
- The spaCy course
- SPACY OPTIMIZATION - LP using CYTHON and SPACY.
- The big bad 600, medium
- Has all the known libraries
- Comparison between spacy, pytorch, allenlp - very basic info
- Comparison spacy,nltk core nlp
- Comparing Production grade nlp libs
- nltk vs spacy
- Fb’s laser
- Xlm, xlm-r
- Google universal embedding space.
- Synonyms, similar embedded words (w2v), back translation, contextualized word embeddings, text generation
- Yonatan hadar also has a medium post about this
TF-IDF - how important is a word to a document in a corpus
TF(t) = (number of times term t appears in a document) / (total number of terms in the document), i.e., the frequency of the word within the document, normalized because documents have different sizes.
IDF(t) = log_e(total number of documents / number of documents containing term t), which measures how important (rare) a term is across the corpus.
TF-IDF(t) = TF(t) * IDF(t).
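A minimal sketch (toy corpus, not from any of the linked articles) of computing TF-IDF with sklearn's TfidfVectorizer; note sklearn uses a smoothed IDF variant by default:

```python
# Minimal TF-IDF illustration with scikit-learn (toy corpus for demonstration).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"]

vectorizer = TfidfVectorizer()           # default idf is smoothed: ln((1+n)/(1+df)) + 1
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

# Inspect the weight of each term in the first document
for term, idx in vectorizer.vocabulary_.items():
    weight = tfidf[0, idx]
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```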
Data sets:
- mean(IDF(i) * w2v word vectors (i)) with or without reducing PC1 from the whole w2 average (amir pupko)
import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # mean of (optionally IDF-weighted) word vectors; model is a gensim KeyedVectors-like object
    if words:
        return np.mean(idf * model[words], axis=0)
    else:
        print('we have an empty list')
        return []

idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(lambda x: [idf_mapping[token] for token in splitter(x)])
logs_sequences_df['mean_weighted_idf_w2v'] = [
    mean_weighted_embedding(ft,
                            splitter(logs_sequences_df['message'].iloc[i]),
                            1 / np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1, 1))
    for i in range(logs_sequences_df.shape[0])
]
- Multiply by TFIDF
- Enriching using lstm-next word (char or word-wise)
- Using external wiktionary/pedia data for certain words, phrases
- Finding clusters of relevant data and figuring out if you can enrich based on the content of the clusters
- Applying deep nlp methods without big data, i.e., sparseness
- Benchmarking tokenizers for optimal processing speed
- Using nltk with gensim
- Multiclass text classification with svm/nb/mean w2v/d2v - tutorial with code and notebook.
- Basic pipeline for keyword extraction
- DL for text classification
- Logistic regression with word ngrams
- Logistic regression with character ngrams
- Logistic regression with word and character ngrams
- Recurrent neural network (bidirectional GRU) without pre-trained embeddings
- Recurrent neural network (bidirectional GRU) with GloVe pre-trained embeddings
- Multi channel Convolutional Neural Network
- RNN (Bidirectional GRU) + CNN model
- LexNLP - glorified regex extractor
- How to convert between verb/noun/adjective/adverb forms using Wordnet
- Complete guide for training your own Part-Of-Speech Tagger - using the Penn Treebank tagset. Using nltk or stanford pos taggers, creating features from the actual words (manual stemming, etc.), using the tags as labels for a random forest, thus creating a POS classifier of our own. Not entirely sure why we need to create a classifier from a “classifier”.
- Word net introduction - POS, lemmatize, synon, antonym, hypernym, hyponym
- Sentence similarity using wordnet - using synonyms cumsum for comparison. Today replaced with w2v mean sentence similarity.
- Stemmers vs lemmatizers - stemmers are faster, lemmatizers are POS / dictionary based, slower, converting to base form.
- Chunking - shallow parsing, compared to deep, similar to NER
- NER - using nltk chunking as a labeller for a classifier, training one of our own. Using IOB features as well as others to create a new NER classifier which should be better than the original by using additional features. Also uses a new English dataset, GMB.
- Building nlp pipelines, functions coroutines etc..
- Training ner using generators
- Metrics, tp/fp/recall/precision/micro/weighted/macro f1
- Tf-idf
- Nltk for beginners
- Nlp corpora corpuses
- bow/bigrams
- Textrank
- Word cloud
- Topic modelling using gensim, lsa, lsi, lda,hdp
- Spacy full tutorial
- POS using CRF
- Python Module to get Meanings, Synonyms and what not for a given word using vocabulary (also a comparison against word net) https://vocabulary.readthedocs.io/en/…
For a given word, using Vocabulary, you can get its
- Meaning
- Synonyms
- Antonyms
- Part of speech : whether the word is a noun, interjection or an adverb, etc.
- Translate : Translate a phrase from a source language to the desired language.
- Usage example : a quick example on how to use the word in a sentence
- Pronunciation
- Hyphenation : shows the particular stress points(if any)
Swiss army knife libraries
- textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spacy library. With the fundamentals — tokenization, part-of-speech tagging, dependency parsing, etc. — delegated to another library, textacy focuses on the tasks that come before and follow after.
Collocation
- What is collocation? - “the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance.” Medium tutorial, quite good, comparing freq/t-test/pmi/chi2 with github code
- A website dedicated to collocations, methods, references, metrics.
- Text analysis for sentiment, doing feature selection a tutorial with chi2(IG?), part 2 with bi-gram collocation in ntlk
- Text2vec in R - has ideas on how to use collocations for downstream tasks, LDA, W2V, etc. Also explains PMI and other metrics; note that the gensim metric is unsupervised and probabilistic.
- NLTK on collocations
- A blog post about keeping or removing stopwords for collocation, useful but no firm conclusion. IMO we should remove them beforehand.
- A blog post with code of using nltk-based collocation
- Small code for using nltk collocation
- Another code / score example for nltk collocation
- Jupyter notebook on manually finding collocation - not useful
- Paper: Ngram2Vec - Github We introduce ngrams into four representation methods. The experimental results demonstrate ngrams’ effectiveness for learning improved word representations. In addition, we find that the trained ngram embeddings are able to reflect their semantic meanings and syntactic patterns. To alleviate the costs brought by ngrams, we propose a novel way of building co-occurrence matrix, enabling the ngram-based models to run on cheap hardware
- Youtube on bigrams, collocation, mutual info and collocation
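A small sketch of scoring bigram collocations with NLTK's collocation finder (the Brown corpus here is just for illustration):

```python
# Sketch: scoring bigram collocations with NLTK (PMI vs. likelihood ratio).
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("brown", quiet=True)
words = [w.lower() for w in nltk.corpus.brown.words() if w.isalpha()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                 # drop bigrams seen fewer than 5 times

measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))       # top bigrams by pointwise mutual information
print(finder.nbest(measures.likelihood_ratio, 10))
```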
Language detection
- Using google lang detect - 55 languages af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Stemming
How to measure a stemmer?
Phrase modelling
- Phrase Modeling - using gensim and spacy
Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than we would expect by random chance. The formula our phrase models will use to determine whether two tokens A and B constitute a phrase is:
(count(A B) − count_min) / (count(A) × count(B)) × N > threshold
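A minimal sketch of the same idea with gensim's Phrases, whose default scorer implements the count-based formula above (min_count plays the role of count_min); the toy sentences and thresholds are illustrative only:

```python
# Sketch: learning bigram phrases with gensim's Phrases (count-based scoring as above).
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city", "never", "sleeps"],
]

phrases = Phrases(sentences, min_count=1, threshold=0.5)  # toy values for a toy corpus
bigram = Phraser(phrases)                                 # lighter, faster applier

# Tokens that score above the threshold get joined, e.g. 'new_york'
print(bigram[["i", "moved", "to", "new", "york"]])
```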
Document classification
Hebrew NLP tools
- HebMorph last update 7y ago
- Hebmorph elastic search, Hebmorph blog post, and other blog posts, youtube
- Awesome hebrew nlp git, git
- Hebrew-nlp service docs, the features (morphological analysis, normalization etc), git
- Apache solr stop words (dead)
- SO on hebrew analyzer/stemming, here too
- Neural sentiment benchmark using two algorithms, for character and word level lstm/gru - the paper
- Hebrew word embeddings
- Paper for rich morphological datasets for comparison - rivlin
Semantic roles:
- Snorkel - using weak supervision to create less noisy labelled datasets
- Snorkel metal weak supervision for multi-task learning. Conversation, git
- Yes, the Snorkel project has included work before on hierarchical labeling scenarios. The main papers detailing our results include the DEEM workshop paper you referenced (https://dl.acm.org/doi/abs/10.1145/3209889.3209898) and the more complete paper presented at AAAI (https://arxiv.org/abs/1810.02840). Before the Snorkel and Snorkel MeTaL projects were merged in Snorkel v0.9, the Snorkel MeTaL project included an interface for explicitly specifying hierarchies between tasks which was utilized by the label model and could be used to automatically compile a multi-task end model as well (demo here: https://github.com/HazyResearch/metal/blob/master/tutorials/Multitask.ipynb). That interface is not currently available in Snorkel v0.9 (no fundamental blockers; just hasn't been ported over yet).
- There are, however, still a number of ways to model such situations. One way is to treat each node in the hierarchy as a separate task and combine their probabilities post-hoc (e.g., P(credit-request) = P(billing) * P(credit-request | billing)). Another is to treat them as separate tasks and use a multi-task end model to implicitly learn how the predictions of some tasks should affect the predictions of others (e.g., the end model we use in the AAAI paper). A third option is to create a single task with all the leaf categories and modify the output space of the LFs you were considering for the higher nodes (the deeper your hierarchy is or the larger the number of classes, the less apppealing this is w/r/t to approaches 1 and 2).
- mechanical turk calculator
- Mturk alternatives
- Workforce / onespace
- Jobby
- Shorttask
- Samasource
- Figure 8 - pricing - definite guide
- Brat nlp annotation tool
- Prodigy by spacy,
- Doccano - prodigy open source alternative but with user management & statistics out of the box
- Medium Lighttag - has some cool annotation metrics\tests
- Loopr.ai - An AI powered semi-automated and automated annotation process for high quality data. Object detection, analytics, nlp, active learning.
- Medium Assessing annotator disagreement
- A great python package for measuring disagreement on GH
- Reliability is key, and not just mechanical turk
- 7 myths about annotation
- Annotating twitter sentiment using humans, 3 classes, 55% accuracy using SVMs. they talk about inter agreement etc. and their DS is partially publicly available.
- Exploiting disagreement
- Vader annotation
- They must pass an english exam
- They get control questions to establish their reliability
- They get a few sentences over and over again to establish inter disagreement
- Two or more people get overlapping sentences to establish disagreement
- 5 judges for each sentence (makes 4 useless)
- They dont know each other
- Simple rules to follow
- Random selection of sentences
- Even classes
- No experts
- Measuring reliability kappa/the other kappa.
- Label studio
Ideas:
- Active learning for a group (or single) of annotators, we have to wait for all annotations to finish each big batch in order to retrain the model.
- Annotate a small group, automatic labelling using knn
- Find the nearest neighbors for our optimal set of keywords per “category”
- For a group of keywords, find their knn neighbors in w2v-space, alternatively find k clusters in w2v space that has those keywords. For a new word/mean sentence vector in the ‘category’ find the minimal distance to the new cluster (either one of approaches) and this is new annotation.
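A hedged sketch of the kNN-propagation idea above; embed() is a placeholder for whatever sentence embedding you actually use (mean w2v, sentence vectors, etc.), and the texts/labels are made up:

```python
# Sketch of "annotate a small set, propagate labels with kNN in embedding space".
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

labeled_texts = ["refund my order", "app crashes on login", "great service"]
labels = ["billing", "bug", "praise"]
unlabeled_texts = ["i want my money back", "login screen freezes"]

def embed(texts):
    # placeholder: replace with mean w2v / sentence-embedding vectors of your choice
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 300))

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(embed(labeled_texts), labels)
print(knn.predict(embed(unlabeled_texts)))   # proposed (weak) annotations to review
```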
- Myth One: One Truth Most data collection efforts assume that there is one correct interpretation for every input example.
- Myth Two: Disagreement Is Bad To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
- Myth Three: Detailed Guidelines Help When specific cases continuously cause disagreement, more instructions are added to limit interpretations.
- Myth Four: One Is Enough Most annotated examples are evaluated by one person.
- Myth Five: Experts Are Better Human annotators with domain knowledge provide better annotated data.
- Myth Six: All Examples Are Created Equal The mathematics of using ground truth treats every example the same; either you match the correct result or not.
- Myth Seven: Once Done, Forever Valid Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
- Conclusions:
- Experts are the same as a crowd
- Costs a lot less $$$.
*** The best tutorial on agreements, cohen, david, kappa, krip etc.
- Cohens kappa (two people)
but you can use it to map a group by calculating agreement for each pair
- Why cohens kappa should be avoided as a performance measure in classification
- Why it should be used as a measure of classification
- Kappa in plain english
- Multilabel using kappa
- Kappa and the relation with accuracy (redundant, % above chance, should not be used due to other reasons researched here)
The Kappa statistic varies from 0 to 1, where.
- 0 = agreement equivalent to chance.
- 0.1 – 0.20 = slight agreement.
- 0.21 – 0.40 = fair agreement.
- 0.41 – 0.60 = moderate agreement.
- 0.61 – 0.80 = substantial agreement.
- 0.81 – 0.99 = near perfect agreement
- 1 = perfect agreement.
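A minimal example of Cohen's kappa for two raters with sklearn (labels are illustrative):

```python
# Cohen's kappa for two annotators over the same items.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "neg", "neg", "pos", "neu", "pos"]
rater_b = ["pos", "neg", "pos", "pos", "neu", "neg"]

print(cohen_kappa_score(rater_a, rater_b))   # 1.0 = perfect, 0 = chance-level agreement
```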
- Fleiss’ kappa, from 3 people and above.
Kappa ranges from 0 to 1, where:
- 0 is no agreement (or agreement that you would expect to find by chance),
- 1 is perfect agreement.
- Fleiss’s Kappa is an extension of Cohen’s kappa for three raters or more. In addition, the assumption with Cohen’s kappa is that your raters are deliberately chosen and fixed. With Fleiss’ kappa, the assumption is that your raters were chosen at random from a larger population.
- Kendall’s Tau is used when you have ranked data, like two people ordering 10 candidates from most preferred to least preferred.
- Krippendorff’s alpha is useful when you have multiple raters and multiple possible ratings.
- Krippendorfs alpha
- Ignores missing data entirely.
- Can handle various sample sizes, categories, and numbers of raters.
- Applies to any measurement level (i.e. (nominal, ordinal, interval, ratio).
- Values range from 0 to 1, where 0 is perfect disagreement and 1 is perfect agreement. Krippendorff suggests: “[I]t is customary to require α ≥ .800. Where tentative conclusions are still acceptable, α ≥ .667 is the lowest conceivable limit (2004, p. 241).”
- Supposedly multi label
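A sketch for three or more raters, assuming the statsmodels and krippendorff packages are installed (ratings are illustrative):

```python
# Fleiss' kappa (statsmodels) and Krippendorff's alpha (krippendorff package).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# rows = items, columns = raters, values = category ids
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])

table, _ = aggregate_raters(ratings)          # items x categories count table
print("Fleiss kappa:", fleiss_kappa(table))

# krippendorff expects raters x items (use np.nan for missing ratings)
print("Krippendorff alpha:",
      krippendorff.alpha(reliability_data=ratings.T, level_of_measurement="nominal"))
```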
- MACE - the new kid on the block. It learns in an unsupervised fashion to:
- a) identify which annotators are trustworthy and
- b) predict the correct underlying labels. We match performance of more complex state-of-the-art systems and perform well even under adversarial conditions
- MACE does exactly that. It tries to find out which annotators are more trustworthy and upweighs their answers.
- Git -
When evaluating redundant annotations (like those from Amazon's Mechanical Turk), we usually want to
- aggregate annotations to recover the most likely answer
- find out which annotators are trustworthy
- evaluate item and task difficulty
MACE solves all of these problems by learning competence estimates for each annotator and computing the most likely answer based on those competences.
Calculating agreement
- Compare against researcher-ground-truth
- Self-agreement
- Inter-agreement
- Medium
- Kappa cohen
- Multi annotator with kappa (which isn't), is this okay?
- Github: compute Fleiss Kappa 1
- Fleiss Kappa Example
- GWET AC1, paper: as an alternative to kappa, and why
- Website, Krippendorff vs Fleiss calculator
Machine Vision annotation
Troubling shooting agreement metrics
- Imbalanced data sets, i.e., why is reliability so low when the percentage of agreement is high?
- Interpretation of kappa values
- Interpreting agreement, Accuracy precision kappa
- Cnn for text - tal perry
- 1D CNN using KERAS
- Automatic creation of KG using spacy and networkx. Knowledge graphs can be constructed automatically from text using part-of-speech and dependency parsing. The extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library spaCy.
- Medium on Reconciling your data and the world of knowledge graphs
- Medium Series:
-
- Email summarization but with a great intro (see image above)
- With nltk - words assigned weighted frequency, summed up in sentences and then selected based on the top K scored sentences.
- Awesome-text-summarization on github
- Methodical review of abstractive summarization
- Medium on extractive and abstractive - overview with the abstractive code
- NAMAS - Neural Attention Model for Abstractive Sentence Summarization - summarizes single sentences quite well, github
- Abstractive vs extractive, blue intro
- Intro to text summarization
- Paper: survey on text summ, arxiv
- Very short intro
- Intro on encoder decoder
- Unsupervised methods using sentence embeddings (long and good) - using sent2vec, clustering, picking by rank
- Abstractive summarization using bert for sota
- Abstractive
- Git1: uses pytorch 0.7, fails to work no matter what i did
- Git2, keras code for headlines, missing dataset
- Encoder decoder in keras using rnn, claims cherry picked results, the majority is prob not as good
- A lot of Text summarization algos on git, using seq2seq, using many methods, glove, etc -
- Summarization with point generator networks on git
- Summarization based on gigaword claims SOTA
- Facebooks neural attention network NAMAS on git
- Medium on summarization with TensorFlow on news articles from cnn
- Keywords extraction
- The best text rank presentation
- Text rank by gensim on medium
- Text rank 2
- Text rank - custom code, extractive vs abstractive, how to use, some more theoretical info and page rank intuition.
- Text rank paper
- Improving textrank using adjectival and noun compound modifiers
- Unread - New similarity function paper for textrank
- Extractive summarization
- Text rank with glove vectors instead of tf-idf as in the paper (sam)
- Medium with code on extractive using word occurrence similarity + cosine, pick top based on rank
- Medium on methods, freq, LSA, linking words, sentences,bayesian, graph ranking, hmm, crf,
- Wiki on automatic summarization, abstractive vs extractive,
- Pyteaser, textteaser, lexrank, pytextrank summarization models & rouge-1/n and BLEU metrics to determine quality of summarization models. Bottom line is that textrank is competitive to sumy_lex
- Sumy
- Pyteaser
- Pytextrank
- Lexrank
- Gensim tutorial on textrank
- Email summarization
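A minimal TextRank-style extractive summarizer sketch (TF-IDF sentence similarity + PageRank via networkx); this illustrates the idea, it is not any specific library's implementation:

```python
# Minimal TextRank-style extractive summarization: build a sentence-similarity
# graph (TF-IDF + cosine) and rank sentences with PageRank.
import nltk
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def textrank_summary(text, k=2):
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)                 # sentence x sentence similarity
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]   # keep original sentence order

text = ("Text summarization is the task of producing a short version of a document. "
        "Extractive methods select existing sentences. "
        "Abstractive methods generate new sentences. "
        "TextRank is a popular unsupervised extractive method.")
print(textrank_summary(text, k=2))
```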
- Sentiment databases
- Movie reviews: IMDB reviews dataset on Kaggle
- Sentiwordnet – mapping wordnet senses to a polarity model: SentiWordnet Site
- Twitter airline sentiment on Kaggle
- First GOP Debate Twitter Sentiment
- Amazon fine foods reviews
- ** Many Sentiment tools,
- NTLK sentiment analyzer
- Vader (NTLK, standalone):
- Text BLob:
- Comparative opinion mining a review paper - has some info about unsupervised as well
- Another reference list, has some unsupervised.
- Sentiwordnet3.0 paper
- presentation
Reference papers:
- For sentiment In Vader -
- “Screening for English language reading comprehension – each rater had to individually score an 80% or higher on a standardized college-level reading comprehension test.
- Complete an online sentiment rating training and orientation session, and score 90% or higher for matching the known (prevalidated) mean sentiment rating of lexical items which included individual words, emoticons, acronyms, sentences, tweets, and text snippets (e.g., sentence segments, or phrases).
- Every batch of 25 features contained five “golden items” with a known (pre-validated) sentiment rating distribution. If a worker was more than one standard deviation away from the mean of this known distribution on three or more of the five golden items, we discarded all 25 ratings in the batch from this worker.
- Bonus to incentivize and reward the highest quality work. Asked workers to select the valence score that they thought “most other people” would choose for the given lexical feature (early/iterative pilot testing revealed that wording the instructions in this manner garnered a much tighter standard deviation without significantly affecting the mean sentiment rating, allowing us to achieve higher quality (generalized) results while being more economical).
- Compensated AMT workers $0.25 for each batch of 25 items they rated, with an additional $0.25 incentive bonus for all workers who successfully matched the group mean (within 1.5 standard deviations) on at least 20 of 25 responses in each batch. Using these four quality control methods, we achieved remarkable value in the data obtained from our AMT workers – we paid incentive bonuses for high quality to at least 90% of raters for most batches.
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
- 1.6 million tweets labelled
- 13 languages
- Evaluated 6 pretrained classification models
- 10 CFV
- SVM / NB
- Annotator agreements.
- about 15% were intentionally duplicated to be annotated twice,
- by the same annotator
- by two different annotators
- Self-agreement from multiple annotations of the same annotator
- Inter-agreement from multiple annotations by different annotators
- The confidence intervals for the agreements are estimated by bootstrapping [12].
- It turns out that the self-agreement is a good measure to identify low quality annotators,
- the inter-annotator agreement provides a good estimate of the objective difficulty of the task, unless it is too low.
Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows:
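For reference, the standard definition is α = 1 − D_o / D_e, where D_o is the observed disagreement among annotators and D_e is the disagreement expected by chance; α = 1 means perfect agreement and α = 0 means agreement no better than chance.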
Method cont here in a second paper
A very good article about LSA (TF-IDF X SVD), pLSA, LDA, and LDA2VEC. Including code and explanation about dirichlet probability. Lda2vec code
A descriptive comparison for LSA pLSA and LDA
A great summation about topic modeling, Pros and Cons! (LSA, pLSA, LDA)
Word cloud for topic modelling
Sklearn LDA and NMF for topic modelling
Topic modelling with sentiment per topic according to the data in the topic
LDA is already taken by the above algorithm!
Latent Dirichlet allocation (LDA) - This algorithm takes a group of documents (anything that is made of up text), and returns a number of topics (which are made up of a number of words) most relevant to these documents.
- LDA is an example of topic modelling
- ?- can this be used for any set of features, not just text?
Medium Article about LDA and NMF (Non-negative Matrix factorization)+ code
Medium article on LDA - a good one with pseudo algorithm and proof
In case LDA groups together two topics, we can influence the algorithm in a way that makes those two topics separable - this is called Semi Supervised Guided LDA
- LDA tutorials plus code, used this to build my own classes - using the gensim mallet wrapper, doesn't work with pyLDAvis, so use this to fix it
- Introduction to LDA topic modelling, really good, plus git code
- Sklearn examples using LDA and NMF
- Tutorial on lda/nmf on medium - using tfidf matrix as input!
- Gensim and sklearn LDA variants, comparison, python 3
- Medium article on lda/nmf with code
- One of the best explanation about Tf-idf vs bow for LDA/NMF - tf for lda, tfidf for nmf, but tfidf can be used for top k selection in lda + visualization, important paper
- LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.
- NMF is in its general definition the search for 2 matrices W and H such that W*H=V where V is an observed matrix. The only requirement for those matrices is that all their elements must be non negative.
- From the above definitions it is clear that in LDA only bag of words frequency counts can be used since a vector of reals makes no sense. Did we create a word 1.2 times? On the other hand we can use any non negative representation for NMF and in the example tf-idf is used.
- As far as choosing the number of iterations, for the NMF in scikit learn I don't know the stopping criterion although I believe it is the relative improvement of the loss function being smaller than a threshold so you 'll have to experiment. For LDA I suggest checking manually the improvement of the log likelihood in a held out validation set and stopping when it falls under a threshold.
- The rest of the parameters depend heavily on the data so I suggest, as suggested by @rpd, that you do a parameter search.
- So to sum up, LDA can only generate frequencies and NMF can generate any non negative matrix.
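A short sketch of the point above - raw counts feeding LDA and TF-IDF feeding NMF in sklearn (toy corpus):

```python
# LDA on raw term counts vs. NMF on TF-IDF, as discussed above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell today", "investors sold shares and stocks"]

counts = CountVectorizer().fit_transform(docs)       # integer counts -> LDA
tfidf = TfidfVectorizer().fit_transform(docs)        # non-negative reals -> NMF

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)

print(lda.transform(counts).shape)   # doc x topic distributions
print(nmf.transform(tfidf).shape)    # doc x topic weights (W); components_ is H
```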
Very important:
- How to measure the variance for LDA and NMF, against PCA. 1. Variance score the transformation and inverse transformation of data, test for 1,2,3,4 PCs/LDs/NMs.
- Matching lda mallet performance with gensim and sklearn lda via hyper parameters
- What is LDA?
- It is unsupervised natively; it uses a joint probability method to find topics (the user has to pass the number of topics to the LDA API). If “Doc X word” is the size of the input data to LDA, it transforms it into 2 matrices:
- Doc X topic
- Word X topic
- further if you want, you can feed “Doc X topic” matrix to supervised algorithm if labels were given.
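A hedged sketch of feeding the “Doc X topic” matrix into a supervised model, here via an sklearn pipeline on made-up labels:

```python
# Using the "Doc x topic" matrix from LDA as features for a supervised classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cats and dogs", "pet food and toys", "stocks fell sharply", "markets and shares"]
labels = ["pets", "pets", "finance", "finance"]

pipe = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),  # outputs Doc x topic
    LogisticRegression(),
)
pipe.fit(docs, labels)
print(pipe.predict(["share prices and markets"]))
```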
- Medium on LDA, explains the random probabilistic nature of LDA
- Machinelearningplus on LDA in sklearn - a great read, don't forget to read the mallet article.
- Medium on LSA pLSA, LDA LDA2vec, high level theoretical - not clear
- Medium on LSI vs LDA vs HDP, HDP wins..
- Medium on LDA, some historical reference and general high level how-to-use examples.
- Incredibly useful response on LDA grid search params and about LDA expectations. Must read.
- Lda vs pLSA, talks about the sampling from a distribution of distributions in LDA
- BLog post on topic modelling - has some text about overfitting - undiscussed in many places.
- Perplexity vs coherence on held-out unseen data - not okay and okay, respectively, because of how the metrics are computed (i.e., read the formulas). Also this and this
- LDA as dimensionality reduction
- LDA on alpha and beta to control density of topics
- Jupyter notebook on hacknews LDA topic modelling - missing code?
- Jupyter notebook for kmeans, lda, svd,nmf comparison - advice is to keep nmf or other as a baseline to measure against LDA.
- Gensim on LDA with code
- Medium on lda with sklearn
- Selecting the number of topics in LDA, blog 1, blog2, using perplexity, prep and aic bic, coherence, coherence2, coherence 3 with tutorial, unclear, unclear with analysis of stopword % inclusion, unread, paper: heuristic approach, elbow method, using cv, Paper: new stability metric + gh code,
- Selecting the top K words in LDA
- Presentation: best practices for LDA
- Medium on guidedLDA - switching from LDA to a variation of it that is guided by the researcher / data
- Medium on lda - another introductory, la times
- Topic modelling through time
- Mallet vs nltk, params, params
- Paper: improving feature models
- Lda vs w2v (doesn't make sense to compare), again here
- Adding lda features to w2v for classification
- Spacy and gensim on 20 news groups
- The best topic modelling explanation including Usages, insights, a great read, with code - shows how to find similar docs by topic in gensim, and shows how to transform unseen documents and do similarity using sklearn:
- Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
- Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.
- Uncovering Themes in Texts – Useful for detecting trends in online publications for example
- A Form of Tagging - If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
- Topic Modelling for Feature Selection - Sometimes LDA can also be used as feature selection technique. Take an example of text classification problem where the training data contain category wise documents. If LDA is running on sets of category wise documents. Followed by removing common topic terms across the results of different categories will give the best features for a category.
How to interpret topics using pyldaviz:
Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:
- Larger topics are more frequent in the corpus.
- Topics closer together are more similar, topics further apart are less similar.
- When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
- Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
pyLDAvis - what am I looking at? by spacy
There are a lot of moving parts in the visualization. Here's a brief summary:
- On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)
- The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.
- The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
- An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
- On the right, there is a bar chart showing top terms.
- When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
- When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.
- Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
- Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic — i.e., terms that occur only in this topic, and do not occur in other topics.
- Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
- Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.
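A minimal sketch of producing the pyLDAvis view for a gensim LDA model; note the submodule is pyLDAvis.gensim_models in recent versions (pyLDAvis.gensim in older ones), and the corpus here is a toy:

```python
# Sketch: train a small gensim LDA model and prepare the pyLDAvis visualization.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis   # pyLDAvis.gensim in older versions

texts = [["cat", "dog", "pet"], ["stock", "market", "shares"], ["dog", "food", "pet"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                     passes=10, random_state=0)

vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")    # or pyLDAvis.display(vis) in a notebook
```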
- Another great article about LDA, including algorithm, parameters!! And
Parameters of LDA
- Alpha and Beta Hyperparameters – alpha represents document-topic density and Beta represents topic-word density. Higher the value of alpha, documents are composed of more topics and lower the value of alpha, documents contain fewer topics. On the other hand, higher the beta, topics are composed of a large number of words in the corpus, and with the lower value of beta, they are composed of few words.
- Number of Topics – Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics by using Kullback Leibler Divergence Score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to this[1] original paper on the use of KL divergence.
- Number of Topic Terms – Number of terms composed in a single topic. It is generally decided according to the requirement. If the problem statement talks about extracting themes or concepts, it is recommended to choose a higher number, if problem statement talks about extracting features or terms, a low number is recommended.
- Number of Iterations / passes – Maximum number of iterations allowed to LDA algorithm for convergence.
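For reference, a sketch of how the parameters above map onto gensim's LdaModel arguments (toy corpus; the 'auto' priors are one reasonable choice, not a recommendation):

```python
# How the LDA parameters map to gensim's LdaModel arguments.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "pet"], ["stock", "market", "shares"], ["dog", "food", "pet"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,        # number of topics
    alpha="auto",        # document-topic density (learned asymmetric prior)
    eta="auto",          # topic-word density (called beta in many papers)
    passes=10,           # passes over the corpus
    iterations=100,      # max inference iterations per document chunk
    random_state=0,
)
print(lda.print_topics(num_words=3))   # top terms per topic
```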
Ways to improve LDA:
- Reduce dimensionality of the document-term matrix
- Frequency filter
- POS filter
- Batch wise LDA
- History of LDA - by the French guy
- Diff between lda and mallet - The inference algorithms in Mallet and Gensim are indeed different. Mallet uses Gibbs Sampling which is more precise than Gensim's faster online Variational Bayes. There is a way to get comparable performance by increasing the number of passes.
- Mallet in gensim blog post
- Alpha beta in mallet: contribution
- The default for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.
- With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.
- The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.
- Multilingual - alpha is divided by topic count, reaffirms 7
- Topic modelling with lda and nmf on medium - has a very good simple example with probabilities
- Code: great for top docs, terms, topics etc.
- Great article: Many ways of evaluating topics by running LDA
- Youtube on LDAvis explained
- Presentation: More visualization options including ldavis
- A pointer to the ldaviz fix -> fix, git code
- Difference between lda in gensim and sklearn a post on rare
- The best code article on LDA/MALLET, and using sklearn (using clustering for getting group of sentences in each topic)
- LDA in gensim, a tutorial by gensim
- Lda on medium
- What are the pros and cons of LDA and NMF in topic modeling? Under what situations should we choose LDA or NMF? Is there comparison of two techniques in topic modeling?
- What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
- Exploring Topic Coherence over many models and many topics lda nmf svd, using umass and uci coherence measures
- *** Practical topic findings for short sentence text code
- What's the difference between SVD/NMF and LDA as topic model algorithms essentially? Deterministic vs prob based
- What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
- What are the relationships among NMF, tensor factorization, deep learning, topic modeling, etc.?
- Code: lda nmf
- Unread a comparison of lda and nmf
- Presentation: lda sparse coding matrix factorization
- An experimental comparison between NMF and LDA for active cross-situational object-word learning
- Topic coherence in gensom with jupyter code
- Topic modelling dynamic presentation
- Paper: Topic modelling and event identification from twitter data, says LDA vs NMI (NMF?) and using coherence to analyze
- Just another medium article about TM (topic modeling)
- What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE) LDADE's tunings dramatically reduces topic instability.
- Talk about topic modelling
- Intro to topic modelling
- Detecting topics in twitter github code
- Another topic model tutorial
- (didnt read) NTM - neural topic modeling using embedded spaces with github code
- Another lda tutorial
- Comparing tweets using lda
- Lda and w2v as features for some classification task
- Improving TM (topic modeling) with embeddings
- w2v/doc2v for topic clustering - need to see the code to understand how they got clean topics, i assume a human rewrote it
Topic coherence (lda/nmf)
- What is?, Wiki on pmi
- Datacamp on coherence metrics, a comparison, read me.
- Paper: explains what is coherence
- Umass vs C_v, what are the diff?
- Paper: umass, uci, nmpi, cv, cp etv Exploring the Space of Topic Coherence Measures
- Paper: Automatic evaluation of topic coherence
- Paper: exploring the space of topic coherence methods
- Paper: Relation between mutual information / entropy and pmi
- Stackexchange: coherence / pmi how to calc
- Paper: Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality - perplexity needs unseen data, coherence doesn't
- Evaluation of topic modelling techniques for twitter lda lda-u btm w2vgmm
- Paper: Topic coherence measures
- topic modelling from different domains
- Paper: Optimizing Semantic Coherence in Topic Models
- Paper: L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization
- Paper: Content matching between TV shows and advertisements through Latent Dirichlet Allocation
- Paper: Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation
- Paper: Evaluating topic coherence - Abstract: Topic models extract representative word sets—called topics—from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable, therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.
Conclusion: The results of the first experiment show that if we are using the one-any, any-any and one-all coherences directly for optimization they are leading to meaningful word sets. The second experiment shows that these coherence measures are able to outperform the UCI coherence as well as the UMass coherence on these generated word sets. For evaluating LDA topics any-any and one-any coherences perform slightly better than the UCI coherence. The correlation of the UMass coherence and the human ratings is not as high as for the other coherences.
- Code: Evaluating topic coherence, using gensim umass or cv parameter - To conclude, there are many other approaches to evaluate topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. The topic coherence measure is a good way to compare different topic models based on their human-interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score (see the sketch after this list).
- Formulas: UCI vs UMASS
- Inferring the number of topics for gensim's LDA - perplexity, CM, AIC, and BIC
- Perplexity as a measure for LDA
- Finding number of topics using perplexity
- Coherence for tweets
- Presentation: Twitter LDA, tweet pooling improvements, hierarchical summarization of tweets, twitter LDA in java on github. Papers: TM of twitter timeline, twitter aggregation by conversation, twitter topics using LDA, empirical study
- Using regularization to improve PMI score and in turn coherence for LDA topics
- Improving model precision - coherence using turkers for LDA
- Gensim - paper about their algorithm and PMI/UCI etc.
- Advice for coherence, then Good vs bad model (50 vs 1 iterations) measuring u_mass coherence - 2nd code - “In your data we can see that there is a peak between 0-100 and a peak between 400-500. What I would think in this case is that "does ~480 topics make sense for the kind of data I have?" If not, you can just do an np.argmax for 0-100 topics and trade-off coherence score for simpler understanding. Otherwise just do an np.argmax on the full set.”
- Diff term weighting schemas for topic modelling, code plus paper
- Workaround for pyLDAvis using LDA-Mallet
- pyLDAvis paper
- Visualizing LDA topics results
- Visualizing trends, topics, sentiment, heat maps, entities - really good
- Topic stability Metric, a novel method, compared against jaccard, spearman, silhouette.: Measuring LDA Topic Stability from Clusters of Replicated Runs
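The coherence sketch referenced above - comparing u_mass and c_v with gensim's CoherenceModel on a toy corpus:

```python
# Comparing u_mass and c_v coherence for a gensim LDA model (toy corpus).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["cat", "dog", "pet", "food"], ["stock", "market", "shares", "trade"],
         ["dog", "food", "pet", "toys"], ["market", "trade", "shares", "price"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for measure in ("u_mass", "c_v"):
    cm = CoherenceModel(model=lda, texts=texts, corpus=corpus,
                        dictionary=dictionary, coherence=measure)
    print(measure, cm.get_coherence())
```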
Medium Article about LDA and NMF (Non-negative Matrix factorization)+ code
- “if you want to rework your own topic models that, say, jointly correlate an article’s topics with votes or predict topics over users then you might be interested in lda2vec.”
- Datacamp intro
- Original blog - I just learned about these papers which are quite similar: Gaussian LDA for Topic Word Embeddings and Nonparametric Spherical Topic Modeling with Word Embeddings.
- Moody’s Slide Share (excellent read)
- Docs
- Original Git + Excellent notebook example
- Tf implementation, another more recent one tf 1.5
- Another blog explaining about lda etc, post, post
- Lda2vec in tf, tf 1.5,
- Comparing lda2vec to lda
- Youtube: lda/doc2vec with pca examples
- Example on gh on jupyter
- Git, paper
- Topic modeling with DistilBERT on medium, BERTopic!, c-TF-IDF, UMAP, HDBSCAN, merging similar topics, visualization, BERTopic (same method as the above)
- Medium with the same general method
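A minimal BERTopic sketch, assuming the bertopic package (and its sentence-transformers backend) is installed; 20 newsgroups is used only as an easily available corpus:

```python
# Minimal BERTopic usage sketch.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # topic sizes and representative c-TF-IDF terms
print(topic_model.get_topic(0))              # top words for topic 0
```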
- State of the art LSTM architectures using NN
- Medium: Ner free datasets and bilstm implementation using glove embeddings
Easy to implement in keras! They are based on the following paper
- Medium: NLTK entities, polyglot entities, sner entities, finally an ensemble method wins all!
- Comparison between spacy and SNER - for terms.
- *** Unsupervised NER using Bert
- Custom NER using spacy
- Spacy Ner with custom data
- How to create a NER from scratch using kaggle data, using crf, and analysing crf weights using external package
- Another comparison between spacy and SNER - both are the same, for many classes.
- Vidhya on spacy vs ner - tutorial + code on how to use spacy for pos, dep, ner, compared to nltk/corenlp (sner etc). The results reflect a global score, not one specific to LOC for example.
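A minimal spaCy NER sketch, assuming the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

```python
# Sketch: running spaCy's pretrained NER.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Tel Aviv next year.")

for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. Apple ORG, Tel Aviv GPE, next year DATE
```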
Stanford NER (SNER)
- SNER presentation - combines HMM and MaxEnt features, distributional features; NER has many applications.
- How to train SNER, a FAQ with many other answers (read first before doing anything with SNER)
- SNER demo - capital letters matter, a minimum of one.
- State of the art NER benchmark
- Review paper, SNER, spacy, stanford wins
- Review paper SNER, others on biographical text, stanford wins
- Another NER DL paper, 90%+
Spacy & Others
- Spacy - using prodigy and spacy to train a NER classifier using active learning
- Ner using DL BLSTM, using glove embeddings, using CRF layer against another CRF.
- Another medium paper on the BLSTM CRF with Guillaume's code
- Guillaume blog post, detailed explanation
- For Italian
- Another 90+ proposed solution
- A promising python implementation based on one or two of the previous papers
- Quora advise, the first is cool, the second is questionable
- Off the shelf solutions benchmark
- Parallel api talk about bilstm and their 2mil tagged ner model (washington passes)
- Bert search engine, cosine between paragraphs and question.
- Semantic search, autom completion, filtering, augmentation, scoring. Problems: Token matching, contextualization, query misunderstanding, image search, metric. Solutions: synonym generation, query autocompletion, alternate query generation, word and doc embedding, contextualization, ranking, ensemble, multilingual search
- Keras blog - char-level, token-using embedding layer, teacher forcing
- Teacher forcing explained
- Same as keras but with token-level
- Medium on char, word, byte-level
- Mastery on enc-dec using the keras method, and on neural translation
- Machine translation git from eng to jap, another, and its medium
- Incorporating Copying Mechanism in Sequence-to-Sequence Learning - In this paper, we incorporate copying into neural network-based Seq2Seq learning and propose a new model called CopyNet with encoder-decoder structure. CopyNet can nicely integrate the regular way of word generation in the decoder with the new copying mechanism which can choose sub-sequences in the input sequence and put them at proper places in the output sequence.
- Pythia, qna for images on colab
- Building a Q&A system part 1
- Building a Q&A model
- Vidhya on Q&A
- Q&A system using milvus - An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
- Q&A system
- Using RNN
- Using language modeling
- Word based vs char based - Word-based LMs display higher accuracy and lower computational cost than char-based LMs. However, char-based RNN LMs better model languages with a rich morphology such as Finnish, Turkish, Russian etc. Using word-based RNN LMs to model such languages is difficult if possible at all and is not advised. Char-based RNN LMs can mimic grammatically correct sequences for a wide range of languages, require a bigger hidden layer and are computationally more expensive, while word-based RNN LMs train faster and generate more coherent texts, and yet even these generated texts are far from making actual sense.
- Medium on char-based with code, leads to better grammar
- Git, keras language models, char level word level and sentence using VAE
- A qualitative comparison of google, azure, amazon, ibm LD LI
- CLD2, CLD3, PYCLD2, POLYGLOT wraps CLD, alex ott cld stats, cld comparison vs tika langid
- Fast text LI, facebook post
- OPENNLP
- Google detect language, github code, v3beta
- Microsoft azure LD, 2
- Ibm watson, 2
- Amazon, 2
- Lingua - most accurate for java… doesn't seem like its accurate enough
- LD with infinity gram 99.1 on a lot of data a benchmark for this 2012 method, LD with infinity gram
- WiLI dataset for LD, comparison of CLD vs others
- Comparison of CLD vs FT vs OPEN NLP - beware based on 200 samples per language!!
Full results for every language that I tested are in table at the end of blog post & on Github. From them I can make following conclusions:
- all detectors are equally good on some languages, such as, Japanese, Chinese, Vietnamese, Greek, Arabic, Farsi, Georgian, etc. - for them the accuracy of detection is between 98 & 100%;
- CLD is much better in detection of "rare" languages, especially for languages, that are similar to more frequently used - Afrikaans vs Dutch, Azerbaijani vs. Turkish, Malay vs. Indonesian, Nepali vs. Hindi, Russian vs Bulgarian, etc. (it could be result of imbalance of training data - I need to check the source dataset);
- for "major" languages not mentioned above (English, French, German, Spanish, Portuguese, Dutch) the fastText results are much better than CLD's, and in many cases lingid.py's & OpenNLP's;
- for many languages results for "compressed" fastText model are slightly worse than results from "full" model (mostly only by 1-2%, but could be higher, like for Kazakh when difference is 33%), but there are languages where the situation is different - results for compressed are slight better than for full (for example, for German or Dutch);
OpenNLP has many misclassifications for Cyrillic languages - Russian/Ukrainian, ...
Rafael Oliveira posted on FB a simple diagram that shows what languages are detected better by CLD & what is better handled by fastText
Here are some additional notes about differences in behavior of detectors that I observe during analyzing results:
- fastText is more reliable than CLD on the short texts;
- fastText models & langid.py detect language as Hebrew instead of Jewish as in CLD. Similarly, CLD uses 'in' for Indonesian language instead of standard 'id' used by fastText & langid.py;
- fastText distinguish between Cyrillic- & Latin-based versions of Serbian;
- CLD tends to incorporate geographical & person's names into detection results - for example, blog post in German about travel to Iceland is detected as Icelandic, while fastText detects it as German;
- In extended detection mode CLD tends to select more rare language, like, Galician or Catalan over Spanish, Serbian instead of Russian, etc.
- OpenNLP isn't very good in detection for short texts.
The models released by fastText development team provides very good alternative to existing language detection tools, like, Google's CLD & langid.py - for most of "popular" languages, these models provides higher detection accuracy comparing to other tools, combined with high speed of detection (drawback of langid.py). Even using "compressed" model it's possible to reach good detection accuracy. Although for some less frequently used languages, CLD & langid.py may show better results.
Performance-wise, the langid.py is much slower than both CLD & fastText. On average, CLD requires 0.5-1 ms to perform language detection. For fastText & langid.py I don't have precise numbers yet, only approximates based on speed of execution of corresponding programs.
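A small sketch of language identification with langdetect and fastText's pretrained LID model, assuming both packages are installed and lid.176.bin has been downloaded from the fastText site:

```python
# Sketch: language identification with langdetect and fastText's LID model.
from langdetect import detect, DetectorFactory
import fasttext

DetectorFactory.seed = 0                      # make langdetect deterministic
print(detect("Ceci est une phrase en français."))          # -> 'fr'

ft_model = fasttext.load_model("lid.176.bin")               # pretrained LID model file
labels, probs = ft_model.predict("Dies ist ein deutscher Satz.", k=2)
print(labels, probs)                          # -> ('__label__de', ...) with confidences
```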
GIT:
Articles:
Papers:
- A comparison of lang ident approaches
- LI on code-switched social media
- Comparing LI methods, has 6 big languages
- Comparing LI techniques
- Radim rehurek LI on the web extending the dictionary
- Comparative study of LI methods, 2
- State of the art methods for neural machine translation - a review of papers
- LASER: Zero shot multi lang-translation by facebook, github
- How to use laser on medium
- Stanford coreNLP language POS/NER/DEP PARSE etc for 53 languages
- Using embedding spaces w2v by gensim
- The risk of using bleu
- Really good: BLEU - what is it, how is it calculated?
“[BLEU] looks at the presence or absence of particular words, as well as the ordering and the degree of distortion—how much they actually are separated in the output.”
BLEU’s evaluation system requires two inputs: (i) a numerical translation closeness metric, which is then assigned and measured against (ii) a corpus of human reference translations.
BLEU averages out various metrics using an n-gram method, a probabilistic language model often used in computational linguistics.
The result is typically measured on a 0 to 1 scale, with 1 as the hypothetical “perfect” translation. Since the human reference, against which MT is measured, is always made up of multiple translations, even a human translation would not score a 1, however. Sometimes the score is expressed as multiplied by 100 or, as in the case of Google mentioned above, by 10.
a BLEU score offers more of an intuitive rather than an absolute meaning and is best used for relative judgments: “If we get a BLEU score of 35 (out of 100), it seems okay, but it actually has no correlation to the quality of the output in any meaningful sense. If it’s less than 15, we can probably safely say it’s very bad. If it’s greater than 60, we probably have some mistake in our testing! So it will generally fall in there.”
“Typically, if you have multiple [human translation] references, the BLEU score tends to be higher. So if you hear a very large BLEU score—someone gives you a value that seems very high—you can ask them if there are multiple references being used; because, then, that is the reason that the score is actually higher.”
- General talk about FAMG (fb, ama, micro, goog) and research direction atm, including some info about BLEU scores and the comparison issues with reports of BLEU (boils down to different unmentioned parameters)
- One proposed solution is sacreBLEU, pip install sacrebleu
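A minimal sacreBLEU sketch (corpus-level BLEU on the 0-100 scale discussed above; hypotheses and references are made up):

```python
# Computing corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["the cat is on the mat", "there is a cat on the mat"]
# each inner list is one full set of references, aligned with the hypotheses
references = [["the cat is on the mat", "there is a cat on the mat"],
              ["a cat sits on the mat", "a cat is on the mat"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # reported on a 0-100 scale, as discussed above
```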
Named entity language transliteration
- Why re is slow
- Benchmark, comparisons, more, many more,
- Split on separator but keep the separator, in Python
- Semantic versioning
- Hyperscan - Hyperscan (paper) is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre library, but is a standalone library with its own C API.
- Re2 - python This is the source code repository for RE2, a regular expression library.
- Spacy’s Matcher & “regex”
- Flashtext- This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm.
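A small FlashText sketch, assuming the flashtext package is installed (keywords are illustrative):

```python
# Sketch: FlashText keyword extraction and replacement (pip install flashtext).
from flashtext import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword("New York", "NY")            # canonical form returned on match
kp.add_keyword("machine learning")

text = "I studied machine learning in New York."
print(kp.extract_keywords(text))            # ['machine learning', 'NY']
print(kp.replace_keywords(text))            # 'I studied machine learning in NY.'
```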