Text Data for Trading: Sentiment Analysis

This is the first of three chapters dedicated to extracting signals for algorithmic trading strategies from text data using natural language processing (NLP) and machine learning.

Text data is rich in content but highly unstructured, so it requires more preprocessing before an ML algorithm can extract relevant information. A key challenge is converting text into a numerical format without losing its meaning. We will cover several techniques that capture nuances of language so that they can be used as input for ML algorithms.

In this chapter, we will introduce fundamental feature extraction techniques that focus on individual semantic units, i.e. words or short groups of words called tokens. We will show how to represent documents as vectors of token counts by creating a document-term matrix and then use it as input for news classification and sentiment analysis. We will also introduce the Naive Bayes algorithm, which is popular for this purpose.

In the following two chapters, we build on these techniques and use ML algorithms like topic modeling and word-vector embeddings to capture the information contained in a broader context.

In particular, in this chapter we will cover:

What the fundamental NLP workflow looks like
How to build a multilingual feature extraction pipeline using spaCy and TextBlob
How to perform NLP tasks like part-of-speech tagging or named entity recognition
How to convert tokens to numbers using the document-term matrix
How to classify text using the Naive Bayes model
How to perform sentiment analysis

How to extract features from text data

Speech and Language Processing, Daniel Jurafsky & James H. Martin, 3rd edition, draft, 2018
Statistical natural language processing and corpus-based computational linguistics, Annotated list of resources, Stanford University
NLP Data Sources

Challenges of Natural Language Processing

The conversion of unstructured text into a machine-readable format requires careful preprocessing to preserve the valuable semantic aspects of the data. How humans derive meaning from language and comprehend its content is not fully understood, and improving language understanding by machines remains an area of very active research.

NLP is challenging because the effective use of text data for machine learning requires an understanding of the inner workings of language as well as knowledge about the world to which it refers. Key challenges include:

ambiguity due to polysemy, i.e. a word or phrase can have different meanings that depend on context (‘Local High School Dropouts Cut in Half’)
non-standard and evolving use of language, especially in social media
idioms: ‘throw in the towel’
entity names can be tricky: ‘Where is A Bug's Life playing?’
the need for knowledge about the world: ‘Mary and Sue are sisters’ vs ‘Mary and Sue are mothers’

Use Cases

Key NLP use cases include:

Use Case	Description	Examples
Chatbots	Understand natural language from the user and return intelligent responses	Api.ai
Information retrieval	Find relevant results and similar results	Google
Information extraction	Structured information from unstructured documents	Events from Gmail
Machine translation	One language to another	Google Translate
Text simplification	Preserve the meaning of text, but simplify the grammar and vocabulary	Rewordify, Simple English Wikipedia
Predictive text input	Faster or easier typing	Phrase completion, A much better application
Sentiment analysis	Attitude of speaker	Hater News
Automatic summarization	Extractive or abstractive summarization	reddit's autotldr algo, autotldr example
Natural language generation	Generate text from data	How a computer describes a sports match, Publishers withdraw more than 120 gibberish papers
Speech recognition and generation	Speech-to-text, text-to-speech	Google's Web Speech API demo, Vocalware Text-to-Speech demo
Question answering	Determine the intent of the question, match query with knowledge base, evaluate hypotheses	How did Watson beat Jeopardy champion Ken Jennings?, Watson Trivia Challenge, The AI Behind Watson

From text to tokens – the NLP pipeline

The following table summarizes the key tasks of an NLP pipeline:

Feature	Description
Tokenization	Segment text into words, punctuation marks etc.
Part-of-speech tagging	Assign word types to tokens, like a verb or noun.
Dependency parsing	Label syntactic token dependencies, like subject <=> object.
Stemming & Lemmatization	Assign the base forms of words: "was" => "be", "rats" => "rat".
Sentence boundary detection	Find and segment individual sentences.
Named Entity Recognition	Label "real-world" objects, like persons, companies or locations.
Similarity	Evaluate similarity of words, text spans, and documents.
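The first and fifth tasks in the table can be illustrated with a minimal sketch that uses only the Python standard library. The regular expressions and the sample text are assumptions made for illustration. They are far cruder than the statistical models that libraries such as spaCy apply, and the sentence splitter will, for example, break incorrectly after abbreviations like "U.S.".

```python
import re

# Words (with optional internal apostrophes or hyphens) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")
# Assume a sentence ends at ., ! or ? followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Naive sentence boundary detection based on punctuation and capitalization."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def tokenize(sentence):
    """Segment a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


# Made-up sample text for illustration only.
text = "Apple shares rose 3% today. Analysts weren't surprised!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```

Rule-based segmentation like this is useful mainly for seeing what each task has to produce; the later tasks in the table (tagging, parsing, entity recognition) generally require trained models.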

NLP pipeline with spaCy and textacy

The notebook nlp_pipeline_with_spaCy demonstrates how to construct an NLP pipeline using the open-source Python library spaCy. The textacy library builds on spaCy and provides easy access to spaCy attributes and additional functionality.

spaCy docs and installation instructions
textacy relies on spaCy to solve additional NLP tasks - see documentation

Code Examples

The code for this section is in the notebook nlp_pipeline_with_spaCy

Data

BBC Articles, use raw text files
TED2013, a parallel corpus of TED talk subtitles in 15 languages

NLP with TextBlob

The TextBlob library provides a simplified interface for common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and others.

The notebook nlp_with_textblob illustrates its functionality.

A good alternative is NLTK, a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Natural Language ToolKit (NLTK) Documentation

From tokens to numbers – the document-term matrix

This section introduces the bag-of-words model that converts text data into a numeric vector space representation that permits the comparison of documents using their distance. We demonstrate how to create a document-term matrix using the sklearn library.
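As a rough illustration of the idea, the following standard-library sketch builds a tiny document-term matrix and compares documents by the cosine of the angle between their count vectors. The whitespace tokenizer, the function names, and the three sample sentences are assumptions made for this example; they are not taken from the notebooks.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(doc.lower().split())


def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of token counts per document."""
    bags = [bag_of_words(doc) for doc in docs]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[token] for token in vocabulary] for bag in bags]
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up sample documents for illustration only.
docs = [
    "stocks rally as earnings beat forecasts",
    "earnings beat forecasts and stocks rally",
    "central bank holds interest rates",
]
vocabulary, dtm = document_term_matrix(docs)
similar = cosine_similarity(dtm[0], dtm[1])    # same words, different order
unrelated = cosine_similarity(dtm[0], dtm[2])  # no shared tokens
print(f"similar: {similar:.3f}, unrelated: {unrelated:.3f}")
```

The first two documents share five of their six tokens and score high even though the word order differs, which shows both the strength and the main limitation of the bag-of-words model.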

TF-IDF is about what matters

Document-term matrix with sklearn

The scikit-learn feature_extraction.text module offers two tools to create a document-term matrix.

The CountVectorizer uses binary or absolute counts to measure the term frequency tf(d, t) for each document d and token t.
The TfidfVectorizer, in contrast, weighs the (absolute) term frequency by the inverse document frequency (idf). As a result, a term that appears in more documents receives a lower weight than a term with the same frequency in a given document but a lower frequency across all documents.

The notebook document_term_matrix demonstrates usage and configuration.
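To make the weighting concrete, here is a standard-library sketch of tf-idf rather than a call into scikit-learn. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as its default; the whitespace tokenizer, the function name, and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return the vocabulary and L2-normalized tf-idf weights per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    # Smoothed idf; a token that occurs in every document gets the minimum value 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for c in counts:
        weights = [c[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Made-up sample documents for illustration only.
docs = [
    "stocks rally as earnings beat",
    "stocks fall as rates rise",
    "stocks flat as traders wait",
]
vocabulary, rows = tfidf_rows(docs)
first_doc = dict(zip(vocabulary, rows[0]))
print({token: round(weight, 3) for token, weight in first_doc.items() if weight})
```

In the first document, "stocks" and "rally" both occur once, but "stocks" appears in all three documents and therefore ends up with a lower weight than "rally".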

Text classification and sentiment analysis

Once text data has been converted into numerical features using the natural language processing techniques discussed in the previous sections, text classification works just like any other classification task.

In this section, we will apply these preprocessing techniques to news articles, product reviews, and Twitter data and train various classifiers to predict discrete news categories, review scores, and sentiment polarity.

First, we will introduce the Naive Bayes model, a probabilistic classification algorithm that works well with the text features produced by a bag-of-words model.

Daily Market News Sentiment and Stock Prices, David E. Allen & Michael McAleer & Abhay K. Singh, 2015, Tinbergen Institute Discussion Paper
Predicting Economic Indicators from Web Text Using Sentiment Composition, Abby Levenberg, et al, 2014
JP Morgan NLP research results

The Naive Bayes classifier

The Naive Bayes algorithm is very popular for text classification because its low computational cost and memory requirements facilitate training on very large, high-dimensional datasets. Its predictive performance can compete with that of more complex models, and it provides a good baseline. It is best known for successful spam detection.

The model relies on Bayes' theorem and the assumption that the various features are independent of each other given the outcome class. In other words, for a given outcome, knowing the value of one feature (e.g. the presence of a token in a document) does not provide any information about the value of another feature.
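Under the independence assumption, the score for a class is the log prior plus the sum of the log likelihoods of the individual tokens. The sketch below implements a multinomial variant with add-one (Laplace) smoothing using only the standard library. The class name, the smoothing choice, and the toy documents are assumptions made for illustration and do not reproduce the notebook code.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.class_counts = Counter(labels)       # class -> number of documents
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc.lower().split())
        self.vocabulary = {t for c in self.token_counts.values() for t in c}
        return self

    def log_scores(self, doc):
        """Unnormalized log posterior: log P(class) + sum of log P(token | class)."""
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_label / n_docs)
            for token in doc.lower().split():
                if token in self.vocabulary:  # skip tokens never seen in training
                    score += math.log((counts[token] + 1) / total)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Made-up toy documents for illustration only.
train_docs = [
    "shares rally on strong earnings",
    "profits rise and shares jump",
    "team wins the cup final",
    "striker scores in the final match",
]
train_labels = ["business", "business", "sport", "sport"]
model = NaiveBayesSketch().fit(train_docs, train_labels)
prediction = model.predict("earnings lift shares")
print(prediction)
```

Training only requires counting tokens per class, which is why the cost in time and memory stays low even for large vocabularies.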

News article classification

We start with an illustration of the Naive Bayes model to classify 2,225 BBC news articles that we know belong to five different categories.

The notebook news_text_classification contains the relevant examples.

Sentiment Analysis

Sentiment analysis is one of the most popular uses of natural language processing and machine learning for trading because positive or negative perspectives on assets or other price drivers are likely to impact returns.

Generally, modeling approaches to sentiment analysis rely either on dictionaries, as the TextBlob library does, or on models trained on outcomes for a specific domain. The latter is preferable because it permits more targeted labeling, e.g. by tying text features to subsequent price changes rather than indirect sentiment scores.
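The dictionary-based approach can be sketched in a few lines. The mini-lexicon, the negation rule, and the sample sentences below are hypothetical and chosen for illustration; this is not TextBlob's implementation, and real sentiment dictionaries contain thousands of scored entries.

```python
# Hypothetical mini-lexicon mapping tokens to polarity scores in [-1, 1].
LEXICON = {
    "gain": 1.0,
    "beat": 1.0,
    "strong": 0.5,
    "loss": -1.0,
    "miss": -1.0,
    "weak": -0.5,
}
# Assumed rule: a negation word flips the sign of the token that follows it.
NEGATIONS = {"not", "no", "never"}


def polarity(text):
    """Average lexicon score of matched tokens, flipping the sign after a negation."""
    tokens = text.lower().split()
    scores = []
    for i, token in enumerate(tokens):
        if token in LEXICON:
            sign = -1.0 if i > 0 and tokens[i - 1] in NEGATIONS else 1.0
            scores.append(sign * LEXICON[token])
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample sentences for illustration only.
positive = polarity("strong quarter with earnings beat")
negative = polarity("guidance was not strong and margins were weak")
neutral = polarity("the company reports on tuesday")
print(positive, negative, neutral)
```

A scorer like this needs no labeled data, but it only knows the words in its dictionary, which is one reason a model trained on domain-specific outcomes can be the better choice.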

Twitter Dataset

We illustrate machine learning for sentiment analysis using a Twitter dataset with binary polarity labels, and a large Yelp business review dataset with a five-point outcome scale.

The notebook sentiment_analysis_twitter contains the relevant example.

Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

Yelp Dataset

To illustrate text processing and classification at larger scale, we also use the Yelp Dataset.

The notebook sentiment_analysis_yelp contains the relevant example.

Yelp Dataset Challenge
