BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports guided, (semi-) supervised, and dynamic topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found here and here. For a more detailed overview, you can read the paper.

Installation

Installation, with sentence-transformers, can be done using pypi:

pip install bertopic

You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:

pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:

Name	Link
Topic Modeling with BERTopic
(Custom) Embedding Models in BERTopic
Advanced Customization in BERTopic
(semi-)Supervised Topic Modeling with BERTopic
Dynamic Topic Modeling with Trump's Tweets
Topic Modeling arXiv Abstracts

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
0	693	49_windows_drive_dos_file
1	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
3	381	22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Visualize Topics

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

We can create an overview of the most frequent topics in a way that they are easily interpretable. Horizontal barcharts typically convey information rather well and allow for an intuitive representation of the topics:

topic_model.visualize_barchart()

Find all possible visualizations with interactive examples in the documentation here.

Embedding Models

BERTopic supports many embedding models that can be used to embed the documents and words:

Sentence-Transformers
Flair
Spacy
Gensim
USE

Sentence-Transformers is typically used as it has shown great results embedding documents meant for semantic similarity. Simply select any from their documentation here and pass it to BERTopic:

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

Flair allows you to choose almost any 🤗 transformers model. Simply select any from here and pass it to BERTopic:

from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

Click here for a full overview of all supported embedding models.

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented over time. Here, we will be using all of Donald Trump's tweet to see how he talked over certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and pass in his tweets, the corresponding timestamps, and the related topics:

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)

Overview

BERTopic has quite a number of functions that quickly can become overwhelming. To alleviate this issue, you will find an overview of all methods and a short description of its purpose.

Common

For quick access to common functions, here is an overview of BERTopic's main methods:

Method	Code
Fit the model	`.fit(docs)`
Fit the model and predict documents	`.fit_transform(docs)`
Predict new documents	`.transform([new_doc])`
Access single topic	`.get_topic(topic=12)`
Access all topics	`.get_topics()`
Get topic freq	`.get_topic_freq()`
Get all topic information	`.get_topic_info()`
Get representative docs per topic	`.get_representative_docs()`
Update topic representation	`.update_topics(docs, topics, n_gram_range=(1, 3))`
Generate topic labels	`.generate_topic_labels()`
Set topic labels	`.set_topic_labels(my_custom_labels)`
Merge topics	`.merge_topics(docs, topics, topics_to_merge)`
Reduce nr of topics	`.reduce_topics(docs, topics, nr_topics=30)`
Find topics	`.find_topics("vehicle")`
Save model	`.save("my_model")`
Load model	`BERTopic.load("my_model")`
Get parameters	`.get_params()`

Variations

There are many different use cases in which topic modeling can be used. As such, a number of variations of BERTopic have been developed such that one package can be used across across many use cases:

Method	Code
(semi-) Supervised Topic Modeling	`.fit(docs, y=y)`
Topic Modeling per Class	`.topics_per_class(docs, topics, classes)`
Dynamic Topic Modeling	`.topics_over_time(docs, topics, timestamps)`
Hierarchical Topic Modeling	`.hierarchical_topics(docs, topics)`
Guided Topic Modeling	`BERTopic(seed_topic_list=seed_topic_list)`

Visualizations

Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. Visualizing different aspects of the topic model helps in understanding the model and makes it easier to tweak the model to your liking.

Method	Code
Visualize Topics	`.visualize_topics()`
Visualize Documents	`.visualize_documents()`
Visualize Document Hierarchy	`.visualize_hierarchical_documents()`
Visualize Topic Hierarchy	`.visualize_hierarchy()`
Visualize Topic Tree	`.get_topic_tree(hierarchical_topics)`
Visualize Topic Terms	`.visualize_barchart()`
Visualize Topic Similarity	`.visualize_heatmap()`
Visualize Term Score Decline	`.visualize_term_rank()`
Visualize Topic Probability Distribution	`.visualize_distribution(probs[0])`
Visualize Topics over Time	`.visualize_topics_over_time(topics_over_time)`
Visualize Topics per Class	`.visualize_topics_per_class(topics_per_class)`

Citation

To cite the BERTopic paper, please use the following bibtex reference:

@article{grootendorst2022bertopic,
  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
  author={Grootendorst, Maarten},
  journal={arXiv preprint arXiv:2203.05794},
  year={2022}
}

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.github/workflows		.github/workflows
bertopic		bertopic
docs		docs
images		images
notebooks		notebooks
tests		tests
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
mkdocs.yml		mkdocs.yml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BERTopic

Installation

Getting Started

Quick Start

Visualize Topics

Embedding Models

Dynamic Topic Modeling

Overview

Common

Variations

Visualizations

Citation

About

Releases

Packages

Languages

License

zhawe01/BERTopic

Folders and files

Latest commit

History

Repository files navigation

BERTopic

Installation

Getting Started

Quick Start

Visualize Topics

Embedding Models

Dynamic Topic Modeling

Overview

Common

Variations

Visualizations

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages