v0.6 (MaartenGr#60)
* Add dynamic topic modeling
* Add evolutionary topic modeling
* Add self.topic_names and get_topic_info()
* Add visualize_topics_over_time()
* Improve cTFIDF stability + performance
* Add binning to topics_over_time
* Fix mapping of probs (MaartenGr#63, MaartenGr#64)
MaartenGr authored Mar 1, 2021
1 parent e84d7d1 commit 971d612
Showing 15 changed files with 633 additions and 54 deletions.
45 changes: 45 additions & 0 deletions README.md
@@ -143,6 +143,49 @@ topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
```

### Dynamic Topic Modeling
Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics
over time. These methods allow you to understand how a topic is represented across different times.
Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:

```python
import re
import pandas as pd

# Load the tweets
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
# Remove URLs and lowercase the text
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
# Remove tokens that start with "@" (mentions)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
# Keep letters only and collapse repeated whitespace
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
# Drop retweets and tweets that are empty after cleaning
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
```
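
To make the effect of the three cleaning steps above concrete, here is a minimal sketch that applies them to a single string using only the standard library. The helper `clean_tweet` and the sample text are hypothetical; they are not part of BERTopic or of the dataset.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the three cleaning steps to one string (hypothetical helper)."""
    # 1. Remove URLs and lowercase the text
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Remove tokens that start with "@" (mentions)
    text = " ".join(tok for tok in text.split() if not tok.startswith("@"))
    # 3. Replace runs of non-letters with a space and collapse whitespace
    return " ".join(re.sub(r"[^a-zA-Z]+", " ", text).split())


# Made-up sample, not taken from the dataset
sample = "Thank you @FoxNews! Great ratings: https://t.co/abc123 #MAGA"
print(clean_tweet(sample))  # thank you great ratings maga
```

A tweet that consists only of mentions and links becomes an empty string, which is why the pipeline filters on `trump.text != ""` afterwards.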

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

```python
from bertopic import BERTopic

model = BERTopic(verbose=True)
topics, _ = model.fit_transform(tweets)
```

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this
by simply calling `topics_over_time` and passing in his tweets, the corresponding timestamps, and the related topics:

```python
topics_over_time = model.topics_over_time(tweets, topics, timestamps)
```
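
As a rough illustration of what "a topic representation at each timestamp" means, the sketch below groups documents by `(timestamp, topic)` and keeps the most frequent words per group. This is a simplified stand-in, not BERTopic's implementation: it uses raw word counts rather than c-TF-IDF, and `top_words_over_time` as well as the toy data are assumptions made for this example.

```python
from collections import Counter, defaultdict


def top_words_over_time(docs, topics, timestamps, top_n=1):
    """Return {(timestamp, topic): [most frequent words]} (hypothetical helper).

    Simplification: raw word counts stand in for c-TF-IDF scores.
    """
    counts = defaultdict(Counter)
    for doc, topic, ts in zip(docs, topics, timestamps):
        # Accumulate word counts for every (timestamp, topic) pair
        counts[(ts, topic)].update(doc.split())
    return {
        key: [word for word, _ in counter.most_common(top_n)]
        for key, counter in counts.items()
    }


# Toy data: two topics observed at two timestamps
toy_docs = ["news news media", "media media lies", "wall wall build", "border border wall"]
toy_topics = [0, 0, 1, 1]
toy_timestamps = ["2016", "2017", "2016", "2017"]

result = top_words_over_time(toy_docs, toy_topics, toy_timestamps)
for key, words in sorted(result.items()):
    print(key, words)
```

In the toy data, the top word for topic 0 shifts from "news" in 2016 to "media" in 2017, which is the kind of change in representation that dynamic topic modeling is meant to surface.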

Finally, we can visualize the topics by simply calling `visualize_topics_over_time()`:

```python
model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91])
```

<img src="images/dtm.gif" width="80%" height="80%" align="center" />


### Overview

| Methods | Code |
@@ -153,6 +196,7 @@ topics, _ = topic_model.fit_transform(docs, embeddings)
| Access single topic | `topic_model.get_topic(12)` |
| Access all topics | `topic_model.get_topics()` |
| Get topic freq | `topic_model.get_topic_freq()` |
| Get all topic information | `topic_model.get_topic_info()` |
| Visualize Topics | `topic_model.visualize_topics()` |
| Visualize Topic Probability Distribution | `topic_model.visualize_distribution(probabilities[0])` |
| Update topic representation | `topic_model.update_topics(docs, topics, n_gram_range=(1, 3))` |
@@ -161,6 +205,7 @@ topics, _ = topic_model.fit_transform(docs, embeddings)
| Save model | `topic_model.save("my_model")` |
| Load model | `BERTopic.load("my_model")` |
| Get parameters | `topic_model.get_params()` |


### Citation
To cite BERTopic in your work, please use the following bibtex reference:
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -2,7 +2,7 @@
from bertopic._ctfidf import ClassTFIDF
from bertopic._embeddings import languages

-__version__ = "0.5.0"
+__version__ = "0.6.0"

__all__ = [
"BERTopic",