v0.15 (MaartenGr#1291)
Prepare for v0.15 release by including changelog and many documentation updates.
MaartenGr authored May 30, 2023
1 parent 307a15f commit 609d49c
Showing 22 changed files with 1,405 additions and 350 deletions.
53 changes: 53 additions & 0 deletions .github/CONTRIBUTING.md
@@ -0,0 +1,53 @@
# Contributing to BERTopic

Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with the incredibly fast-paced field.

Whether contributions are new features, better documentation, bug fixes, or improvements to the repository itself, anything is appreciated!

## 📚 Guidelines

### 🤖 Contributing Code

To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).

When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
Please do not try to push directly to this repo unless you are a maintainer.

There are exceptions to the `issue -> pull request` approach. These are typically small changes that do not need prior agreement, such as:
* Documentation
* Spelling/grammar issues
* Docstrings
* etc.

There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.

Note that the main focus of pull requests and code should be:
* Easy readability
* Clear communication
* Sufficient documentation

## 🚀 Quick Start

To start contributing, make sure to start from a fresh environment. Using an environment manager, such as `conda` or `pyenv`, helps make sure that your code is reproducible and tracks the versions you have in your environment.

If you are using conda, you can approach it as follows:

1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
2. Install requirements (e.g., `pip install ".[dev]"`)
   * This also installs the documentation and testing packages
3. (Optional) Run `make docs` to build your documentation
4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests

❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.

## 🤓 Collaborative Efforts

If you run into any issue with the above or need help getting started with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different, and so are their processes.

## 🏆 Recognition

If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.

## 🎈 Release

BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. Given the rapid changes in the field, this versioning is deliberate: it is a way to keep up. Backwards compatibility is taken into account, but integrating new features, and thereby keeping up with the field, takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
3 changes: 3 additions & 0 deletions .gitignore
@@ -26,6 +26,9 @@ share/python-wheels/
.installed.cfg
*.egg
MANIFEST
+model_dir
+model_dir/
+test

# PyInstaller
# Usually these files are written by a python script from a template
9 changes: 7 additions & 2 deletions Makefile
@@ -1,12 +1,17 @@
 test:
 	pytest

+coverage:
+	pytest --cov
+
 install:
 	python -m pip install -e .

 install-test:
-	python -m pip install -e ".[test]"
-	python -m pip install -e "."
+	python -m pip install -e ".[dev]"

+docs:
+	mkdocs serve
+
 pypi:
 	python setup.py sdist
69 changes: 45 additions & 24 deletions README.md
@@ -13,18 +13,29 @@
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

-BERTopic supports
-[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
-[**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
-[**semi-supervised**](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html),
-[**manual**](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html),
-[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
-[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
-[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
-[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html),
-[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html),
-[**multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html), and
-[**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling. It even supports visualizations similar to LDAvis!
+BERTopic supports all kinds of topic modeling techniques:
+<table>
+<tr>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td>
+</tr>
+<tr>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td>
+</tr>
+<tr>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td>
+</tr>
+<tr>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td>
+<td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#text-generation-prompts">Text Generation/LLM</a></td>
+</tr>
+</table>

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

@@ -39,13 +50,10 @@ pip install bertopic
If you want to install BERTopic with other embedding models, you can choose one of the following:

```bash
-# Embedding models
-pip install bertopic[flair]
-pip install bertopic[gensim]
-pip install bertopic[spacy]
-pip install bertopic[use]
+# Choose an embedding backend
+pip install bertopic[flair, gensim, spacy, use]

-# Vision topic modeling
+# Topic modeling with images
pip install bertopic[vision]
```

@@ -61,6 +69,7 @@ with one of the examples below:
| Advanced Customization in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
| (semi-)Supervised Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing) |
| Dynamic Topic Modeling with Trump's Tweets | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing) |
+| Topic Modeling on Large Data | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing) |
| Topic Modeling arXiv Abstracts | [![Kaggle](https://img.shields.io/static/v1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic) |


@@ -122,8 +131,7 @@ Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - dr
1) I have an old Jasmine drive... 49 49_windows_drive_dos_file windows - drive - docs... 0.038983 ...
```

-> 🔥 **Tip**
-> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
+**`🔥 Tip`**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.

## Fine-tune Topic Representations

@@ -137,8 +145,20 @@ representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
```

-> 🔥 **Tip**
-> Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
+However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
+
+```python
+import openai
+from bertopic.representation import OpenAI
+
+# Fine-tune topic representations with GPT
+openai.api_key = "sk-..."
+representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)
+topic_model = BERTopic(representation_model=representation_model)
+```
+
+**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.


## Visualizations
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
@@ -153,7 +173,7 @@ topic_model.visualize_topics()
<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />

## Modularity
-By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
+By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:

https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4

@@ -166,7 +186,6 @@ You can swap out any of these models or even remove them entirely. The following
5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations

-To find more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

## Functionality
BERTopic has many functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
@@ -228,12 +247,14 @@ There are many different use cases in which topic modeling can be used. As such,
| [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
| [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
| [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
+| [Multimodal Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) | ``.fit(docs, images=images)`` |
| [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
| [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
| [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
| [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |



### Visualizations
Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation.
Visualizing different aspects of the topic model helps in understanding the model and makes it easier
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
 from bertopic._bertopic import BERTopic

-__version__ = "0.14.1"
+__version__ = "0.15.0"

 __all__ = [
     "BERTopic",
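
As a note on the bump above: under semantic versioning, going from 0.14.1 to 0.15.0 changes the minor component, and for pre-1.0 software the specification allows anything to change between releases. The sketch below classifies such a bump. The helpers `parse_version` and `classify_bump` are hypothetical names for illustration and are not part of BERTopic; the sketch assumes plain `major.minor.patch` strings without pre-release suffixes.

```python
def parse_version(version: str) -> tuple:
    """Split a plain 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that differs between two versions."""
    names = ("major", "minor", "patch")
    for name, before, after in zip(names, parse_version(old), parse_version(new)):
        if before != after:
            return name
    return "none"


bump = classify_bump("0.14.1", "0.15.0")
# While the major version is 0, a minor bump may carry breaking changes
may_break = parse_version("0.15.0")[0] == 0 and bump in ("major", "minor")
print(bump, may_break)  # prints: minor True
```

This matches the release policy described in `CONTRIBUTING.md`: backwards compatibility is considered, but it is not guaranteed across 0.x minor releases.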
11 changes: 8 additions & 3 deletions bertopic/_bertopic.py
@@ -12,6 +12,7 @@
import math
import joblib
import inspect
+import collections
import numpy as np
import pandas as pd
import scipy.sparse as sp
@@ -3004,8 +3005,13 @@ def load(cls,
             topics, params, tensors, ctfidf_tensors, ctfidf_config, images = save_utils.load_files_from_hf(path)
         else:
             raise ValueError("Make sure to either pass a valid directory or HF model.")
+        topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
+
+        # Replace embedding model if one is specifically chosen
+        if embedding_model is not None and type(topic_model.embedding_model) == BaseEmbedder:
+            topic_model.embedding_model = select_backend(embedding_model)

-        return _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
+        return topic_model

     def push_to_hf_hub(
             self,
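
The intent of the new branch in `load` can be illustrated without BERTopic itself. The sketch below uses hypothetical stand-ins: `PlaceholderEmbedder`, `RealEmbedder`, and `attach_embedder` are illustrative names, not BERTopic APIs. It assumes, as the diff suggests, that a model loaded without a usable embedding model carries a bare base-class instance, and that this instance should be replaced only when the caller passes an embedding model.

```python
class PlaceholderEmbedder:
    """Hypothetical stand-in for a base embedder with no real backend."""


class RealEmbedder(PlaceholderEmbedder):
    """Hypothetical stand-in for a working embedding backend."""


def attach_embedder(current, requested=None):
    """Swap in `requested` only when `current` is a bare placeholder."""
    # Exact type comparison: a subclass instance is treated as a real backend
    if requested is not None and type(current) is PlaceholderEmbedder:
        return requested
    return current


placeholder, backend = PlaceholderEmbedder(), RealEmbedder()
print(attach_embedder(placeholder, backend) is backend)     # placeholder is replaced
print(attach_embedder(backend, RealEmbedder()) is backend)  # existing backend is kept
print(attach_embedder(placeholder) is placeholder)          # nothing was requested
```

An `isinstance` check would also match subclasses and could overwrite an embedding model that was restored with the saved model, which is presumably why the diff compares exact types.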
@@ -3510,8 +3516,7 @@ def _update_topic_size(self, documents: pd.DataFrame):
         Arguments:
             documents: Updated dataframe with documents and their corresponding IDs and newly added Topics
         """
-        sizes = documents.groupby(['Topic']).count().sort_values("ID", ascending=False).reset_index()
-        self.topic_sizes_ = dict(zip(sizes.Topic, sizes.Document))
+        self.topic_sizes_ = collections.Counter(documents.Topic.values.tolist())
         self.topics_ = documents.Topic.astype(int).tolist()

     def _extract_words_per_topic(self,
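
A minimal, standard-library sketch of what the `collections.Counter` replacement computes. The `topics` list is made-up data standing in for `documents.Topic.values.tolist()`; nothing here touches BERTopic or pandas.

```python
from collections import Counter

# Illustrative topic assignment per document (-1 is used here as an outlier label)
topics = [0, 1, 0, -1, 1, 0]

# A single pass over the assignments yields the size of every topic
topic_sizes = Counter(topics)

print(topic_sizes[0], topic_sizes[1], topic_sizes[-1])  # prints: 3 2 1
print(sum(topic_sizes.values()))                        # prints: 6
print(topic_sizes.most_common(1))                       # prints: [(0, 3)]
```

Because `Counter` is a `dict` subclass, code that reads `topic_sizes_` as a plain mapping keeps working, and the sum of its values equals the number of documents, which is what the `generate_readme` change in `_save_utils.py` relies on.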
8 changes: 6 additions & 2 deletions bertopic/_save_utils.py
@@ -266,7 +266,11 @@ def generate_readme(model, repo_id: str):
     params = "\n".join([f"* {param}: {value}" for param, value in params.items()])
     topics = sorted(list(set(model.topics_)))
     nr_topics = str(len(set(model.topics_)))
-    nr_documents = str(model.c_tf_idf_.shape[1])
+
+    if model.topic_sizes_ is not None:
+        nr_documents = str(sum(model.topic_sizes_.values()))
+    else:
+        nr_documents = ""

     # Topic information
     topic_keywords = [" - ".join(list(zip(*model.get_topic(topic)))[0][:5]) for topic in topics]
@@ -290,7 +294,7 @@ def generate_readme(model, repo_id: str):
     if not has_visual_aspect:
         model_card = model_card.replace("{PIPELINE_TAG}", "text-classification")
     else:
-        model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG} /n","") # TODO add proper tag for this instance
+        model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n","") # TODO add proper tag for this instance

     return model_card
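
The one-character fix in the `replace` call matters because `" /n"` is a literal space, slash, and letter `n`, not a newline, so the old pattern never matched. A small sketch with an assumed, simplified template (the real model card template is not shown in this diff):

```python
# Hypothetical, simplified model card header; the real template may differ
template = "tags:\n- bertopic\npipeline_tag: {PIPELINE_TAG}\nlibrary_name: bertopic\n"

# Old pattern: " /n" is three ordinary characters, so nothing is replaced
old = template.replace("pipeline_tag: {PIPELINE_TAG} /n", "")

# New pattern: "\n" is a real newline, so the whole line is removed
new = template.replace("pipeline_tag: {PIPELINE_TAG}\n", "")

print(old == template)          # prints: True
print("{PIPELINE_TAG}" in new)  # prints: False
print(new)
```

With the old pattern, the unfilled `{PIPELINE_TAG}` placeholder would have been left in the generated card for models with a visual aspect.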
