v0.15 (MaartenGr#1291)

Prepare for v0.15 release by including changelog and many documentation updates.
BritvaBo · May 30, 2023 · 609d49c
1 parent 307a15f
commit 609d49c
Showing 22 changed files with 1,405 additions and 350 deletions.
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -0,0 +1,53 @@
+# Contributing to BERTopic
+
+Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with this incredibly fast-paced field. 
+
+Whether contributions are new features, better documentation, bug fixes, or improvement on the repository itself, anything is appreciated!
+
+## 📚 Guidelines
+
+### 🤖 Contributing Code
+
+To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr). 
+
+When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a  ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
+Please do not try to push directly to this repo unless you are a maintainer.
+
+There are exceptions to the `issue -> pull request` approach that are typically small changes that do not need agreements, such as:
+* Documentation
+* Spelling/grammar issues
+* Docstrings
+* etc.
+
+There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request. 
+
+Note that the main focus of pull requests and code should be:
+* Easy readability
+* Clear communication
+* Sufficient documentation
+
+## 🚀 Quick Start
+
+To start contributing, make sure to first start from a fresh environment. Using an environment manager, such as `conda` or `pyenv`, helps make sure that your code is reproducible and tracks the versions you have in your environment. 
+
+If you are using conda, you can approach it as follows:
+
+1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
+2. Install requirements (e.g., `pip install .[dev]`)
+  * This makes sure to also install documentation and testing packages
+3. (Optional) Run `make docs` to build your documentation
+4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests
+
+❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.
+
+## 🤓 Collaborative Efforts
+
+When you run into any issue with the above or need help getting started with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different, and so are its processes. 
+
+## 🏆 Recognition
+
+If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated. 
+
+## 🎈 Release
+
+BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. Given the rapid changes in the field, this versioning is intentional as a way to keep up. Backwards-compatibility is taken into account, but integrating new features and thereby keeping up with the field takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary. 
diff --git a/.gitignore b/.gitignore
@@ -26,6 +26,9 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
+model_dir
+model_dir/
+test
 
 # PyInstaller
 #  Usually these files are written by a python script from a template

diff --git a/Makefile b/Makefile
@@ -1,12 +1,17 @@
 test:
 	pytest
 
+coverage:
+	pytest --cov
+
 install:
 	python -m pip install -e .
 
 install-test:
-	python -m pip install -e ".[test]"
-	python -m pip install -e "."
+	python -m pip install -e ".[dev]"
+
+docs:
+	mkdocs serve
 
 pypi:
 	python setup.py sdist

diff --git a/README.md b/README.md
@@ -13,18 +13,29 @@
 BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
 allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
 
-BERTopic supports 
-[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html), 
-[**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html), 
-[**semi-supervised**](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html), 
-[**manual**](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html), 
-[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
-[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html), 
-[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
-[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), 
-[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html),
-[**multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html), and 
-[**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling. It even supports visualizations similar to LDAvis!
+BERTopic supports all kinds of topic modeling techniques:  
+<table>
+  <tr>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td>
+ </tr>
+   <tr>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td>
+ </tr>
+ <tr>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td>
+ </tr>
+ <tr>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td>
+    <td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#text-generation-prompts">Text Generation/LLM</a></td>
+ </tr>
+</table>
 
 Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html). 
 
@@ -39,13 +50,10 @@ pip install bertopic
 If you want to install BERTopic with other embedding models, you can choose one of the following:
 
 ```bash
-# Embedding models
-pip install bertopic[flair]
-pip install bertopic[gensim]
-pip install bertopic[spacy]
-pip install bertopic[use]
+# Choose an embedding backend
+pip install "bertopic[flair,gensim,spacy,use]"
 
-# Vision topic modeling
+# Topic modeling with images
 pip install bertopic[vision]
 ```
 
@@ -61,6 +69,7 @@ with one of the examples below:
 | Advanced Customization in BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
 | (semi-)Supervised Topic Modeling with BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)  |
 | Dynamic Topic Modeling with Trump's Tweets  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)  |
+| Topic Modeling on Large Data  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)  |
 | Topic Modeling arXiv Abstracts | [![Kaggle](https://img.shields.io/static/v1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic) |
 
 
@@ -122,8 +131,7 @@ Think! It's the SCSI card doing...	49     49_windows_drive_dos_file	windows - dr
 1) I have an old Jasmine drive...	49     49_windows_drive_dos_file	windows - drive - docs...	0.038983       ...
 ```
 
-> 🔥 **Tip**  
-> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages. 
+**`🔥 Tip`**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages. 
 
 ## Fine-tune Topic Representations
 
@@ -137,8 +145,20 @@ representation_model = KeyBERTInspired()
 topic_model = BERTopic(representation_model=representation_model)
 ```
 
-> 🔥 **Tip**  
-> Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic. 
+However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
+
+```python
+import openai
+from bertopic.representation import OpenAI
+
+# Fine-tune topic representations with GPT
+openai.api_key = "sk-..."
+representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)
+topic_model = BERTopic(representation_model=representation_model)
+```
+
+**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic. 
+
 
 ## Visualizations
 After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good 
@@ -153,7 +173,7 @@ topic_model.visualize_topics()
 <img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />
 
 ## Modularity
-By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
+By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
 
 https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4
 
@@ -166,7 +186,6 @@ You can swap out any of these models or even remove them entirely. The following
 5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
 6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
 
-To find more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html). 
 
 ## Functionality
 BERTopic has many functions that quickly can become overwhelming. To alleviate this issue, you will find an overview 
@@ -228,12 +247,14 @@ There are many different use cases in which topic modeling can be used. As such,
 | [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
 | [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
 | [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
+| [Multimodal Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) | `.fit(docs, images=images)` |
 | [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
 | [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
 | [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
 | [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |
 
 
+
 ### Visualizations
 Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. 
 Visualizing different aspects of the topic model helps in understanding the model and makes it easier 

diff --git a/bertopic/__init__.py b/bertopic/__init__.py
@@ -1,6 +1,6 @@
 from bertopic._bertopic import BERTopic
 
-__version__ = "0.14.1"
+__version__ = "0.15.0"
 
 __all__ = [
     "BERTopic",

diff --git a/bertopic/_bertopic.py b/bertopic/_bertopic.py
@@ -12,6 +12,7 @@
 import math
 import joblib
 import inspect
+import collections
 import numpy as np
 import pandas as pd
 import scipy.sparse as sp
@@ -3004,8 +3005,13 @@ def load(cls,
             topics, params, tensors, ctfidf_tensors, ctfidf_config, images = save_utils.load_files_from_hf(path)
         else:
             raise ValueError("Make sure to either pass a valid directory or HF model.")
+        topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
+
+        # Replace embedding model if one is specifically chosen
+        if embedding_model is not None and type(topic_model.embedding_model) == BaseEmbedder:
+            topic_model.embedding_model = select_backend(embedding_model)
 
-        return _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
+        return topic_model
 
     def push_to_hf_hub(
             self,
@@ -3510,8 +3516,7 @@ def _update_topic_size(self, documents: pd.DataFrame):
         Arguments:
             documents: Updated dataframe with documents and their corresponding IDs and newly added Topics
         """
-        sizes = documents.groupby(['Topic']).count().sort_values("ID", ascending=False).reset_index()
-        self.topic_sizes_ = dict(zip(sizes.Topic, sizes.Document))
+        self.topic_sizes_ = collections.Counter(documents.Topic.values.tolist())
         self.topics_ = documents.Topic.astype(int).tolist()
 
     def _extract_words_per_topic(self,
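
Review note on the `_update_topic_size` hunk above: the groupby-and-count over the dataframe is replaced by a direct `collections.Counter` over the `Topic` column. The following is a minimal, stdlib-only sketch of what the resulting `topic_sizes_` mapping looks like. The topic assignments are made up for illustration and do not come from the commit.

```python
import collections

# Hypothetical topic assignments, one entry per document
# (in BERTopic, -1 is the outlier topic)
topics = [0, 1, 0, -1, 0, 1]

# Same call shape as the new line in the hunk: count occurrences per topic
topic_sizes = collections.Counter(topics)

# Counter is a dict subclass, so existing dict-style lookups keep working
size_of_topic_0 = topic_sizes[0]

# most_common() lists topics from largest to smallest
ordered = topic_sizes.most_common()

# Summing the counts recovers the total number of documents
nr_documents = sum(topic_sizes.values())

print(size_of_topic_0, ordered, nr_documents)
```

The last step is the same computation that the `generate_readme` hunk in `bertopic/_save_utils.py` uses for `nr_documents`.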

diff --git a/bertopic/_save_utils.py b/bertopic/_save_utils.py
@@ -266,7 +266,11 @@ def generate_readme(model, repo_id: str):
     params = "\n".join([f"* {param}: {value}" for param, value in params.items()])
     topics = sorted(list(set(model.topics_)))
     nr_topics = str(len(set(model.topics_)))
-    nr_documents = str(model.c_tf_idf_.shape[1])
+
+    if model.topic_sizes_ is not None:
+        nr_documents = str(sum(model.topic_sizes_.values()))
+    else:
+        nr_documents = ""
 
     # Topic information
     topic_keywords = [" - ".join(list(zip(*model.get_topic(topic)))[0][:5]) for topic in topics]
@@ -290,7 +294,7 @@ def generate_readme(model, repo_id: str):
     if not has_visual_aspect:
         model_card = model_card.replace("{PIPELINE_TAG}", "text-classification")
     else:
-        model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG} /n","") # TODO add proper tag for this instance 
+        model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n","") # TODO add proper tag for this instance 
 
     return model_card
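
Review note on the `pipeline_tag` hunk above: the old pattern ended in ` /n` (a space, a slash, and the letter `n`) rather than a newline, so `str.replace` would only match a template containing those literal characters. A minimal sketch of the difference, assuming a shortened, hypothetical model card template (the real template is not part of this diff):

```python
# Hypothetical, shortened template; only the pipeline_tag line matters here
model_card = "tags:\n- bertopic\npipeline_tag: {PIPELINE_TAG}\n---\n"

# Old pattern: " /n" is three literal characters, so nothing matches
old_result = model_card.replace("pipeline_tag: {PIPELINE_TAG} /n", "")

# New pattern: "\n" is a real newline, so the whole line is removed
new_result = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n", "")

print(old_result == model_card)        # the old call left the card unchanged
print("{PIPELINE_TAG}" in new_result)  # the placeholder is gone after the fix
```

With the old pattern, the unfilled `{PIPELINE_TAG}` placeholder would have stayed in the generated card for this template.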