Create notebook tutorials for distributed data classifiers (#415)
* split domain and quality notebooks
  Signed-off-by: Sarah Yurick <[email protected]>
* add multilingual domain classifier
  Signed-off-by: Sarah Yurick <[email protected]>
* add fineweb-edu classifier
  Signed-off-by: Sarah Yurick <[email protected]>
* aegis classifier
  Signed-off-by: Sarah Yurick <[email protected]>
* add instruction-data-guard classifier
  Signed-off-by: Sarah Yurick <[email protected]>
* edit readmes
  Signed-off-by: Sarah Yurick <[email protected]>
* add content type notebook
  Signed-off-by: Sarah Yurick <[email protected]>
* add prompt task and complexity
  Signed-off-by: Sarah Yurick <[email protected]>
* add more info to notebooks
  Signed-off-by: Sarah Yurick <[email protected]>
* change to output_path
  Signed-off-by: Sarah Yurick <[email protected]>
* add readme
  Signed-off-by: Sarah Yurick <[email protected]>

---------
Signed-off-by: Sarah Yurick <[email protected]>
1 parent d31c29f, commit cd38de0
Showing 13 changed files with 2,108 additions and 135 deletions.
@@ -0,0 +1,27 @@
# Distributed Data Classification
This directory contains Jupyter notebook tutorials that demonstrate how to use the text classification models supported by NeMo Curator.
These classifiers help with data annotation, which is useful in data blending for foundation model training.

Except for `PyTorchClassifier`, which requires local weights, each of these classifiers is available on Hugging Face and can be run independently with the [Transformers](https://github.com/huggingface/transformers) library.
When run with NeMo Curator, the classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
Each Jupyter notebook in this directory demonstrates how to run a classifier on text data and scales easily to large amounts of data.

Before running any of these notebooks, see the [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on how to install NeMo Curator.
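
To give a rough idea of what length-aware batching can look like, here is a minimal sketch in plain Python. It is not CrossFit's implementation or API. The function name `length_sorted_batches`, the whitespace-based token count, and the `max_tokens_per_batch` budget are all assumptions made for illustration. The idea is to group texts of similar length so that less compute is spent on padding.

```python
from typing import Iterator


def length_sorted_batches(
    texts: list[str], max_tokens_per_batch: int
) -> Iterator[list[int]]:
    """Yield batches of indices into `texts`, grouping similar-length texts.

    Hypothetical helper for illustration only. Token counts are approximated
    by whitespace splitting, which a real tokenizer would replace.
    """
    lengths = [max(1, len(text.split())) for text in texts]
    # Visit texts from shortest to longest (stable for equal lengths).
    order = sorted(range(len(texts)), key=lambda i: lengths[i])

    batch: list[int] = []
    for i in order:
        # Lengths are ascending, so text i would be the longest in the batch;
        # the padded cost is that length times the new batch size.
        padded_cost = lengths[i] * (len(batch) + 1)
        if batch and padded_cost > max_tokens_per_batch:
            yield batch
            batch = []
        batch.append(i)
    if batch:
        yield batch
```

For example, `list(length_sorted_batches(["a b", "a b c d e f g h", "a", "a b c d e f g", "a b c"], 16))` groups the three short texts into one batch and the two long texts into another. A text that exceeds the budget on its own is still emitted as a single-item batch.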
## List of Classifiers

<div align="center">

| NeMo Curator Classifier | Hugging Face page |
| --- | --- |
| `AegisClassifier` | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0) |
| `ContentTypeClassifier` | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) |
| `DomainClassifier` | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) |
| `FineWebEduClassifier` | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) |
| `InstructionDataGuardClassifier` | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) |
| `MultilingualDomainClassifier` | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) |
| `PromptTaskComplexityClassifier` | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) |
| `PyTorchClassifier` | Requires local .pth file(s) for any DeBERTa-based text classifier(s) |
| `QualityClassifier` | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) |

</div>