Create notebook tutorials for distributed data classifiers (#415)
* split domain and quality notebooks

Signed-off-by: Sarah Yurick <[email protected]>

* add multilingual domain classifier

Signed-off-by: Sarah Yurick <[email protected]>

* add fineweb-edu classifier

Signed-off-by: Sarah Yurick <[email protected]>

* aegis classifier

Signed-off-by: Sarah Yurick <[email protected]>

* add instruction-data-guard classifier

Signed-off-by: Sarah Yurick <[email protected]>

* edit readmes

Signed-off-by: Sarah Yurick <[email protected]>

* add content type notebook

Signed-off-by: Sarah Yurick <[email protected]>

* add prompt task and complexity

Signed-off-by: Sarah Yurick <[email protected]>

* add more info to notebooks

Signed-off-by: Sarah Yurick <[email protected]>

* change to output_path

Signed-off-by: Sarah Yurick <[email protected]>

* add readme

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick authored Jan 23, 2025
1 parent d31c29f commit cd38de0
Showing 13 changed files with 2,108 additions and 135 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -158,7 +158,7 @@ To get started with NeMo Curator, you can follow the tutorials [available here](

- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch.
- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
-- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the domain and quality classifiers to help with data annotation.
+- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which demonstrates how to use NVIDIA's Hugging Face classifiers to help with data annotation.
- [`single_node_tutorial`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial) which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.
- [`image-curation`](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/image-curation/image-curation.ipynb) which explores the scalable image curation modules.

84 changes: 0 additions & 84 deletions nemo_curator/sample_dataframe.py

This file was deleted.

2 changes: 1 addition & 1 deletion tutorials/README.md
@@ -19,7 +19,7 @@ To get started, we recommend starting with the following tutorials to become fam
| [pretraining-data-curation](./pretraining-data-curation/) | Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
| [pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/) | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
| [dapt-curation](./dapt-curation) | Data curation sample for domain-adaptive pre-training (DAPT), focusing on [ChipNeMo](https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/) data curation as an example | [Blog post](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/) |
-| [distributed_data_classification](./distributed_data_classification) | Demonstrates data domain and data quality classification at scale in a distributed environment | |
+| [distributed_data_classification](./distributed_data_classification) | Demonstrates machine learning classification with NVIDIA's Hugging Face models at scale in a distributed environment | |
| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
| [nemo-retriever-synthetic-data-generation](./nemo_retriever_synthetic_data_generation) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [NIM models](https://ai.nvidia.com) for generating synthetic data and perform data quality assessment on generated data using LLM-as-judge and embedding-model-as-judge. The generated data would be used to evaluate retrieval/RAG pipelines |
| [peft-curation](./peft-curation/) | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | [Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/) |
27 changes: 27 additions & 0 deletions tutorials/distributed_data_classification/README.md
@@ -0,0 +1,27 @@
# Distributed Data Classification
The following is a set of Jupyter notebook tutorials which demonstrate how to use various text classification models supported by NeMo Curator.
The goal of using these classifiers is to help with data annotation, which is useful in data blending for foundation model training.

Each of these classifiers is available on Hugging Face and can be run independently with the [Transformers](https://github.com/huggingface/transformers) library.
By running them with NeMo Curator, the classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
Each of the Jupyter notebooks in this directory demonstrates how to run a classifier on text data and scales easily to large amounts of data.
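
To give a rough intuition for what "intelligent batching" can mean, here is a minimal pure-Python sketch of length-sorted batching, one common way to reduce padding during offline inference. This is an illustration under stated assumptions, not CrossFit's actual implementation: the function name `length_sorted_batches`, the `max_tokens_per_batch` parameter, and the use of whitespace word counts in place of a real tokenizer are all hypothetical.

```python
from typing import Iterator, List, Tuple


def length_sorted_batches(
    texts: List[str], max_tokens_per_batch: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (original_index, text), grouping texts of similar length.

    Assumption: a whitespace word count stands in for a tokenizer's token count.
    """
    # Sort by length so that rows in the same batch need little padding.
    indexed = sorted(enumerate(texts), key=lambda pair: len(pair[1].split()))
    batch: List[Tuple[int, str]] = []
    longest = 0
    for idx, text in indexed:
        n_tokens = max(1, len(text.split()))
        new_longest = max(longest, n_tokens)
        # Padded cost of a batch = (number of rows) * (longest row in the batch).
        if batch and new_longest * (len(batch) + 1) > max_tokens_per_batch:
            yield batch
            batch, new_longest = [], n_tokens
        batch.append((idx, text))
        longest = new_longest
    if batch:
        yield batch


docs = [
    "short text",
    "a much longer document with many more words in it than the others",
    "tiny",
    "medium length sentence here",
]
batches = list(length_sorted_batches(docs, max_tokens_per_batch=8))
```

Each batch carries the original row indices, so predictions can be restored to input order after inference. A text that exceeds the budget on its own is placed in a batch by itself rather than dropped.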

Before running any of these notebooks, please see this [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on how to install NeMo Curator.

## List of Classifiers

<div align="center">

| NeMo Curator Classifier | Hugging Face page |
| --- | --- |
| `AegisClassifier` | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0) |
| `ContentTypeClassifier` | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) |
| `DomainClassifier` | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) |
| `FineWebEduClassifier` | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) |
| `InstructionDataGuardClassifier` | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) |
| `MultilingualDomainClassifier` | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) |
| `PromptTaskComplexityClassifier` | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) |
| `PyTorchClassifier` | Requires local .pth file(s) for any DeBERTa-based text classifier(s) |
| `QualityClassifier` | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) |

</div>