These tutorials demonstrate the main features of NeMo Curator. They are meant to provide a coding foundation for building applications that consume the data NeMo Curator curates.
To get started, we recommend the following tutorials, which introduce NeMo Curator's main features and show what a data curation pipeline might look like:
- tinystories, which gives an overview of core functionality such as downloading, filtering, PII removal, and exact deduplication.
- peft-curation, which gives an overview of operations suited to curating small-scale datasets used for task-specific fine-tuning.
- synthetic-data-hello-world, which gives an overview of the basic synthetic data generation facilities for interfacing with external models such as Nemotron-4 340B Instruct.
- peft-curation-with-sdg, which combines data processing operations and synthetic data generation using Nemotron-4 340B Instruct or LLaMa 3.1 405B Instruct into a single pipeline. This tutorial also demonstrates advanced functions such as reward score assignment via Nemotron-4 340B Reward, as well as semantic deduplication to remove semantically similar real or synthetic records.
- pretraining-data-curation, which gives an overview of a data curation pipeline for creating an LLM pretraining dataset at scale.
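
To make the idea of a curation pipeline concrete, here is a minimal sketch of three of the stages named above (filtering, PII removal, and exact deduplication) chained together in plain Python. It does not use the NeMo Curator API: every function name, the `{"id", "text"}` record layout, the word-count threshold, and the email-only redaction rule are assumptions chosen for illustration. The tutorials show how NeMo Curator's own modules perform these steps.

```python
import hashlib
import re

# NOTE: all names below are hypothetical helpers, not NeMo Curator APIs.
# Deliberately simple email pattern, used as a stand-in for real PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def word_count_filter(docs, min_words=3):
    """Keep documents with at least `min_words` whitespace-separated words."""
    return [d for d in docs if len(d["text"].split()) >= min_words]


def redact_emails(docs, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return [{**d, "text": EMAIL_RE.sub(placeholder, d["text"])} for d in docs]


def exact_dedup(docs):
    """Drop documents whose text hashes to a digest that was already seen."""
    seen = set()
    unique = []
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def curate(docs):
    """Run the three stages in order: filter, redact, deduplicate."""
    docs = word_count_filter(docs)
    docs = redact_emails(docs)
    return exact_dedup(docs)


if __name__ == "__main__":
    sample = [
        {"id": 0, "text": "Once upon a time there was a cat."},
        {"id": 1, "text": "Too short"},
        {"id": 2, "text": "Once upon a time there was a cat."},
        {"id": 3, "text": "Write to me at alice@example.com please."},
    ]
    for doc in curate(sample):
        print(doc["id"], doc["text"])
```

Each stage takes a list of records and returns a new list, so stages can be reordered or dropped independently. This sketch keeps the first copy of each duplicate and runs in memory on a single machine, whereas the tutorials cover distributed execution.
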
Tutorial | Description | Additional Resources |
---|---|---|
pretraining-data-curation | Demonstrates an accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
pretraining-vietnamese-data-curation | Demonstrates how to use NeMo Curator to process large-scale, high-quality Vietnamese data in a distributed environment | |
dapt-curation | Data curation sample for domain-adaptive pre-training (DAPT), focusing on ChipNeMo data curation as an example | Blog post |
distributed_data_classification | Demonstrates data domain and data quality classification at scale in a distributed environment | |
nemotron_340B_synthetic_datagen | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage Nemotron-4 340B Instruct for generating synthetic preference data | |
nemo-retriever-synthetic-data-generation | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage NIM models for generating synthetic data and to perform data quality assessment on the generated data using LLM-as-judge and embedding-model-as-judge. The generated data can be used to evaluate retrieval/RAG pipelines | |
peft-curation | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | Blog post |
peft-curation-with-sdg | Demonstrates a pipeline to leverage external models such as Nemotron-4 340B Instruct for synthetic data generation, data quality annotation via Nemotron-4 340B Reward, as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases | Use this data to fine-tune your own model |
single_node_tutorial | A comprehensive example that demonstrates running various NeMo Curator functionalities locally | |
synthetic-data-hello-world | An introductory example of synthetic data generation using NeMo Curator | |
synthetic-preference-data | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic preference data | |
synthetic-retrieval-evaluation | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic data to evaluate retrieval pipelines | |
tinystories | A comprehensive example of curating a small dataset to use for model pre-training | Blog post |
zyda2-tutorial | A comprehensive tutorial on how to reproduce the Zyda2 dataset with NeMo Curator | NVIDIA blog post, Zyphra blog post |