Tutorials

The following tutorials demonstrate various features of NeMo Curator. They are meant to provide a coding foundation for building applications that consume the data NeMo Curator curates.

Get Started

To get started, we recommend the following tutorials, which introduce the core functionality of NeMo Curator and give an idea of what a data curation pipeline might look like:

tinystories, which covers core functionality such as downloading, filtering, PII removal, and exact deduplication.
peft-curation, which covers operations suitable for curating small-scale datasets used for task-specific fine-tuning.
synthetic-data-hello-world, which covers the basic synthetic data generation facilities for interfacing with external models such as Nemotron-4 340B Instruct.
peft-curation-with-sdg, which combines data processing operations and synthetic data generation using Nemotron-4 340B Instruct or LLaMa 3.1 405B Instruct into a single pipeline. This tutorial also demonstrates advanced functions such as reward score assignment via Nemotron-4 340B Reward, as well as semantic deduplication to remove semantically similar real or synthetic records.
pretraining-data-curation, which covers a data curation pipeline for creating an LLM pretraining dataset at scale.
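
To give a rough idea of what a small curation pipeline does, here is a minimal sketch of three of the steps mentioned above (filtering, PII removal, and exact deduplication) in plain Python. It does not use NeMo Curator at all: the helper names (`keep_long_enough`, `redact_emails`, `exact_dedup`, `curate`), the word-count threshold, and the `<EMAIL>` placeholder are assumptions made for illustration, and email redaction stands in for the much broader PII handling shown in the tutorials. See the tinystories tutorial for the actual NeMo Curator modules.

```python
import hashlib
import re

# Illustrative only: these helpers are hypothetical and are NOT NeMo Curator APIs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def keep_long_enough(text: str, min_words: int = 5) -> bool:
    """Filtering step: keep documents with at least `min_words` words."""
    return len(text.split()) >= min_words


def redact_emails(text: str) -> str:
    """PII step (simplified): replace email addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def exact_dedup(texts):
    """Exact deduplication: keep the first document for each content hash."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def curate(texts, min_words: int = 5):
    """Run the steps in order: filter, redact, then deduplicate."""
    filtered = [t for t in texts if keep_long_enough(t, min_words)]
    redacted = [redact_emails(t) for t in filtered]
    return exact_dedup(redacted)


if __name__ == "__main__":
    docs = [
        "Once upon a time there was a fox.",
        "Once upon a time there was a fox.",
        "Too short.",
        "Write to me at jane@example.com whenever you like.",
    ]
    for doc in curate(docs):
        print(doc)
```

Hashing each document keeps memory use proportional to the number of unique documents rather than their total size. Exact deduplication of this kind only catches byte-identical text, which is why some tutorials additionally cover semantic deduplication.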

List of Tutorials

Tutorial	Description	Additional Resources
pretraining-data-curation	Demonstrates an accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment
pretraining-vietnamese-data-curation	Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment
dapt-curation	Data curation sample for domain-adaptive pre-training (DAPT), focusing on ChipNeMo data curation as an example	Blog post
distributed_data_classification	Demonstrates data domain and data quality classification at scale in a distributed environment
nemotron_340B_synthetic_datagen	Demonstrates the use of NeMo Curator synthetic data generation modules to leverage Nemotron-4 340B Instruct for generating synthetic preference data
nemo-retriever-synthetic-data-generation	Demonstrates the use of NeMo Curator synthetic data generation modules to leverage NIM models for generating synthetic data and to perform data quality assessment on the generated data using LLM-as-judge and embedding-model-as-judge. The generated data is used to evaluate retrieval/RAG pipelines
peft-curation	Data curation sample for parameter efficient fine-tuning (PEFT) use-cases	Blog post
peft-curation-with-sdg	Demonstrates a pipeline to leverage external models such as Nemotron-4 340B Instruct for synthetic data generation, data quality annotation via Nemotron-4 340B Reward, as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases	Use this data to fine-tune your own model
single_node_tutorial	A comprehensive example to demonstrate running various NeMo Curator functionalities locally
synthetic-data-hello-world	An introductory example of synthetic data generation using NeMo Curator
synthetic-preference-data	Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic preference data
synthetic-retrieval-evaluation	Demonstrates the use of NeMo Curator synthetic data generation modules to leverage LLaMa 3.1 405B Instruct for generating synthetic data to evaluate retrieval pipelines
tinystories	A comprehensive example of curating a small dataset to use for model pre-training.	Blog post
zyda2-tutorial	A comprehensive tutorial on how to reproduce the Zyda2 dataset with NeMo Curator.	NVIDIA blog post, Zyphra blog post
