Speech Data Processor (SDP) Toolkit

The Speech Data Processor (SDP) is a toolkit designed to simplify the processing of speech datasets. It minimizes the boilerplate code required and allows for easy sharing of processing steps. SDP's philosophy is to represent processing operations as 'processor' classes, which take in a path to a NeMo-style data manifest as input (or a path to the raw data directory if you do not have a NeMo-style manifest to start with), apply some processing to it, and then save the output manifest file.
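
Concretely, a NeMo-style manifest is a JSON-lines file with one utterance per line. The snippet below is a minimal, hypothetical sketch (not part of SDP) showing the shape of such an entry, using the audio_filepath, text, and duration fields that appear later in this README:

    # Illustrative only: a NeMo-style manifest stores one JSON object per line.
    # Field names match those used later in this README; real manifests may
    # carry additional fields.
    import json

    example_line = '{"audio_filepath": "audio/utt_0001.wav", "text": "Hola Mundo", "duration": 2.3}'
    entry = json.loads(example_line)

    # A processor reads entries like this, applies some transformation
    # (here, simply lowercasing the transcript), and writes the results
    # back out as a new manifest file.
    entry["text"] = entry["text"].lower()
    print(json.dumps(entry, ensure_ascii=False))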

Features

  • Creating Manifests: Generate manifests for your datasets.
  • Running ASR Inference: Automatically run ASR inference to remove utterances where the reference text differs greatly from ASR predictions.
  • Text Transformations: Apply text-based transformations to lines in the manifest.
  • Removing Inaccurate Transcripts: Remove lines from the manifest which may contain inaccurate transcripts.
  • Custom Processors: Write your own processor classes if the provided ones do not meet your needs (see the sketch below).
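
As a rough illustration of the last point, the sketch below subclasses SDP's BaseParallelProcessor to drop short utterances. The names used here (BaseParallelProcessor, DataEntry, process_dataset_entry) follow the pattern described in the SDP documentation, but treat the exact interface as an assumption and check the documentation for your installed version:

    # Hypothetical custom processor; verify the base-class interface against
    # the SDP documentation before relying on it.
    from sdp.processors.base_processor import BaseParallelProcessor, DataEntry

    class DropShortUtterances(BaseParallelProcessor):
        """Drops manifest entries shorter than `min_duration` seconds."""

        def __init__(self, min_duration: float = 0.5, **kwargs):
            super().__init__(**kwargs)
            self.min_duration = min_duration

        def process_dataset_entry(self, data_entry: dict):
            # Returning an empty list drops this utterance from the output manifest.
            if data_entry["duration"] < self.min_duration:
                return []
            return [DataEntry(data=data_entry)]

Such a class can then be referenced from a configuration file via its import path in the _target_ field, just like the built-in processors shown under Usage below.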

Installation

SDP is officially supported for Python 3.10, but might work for other versions.

  1. Clone the repository:
   git clone https://github.com/NVIDIA/NeMo-speech-data-processor.git
   cd NeMo-speech-data-processor
  2. Install dependencies:
   pip install -r requirements/main.txt
  3. Optional: If you need to use the ASR or NLP parts, or NeMo Text Processing, follow the NeMo installation instructions.

Example

In this example we will process LibriSpeech using SDP:

  • To download all available data, replace config.yaml with all.yaml.
  • For the mini dataset, replace it with mini.yaml.

   python NeMo-speech-data-processor/main.py \
     --config-path="dataset_configs/english/librispeech" \
     --config-name="config.yaml" \
     processors_to_run="0:" \
     workspace_dir="librispeech_data_dir"

Usage

  1. Create a Configuration YAML File:

    Here is a simplified example of a config.yaml file:

    processors:
      - _target_: sdp.processors.CreateInitialManifestMCV
        output_manifest_file: "${data_split}_initial_manifest.json"
        language_id: es
      - _target_: sdp.processors.ASRInference
        pretrained_model: "stt_es_quartznet15x5"
      - _target_: sdp.processors.SubRegex
        regex_params_list:
          - {"pattern": "¡", "repl": "."}
          - {"pattern": "ó", "repl": "o"}
        test_cases:
          - {input: {text: "¡adiós!"}, output: {text: ".adios!"}}
      - _target_: sdp.processors.DropNonAlphabet
        alphabet: "abcdefghijklmnopqrstuvwxyzáéíñóúüABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÑÓÚÜ"
        test_cases:
          - {input: {text: "test Тест ¡"}, output: null}
          - {input: {text: "test"}, output: {text: "test"}}
      - _target_: sdp.processors.KeepOnlySpecifiedFields
        output_manifest_file: "${data_split}_final_manifest.json"
        fields_to_keep:
          - "audio_filepath"
          - "text"
          - "duration"
  2. Run the Processor:

    Use the following command to process your dataset:

   python <SDP_ROOT>/main.py \
     --config-path="dataset_configs/<lang>/<dataset>/" \
     --config-name="config.yaml" \
     processors_to_run="all" \
     data_split="train" \
     workspace_dir="<dir_to_store_processed_data>"
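
After the run finishes, the final manifest written by the last processor can be inspected directly. The snippet below is a small, hypothetical sanity check; the exact manifest path depends on your workspace_dir, data_split, and output_manifest_file settings:

    # Hypothetical post-run check; adjust the path to wherever your
    # "${data_split}_final_manifest.json" was written.
    import json

    manifest_path = "train_final_manifest.json"

    num_entries = 0
    total_duration = 0.0
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            num_entries += 1
            total_duration += entry["duration"]

    print(f"{num_entries} utterances, {total_duration / 3600:.2f} hours")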

SDP overview

To learn more about SDP, have a look at our documentation.

Contributing

We welcome community contributions! Please refer to the CONTRIBUTING.md for the process.
