forked from NVIDIA/NeMo-speech-data-processor
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Committing config.yaml
* New processor for MLS-Portuguese
* New processor for converting flac audio files to wav format using ffmpeg
* Config file for MLS-Portuguese
* Update on MLS Portuguese
* s
* Adding support for processing MTedX dataset. Provided corresponding processors and config file.
* Adding processor for creating initial manifest file for Coraa Portuguese dataset
* Adding rarfile package in requirements
* Adding huggingface_hub in requirements
* Adding huggingface_hub in requirements
* Committing config file for preprocessing Coraa dataset for Portuguese
* Adding changes to config files
* Removing .idea files
* Added tests for datasets and made some changes in the code
* Adding data preparation for mtedx and coraa
* Committing small bug fix
* Config update
* Removing .swp file
* Documentation update for new datasets
* Committing some doc changes
* Changing requirements
* Update configs
* Update docs/src/sdp/existing_configs.rst
* Update dataset_configs/portuguese/mcv/config.yaml
* Update dataset_configs/portuguese/mls/config.yaml
* Update dataset_configs/portuguese/mtedx/config.yaml
* Committing config changes
* Committing changes for new processors
* Changes for SplitByVttSentence class
* Small docstring change
* Adding new lines between functions
* Removing empty file
* Adding needed space
* Removing empty file
* Removing repeated class
* Some changes
---------
Signed-off-by: nune-tadevosyan <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
1 parent e63542f, commit 0b59e8a
Showing 19 changed files with 960 additions and 42 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
documentation: | | ||
Coraa Portuguese | ||
################ | ||
The config performs the following data processing. | ||
1. Downloads and extracts all the data from https://huggingface.co/datasets/gabrielrstan/CORAA-v1.1/tree/main. | ||
2. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions. | ||
3. Drops any utterances whose duration or character rate is too high or too low. | ||
4. Drops any data that contains symbols not in the supported alphabet. | ||
**Required arguments**. | ||
* **workspace_dir**: specify the workspace folder where all audio files will be stored. | ||
* **data_split**: should be "train", "dev" or "test". | ||
**Output format**. | ||
This config dumps the final manifest at the path given by ``final_manifest``. | ||
The output manifest contains the following fields: | ||
* **audio_filepath (str)**: relative path to the audio files. | ||
* **text (str)**: transcription, including punctuation ".,?" and capitalization. | ||
* **duration (float)**: audio duration in seconds. | ||
processors_to_run: all | ||
workspace_dir: ??? | ||
data_split: ??? | ||
final_manifest: ??? | ||
|
||
|
||
processors: | ||
- _target_: sdp.processors.CreateInitialManifestCORAA | ||
raw_data_dir: ${workspace_dir} | ||
data_split: ${data_split} | ||
extract_archive_dir: ${workspace_dir}/extracted | ||
resampled_audio_dir: ${workspace_dir}/extracted/16k | ||
already_downloaded: false | ||
already_extracted: false | ||
output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json | ||
|
||
- _target_: sdp.processors.SubRegex | ||
regex_params_list: | ||
- {"pattern": "(Aplausos)", "repl": " "} | ||
- {"pattern": "(Risos)", "repl": " "} | ||
- {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "} | ||
- {"pattern": "'", "repl": " "} | ||
- {"pattern": '[\$\&\¡\(\)]', "repl": " "} | ||
- {"pattern": '[\«\°\´\·\»]', "repl": " "} | ||
- {"pattern": '[\‘\’\“\”\„]', "repl": " "} | ||
- {"pattern": '[\:\;\`\ʻ]', "repl": " "} | ||
- {"pattern": "!", "repl": "."} | ||
- {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
- {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
|
||
# remove remaining repeated periods since most of the time they are unnecessary in this data | ||
- {"pattern": "\\.{2,20}", "repl": " "} | ||
|
||
# '\\.' matches a literal period; an unescaped '.' would match any character | ||
- {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '} | ||
- {"pattern": " ([Ss])ra\\.", "repl" : ' \1enhora'} | ||
- {"pattern": " ([Ss])rta\\.", "repl": ' \1enhorita'} | ||
- {"pattern": " ([Ss])r\\.", "repl": ' \1enhor'} | ||
- {"pattern": " ([Dd])r ", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])r\\.", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])ra ", "repl" : ' \1outora '} | ||
|
||
- {"pattern": " um km ", "repl" : " um quilômetro "} | ||
- {"pattern": " km ", "repl" : " quilômetros "} | ||
|
||
- _target_: sdp.processors.DropHighLowDuration | ||
high_duration_threshold: 20 | ||
low_duration_threshold: 0.5 | ||
|
||
- _target_: sdp.processors.DropHighLowCharrate | ||
high_charrate_threshold: 21 | ||
low_charrate_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropNonAlphabet | ||
output_manifest_file: ${final_manifest} | ||
alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?" |
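The ordering of the substitution rules in the config above matters: trailing ellipses are first turned into a single period, and only then are the remaining repeated periods replaced with a space. The following is a minimal sketch of that behaviour in plain Python, not the actual `SubRegex` implementation. The `RULES` subset, the `apply_rules` helper, and the final whitespace clean-up are assumptions made for illustration; the padding with spaces follows the comments in the config.

```python
import re

# Hypothetical subset of the rules from the config, written as Python raw
# strings instead of YAML-escaped strings. Order is significant.
RULES = [
    {"pattern": r"!", "repl": "."},
    {"pattern": r"…\s$", "repl": "."},
    {"pattern": r"\.{2,20}\s$", "repl": "."},
    {"pattern": r"\.{2,20}", "repl": " "},
    {"pattern": r" um km ", "repl": " um quilômetro "},
    {"pattern": r" km ", "repl": " quilômetros "},
]


def apply_rules(text: str, rules=RULES) -> str:
    """Apply each substitution in order, then tidy up whitespace."""
    # Assumption: the text is padded with one space on each side first,
    # which is why rules can anchor on "\s$" and on " km ".
    text = f" {text} "
    for rule in rules:
        text = re.sub(rule["pattern"], rule["repl"], text)
    # Assumption: runs of whitespace are collapsed and the padding removed.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(apply_rules("Espera... não sei..."))
    print(apply_rules("Corri 5 km hoje!"))
```

Because the rules run sequentially, moving the generic `\.{2,20}` rule ahead of the end-of-line rule would also strip sentence-final ellipses instead of converting them to a period.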
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
documentation: | | ||
MCV Portuguese | ||
############## | ||
This config was originally designed for the | ||
`Mozilla Common Voice (MCV) <https://commonvoice.mozilla.org/>`_ dataset | ||
15.0 release, but should work for any subsequent releases as well. | ||
It performs the following data processing. | ||
1. Extracts and converts all data to the NeMo format. | ||
2. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions. | ||
3. Drops any utterances whose duration or character rate is too high or too low. | ||
4. Drops any data that contains symbols not in the supported alphabet. | ||
**Required arguments**. | ||
* **workspace_dir**: specify the workspace folder where all audio files will be stored. | ||
You need to manually place the downloaded MCV Portuguese data inside | ||
``<workspace dir>/raw_data/`` subfolder. | ||
* **data_split**: should be "train", "dev" or "test". | ||
**Output format**. | ||
This config dumps the final manifest at the path given by ``final_manifest``. | ||
The output manifest contains the following fields: | ||
* **audio_filepath (str)**: relative path to the audio files. | ||
* **text (str)**: transcription, including punctuation ".,?" and capitalization. | ||
* **duration (float)**: audio duration in seconds. | ||
processors_to_run: all | ||
workspace_dir: ??? | ||
data_split: ??? | ||
final_manifest: ??? | ||
|
||
|
||
processors: | ||
- _target_: sdp.processors.CreateInitialManifestMCV | ||
raw_data_dir: ${workspace_dir}/raw_data | ||
extract_archive_dir: ${workspace_dir}/raw | ||
resampled_audio_dir: ${workspace_dir}/${data_split}/audio | ||
data_split: ${data_split} | ||
language_id: pt | ||
output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json | ||
|
||
- _target_: sdp.processors.SubRegex | ||
regex_params_list: | ||
- {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "} | ||
- {"pattern": "'", "repl": " "} | ||
- {"pattern": '[\$\&\¡\(\)]', "repl": " "} | ||
- {"pattern": '[\«\°\´\·\»]', "repl": " "} | ||
- {"pattern": '[\‘\’\“\”\„]', "repl": " "} | ||
- {"pattern": '[\:\;\`\ʻ]', "repl": " "} | ||
- {"pattern": "!", "repl": "."} | ||
- {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
- {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
|
||
# remove remaining repeated periods since most of the time they are unnecessary in this data | ||
- {"pattern": "\\.{2,20}", "repl": " "} | ||
|
||
# '\\.' matches a literal period; an unescaped '.' would match any character | ||
- {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '} | ||
- {"pattern": " ([Ss])ra\\.", "repl" : ' \1enhora'} | ||
- {"pattern": " ([Ss])rta\\.", "repl": ' \1enhorita'} | ||
- {"pattern": " ([Ss])r\\.", "repl": ' \1enhor'} | ||
- {"pattern": " ([Dd])r ", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])r\\.", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])ra ", "repl" : ' \1outora '} | ||
|
||
- {"pattern": " um km ", "repl" : " um quilômetro "} | ||
- {"pattern": " km ", "repl" : " quilômetros "} | ||
|
||
- _target_: sdp.processors.DropHighLowCharrate | ||
high_charrate_threshold: 21 | ||
low_charrate_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropHighLowDuration | ||
high_duration_threshold: 16 | ||
low_duration_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropNonAlphabet | ||
output_manifest_file: ${final_manifest} | ||
alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?" | ||
|
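The three `Drop*` processors at the end of the config above act as successive filters on manifest entries. The sketch below shows one plausible reading of them using the MCV thresholds. It assumes that character rate means characters per second and that the bounds are inclusive; neither detail is confirmed by this diff, and `keep_entry` is a hypothetical helper rather than part of SDP.

```python
# Alphabet copied from the DropNonAlphabet processor in the config.
ALPHABET = set(
    " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúü"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
)


def keep_entry(entry, min_dur=1.0, max_dur=16.0, min_rate=1.0, max_rate=21.0):
    """Return True if a manifest entry passes all three filters."""
    duration = entry["duration"]
    text = entry["text"]

    # Duration filter (thresholds in seconds).
    if not (min_dur <= duration <= max_dur):
        return False

    # Character-rate filter. Assumption: rate = characters per second.
    # The duration check above guarantees duration > 0 here.
    rate = len(text) / duration
    if not (min_rate <= rate <= max_rate):
        return False

    # Alphabet filter: every character must be in the supported set.
    return all(ch in ALPHABET for ch in text)


if __name__ == "__main__":
    sample = {"audio_filepath": "a.wav", "text": "Bom dia, tudo bem?", "duration": 2.0}
    print(keep_entry(sample))
```

Running the text normalization before these filters is what makes the alphabet check useful: characters that the regex rules could not map to a supported equivalent cause the whole utterance to be dropped.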
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
documentation: | | ||
MLS Portuguese | ||
############## | ||
The config performs the following data processing. | ||
1. Downloads and extracts all the Portuguese data from https://www.openslr.org/94/. | ||
2. Converts all flac audio files to wav format. | ||
3. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions. | ||
4. Drops any utterances whose duration or character rate is too high or too low. | ||
5. Drops any data that contains symbols not in the supported alphabet. | ||
**Required arguments**. | ||
* **workspace_dir**: specify the workspace folder where all audio files will be stored. | ||
* **data_split**: should be "train", "dev" or "test". | ||
**Output format**. | ||
This config dumps the final manifest at the path given by ``final_manifest``. | ||
The output manifest contains the following fields: | ||
* **audio_filepath (str)**: relative path to the audio files. | ||
* **text (str)**: transcription, including punctuation ".,?" and capitalization. | ||
* **duration (float)**: audio duration in seconds. | ||
processors_to_run: all | ||
workspace_dir: ??? | ||
data_split: ??? | ||
final_manifest: ??? | ||
|
||
processors: | ||
- _target_: sdp.processors.CreateInitialManifestMLS | ||
output_manifest_file: ${workspace_dir}/mls_portuguese_processed/${data_split}_manifest.json | ||
raw_data_dir: ${workspace_dir} | ||
language: portuguese | ||
resampled_audio_dir: "" # left empty so that the audio is converted with ffmpeg in the next processor | ||
data_split: ${data_split} | ||
|
||
- _target_: sdp.processors.FfmpegConvert | ||
resampled_audio_dir: ${workspace_dir}/resampled | ||
input_field: audio_filepath | ||
output_field: audio_filepath | ||
|
||
- _target_: sdp.processors.SubRegex | ||
regex_params_list: | ||
- {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "} | ||
- {"pattern": "'", "repl": " "} | ||
- {"pattern": '[\$\&\¡\(\)]', "repl": " "} | ||
- {"pattern": '[\«\°\´\·\»]', "repl": " "} | ||
- {"pattern": '[\‘\’\“\”\„]', "repl": " "} | ||
- {"pattern": '[\:\;\`\ʻ]', "repl": " "} | ||
- {"pattern": "!", "repl": "."} | ||
- {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
- {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
|
||
# remove remaining repeated periods since most of the time they are unnecessary in this data | ||
- {"pattern": "\\.{2,20}", "repl": " "} | ||
|
||
# '\\.' matches a literal period; an unescaped '.' would match any character | ||
- {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '} | ||
- {"pattern": " ([Ss])ra\\.", "repl" : ' \1enhora'} | ||
- {"pattern": " ([Ss])rta\\.", "repl": ' \1enhorita'} | ||
- {"pattern": " ([Ss])r\\.", "repl": ' \1enhor'} | ||
- {"pattern": " ([Dd])r ", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])r\\.", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])ra ", "repl" : ' \1outora '} | ||
|
||
- {"pattern": " um km ", "repl" : " um quilômetro "} | ||
- {"pattern": " km ", "repl" : " quilômetros "} | ||
- _target_: sdp.processors.DropHighLowCharrate | ||
high_charrate_threshold: 21 | ||
low_charrate_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropHighLowDuration | ||
high_duration_threshold: 20 | ||
low_duration_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropNonAlphabet | ||
output_manifest_file: ${final_manifest} | ||
alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?" |
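The flac-to-wav step in the config above delegates to ffmpeg. The exact command that `FfmpegConvert` runs is not shown in this diff, so the sketch below only illustrates a typical invocation. The helper name `build_ffmpeg_command`, the 16 kHz sample rate, and the mono channel count are assumptions; `-y`, `-i`, `-ac`, and `-ar` are standard ffmpeg options.

```python
from pathlib import Path


def build_ffmpeg_command(src, dst_dir, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that converts one audio file to wav.

    Returns the command and the destination path. Nothing is executed here.
    """
    src = Path(src)
    # Keep the original file stem and swap the extension for .wav.
    dst = Path(dst_dir) / (src.stem + ".wav")
    cmd = [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(src),          # input file
        "-ac", str(channels),    # number of audio channels (assumed mono)
        "-ar", str(sample_rate), # output sample rate in Hz (assumed 16 kHz)
        str(dst),
    ]
    return cmd, dst


if __name__ == "__main__":
    command, target = build_ffmpeg_command("raw/utt1.flac", "resampled")
    print(" ".join(command))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is installed. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.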
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
documentation: | | ||
MTEDX Portuguese | ||
################ | ||
The config performs the following data processing. | ||
1. Downloads and extracts the Portuguese data from https://www.openslr.org/100/. | ||
2. Converts all flac audio files to wav format. | ||
3. Splits audio by the given time steps in vtt files. | ||
4. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions. | ||
5. Drops any utterances whose duration or character rate is too high or too low. | ||
6. Drops any data that contains symbols not in the supported alphabet. | ||
**Required arguments**. | ||
* **workspace_dir**: specify the workspace folder where all audio files will be stored. | ||
* **raw_data_dir**: specify the folder into which the data will be downloaded. | ||
* **data_split**: should be "train", "valid" or "test". | ||
**Output format**. | ||
This config dumps the final manifest at the path given by ``final_manifest``. | ||
The output manifest contains the following fields: | ||
* **audio_filepath (str)**: relative path to the audio files. | ||
* **text (str)**: transcription, including punctuation ".,?" and capitalization. | ||
* **duration (float)**: audio duration in seconds. | ||
processors_to_run: all | ||
workspace_dir: ??? | ||
data_split: ??? | ||
final_manifest: ??? | ||
|
||
|
||
processors: | ||
- _target_: sdp.processors.CreateInitialManifestMTEDX | ||
raw_data_dir: ${workspace_dir}/raw_data | ||
data_split: ${data_split} | ||
language_id: pt | ||
already_extracted: False | ||
output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json | ||
|
||
- _target_: sdp.processors.FfmpegConvert | ||
resampled_audio_dir: ${workspace_dir}/resampled | ||
input_field: audio_filepath | ||
output_field: audio_filepath | ||
|
||
- _target_: sdp.processors.datasets.commoncrawl.SplitByVttSentence | ||
output_manifest_file: ${workspace_dir}/manifest_vtt.json | ||
input_manifest_file: ${workspace_dir}/${data_split}_manifest0.json | ||
splited_audio_dir: ${workspace_dir}/splited | ||
source_audio_field: audio_filepath | ||
target_audio_field: audio_filepath | ||
duration_field: duration | ||
text_field: text | ||
vtt_field: vtt_filepath | ||
proxy_fields: [] | ||
duration_threshold: 20.0 | ||
|
||
- _target_: sdp.processors.SubRegex | ||
regex_params_list: | ||
- {"pattern": "(Aplausos)", "repl": " "} | ||
- {"pattern": "(Risos)", "repl": " "} | ||
- {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "} | ||
- {"pattern": "'", "repl": " "} | ||
- {"pattern": '[\$\&\¡\(\)]', "repl": " "} | ||
- {"pattern": '[\«\°\´\·\»]', "repl": " "} | ||
- {"pattern": '[\‘\’\“\”\„]', "repl": " "} | ||
- {"pattern": '[\:\;\`\ʻ]', "repl": " "} | ||
- {"pattern": "!", "repl": "."} | ||
- {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
- {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces that SDP inserts at the start and end of the text | ||
|
||
# remove remaining repeated periods since most of the time they are unnecessary in this data | ||
- {"pattern": "\\.{2,20}", "repl": " "} | ||
|
||
# '\\.' matches a literal period; an unescaped '.' would match any character | ||
- {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '} | ||
- {"pattern": " ([Ss])ra\\.", "repl" : ' \1enhora'} | ||
- {"pattern": " ([Ss])rta\\.", "repl": ' \1enhorita'} | ||
- {"pattern": " ([Ss])r\\.", "repl": ' \1enhor'} | ||
- {"pattern": " ([Dd])r ", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])r\\.", "repl" : ' \1outor '} | ||
- {"pattern": " ([Dd])ra ", "repl" : ' \1outora '} | ||
|
||
- {"pattern": " um km ", "repl" : " um quilômetro "} | ||
- {"pattern": " km ", "repl" : " quilômetros "} | ||
|
||
- _target_: sdp.processors.DropHighLowDuration | ||
high_duration_threshold: 20 | ||
low_duration_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropHighLowCharrate | ||
high_charrate_threshold: 21 | ||
low_charrate_threshold: 1 | ||
|
||
- _target_: sdp.processors.DropNonAlphabet | ||
output_manifest_file: ${final_manifest} | ||
alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?" | ||
|
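Splitting audio by the time steps in the vtt files, as the MTEDX config above does, starts with reading the cue timings. The sketch below parses WebVTT cue lines into `(start, end, text)` tuples with times in seconds. It is an illustration only and is not taken from `SplitByVttSentence`; the names `CUE_RE`, `to_seconds`, and `parse_cues` are hypothetical, and cue settings or styling blocks are not handled.

```python
import re

# A WebVTT timestamp is [hh:]mm:ss.ttt; the hours part is optional.
TIMESTAMP = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
CUE_RE = re.compile(TIMESTAMP + r"\s+-->\s+" + TIMESTAMP)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to seconds as a float."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines; the "WEBVTT" header block has no
    # timing line and is skipped automatically.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match:
                parts = match.groups()
                start = to_seconds(*parts[:4])
                end = to_seconds(*parts[4:])
                # Everything after the timing line is the cue payload.
                text = " ".join(l.strip() for l in lines[idx + 1:] if l.strip())
                cues.append((start, end, text))
                break
    return cues


if __name__ == "__main__":
    sample = "WEBVTT\n\n00:00:01.000 --> 00:00:04.500\nOlá, bom dia.\n"
    print(parse_cues(sample))
```

Each tuple gives the offsets needed to cut one segment out of the converted wav file, and the cue text becomes the transcript for that segment.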