Portuguese (NVIDIA#38)
* commiting config.yaml

Signed-off-by: nune-tadevosyan <[email protected]>

* New processor for MLS-Portuguese

Signed-off-by: nune-tadevosyan <[email protected]>

* New processor for converting flac audio files to wav format using ffmpeg

Signed-off-by: nune-tadevosyan <[email protected]>

* Config file for MLS-Portuguese

Signed-off-by: nune-tadevosyan <[email protected]>

* Update on MLS Portuguese

Signed-off-by: nune-tadevosyan <[email protected]>

* s

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding support for processing MTedX dataset. Provided corresponding processors and config file.

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding processor for creating initial manifest file for Coraa Portuguese dataset

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding rarfile package in requirements

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding huggingface_hub in requirements

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding huggingface_hub in requirements

Signed-off-by: nune-tadevosyan <[email protected]>

* Commiting config file for preprocessing Coraa dataset for Portuguese

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding changes to config files

Signed-off-by: nune-tadevosyan <[email protected]>

* Removing .idea files

Signed-off-by: nune-tadevosyan <[email protected]>

* Added tests for datasets and made some changes in the code

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding data preparation for mtedx and coraa

Signed-off-by: nune-tadevosyan <[email protected]>

* Commiting small bug fix

Signed-off-by: nune-tadevosyan <[email protected]>

* Config update

Signed-off-by: nune-tadevosyan <[email protected]>

* Removing .swp file

Signed-off-by: nune-tadevosyan <[email protected]>

* Documentation update for new datasets

Signed-off-by: nune-tadevosyan <[email protected]>

* Commiting some doc changes

Signed-off-by: nune-tadevosyan <[email protected]>

* Changing requirements

Signed-off-by: nune-tadevosyan <[email protected]>

* Update configs

Signed-off-by: nune-tadevosyan <[email protected]>

* Update docs/src/sdp/existing_configs.rst

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>

* Update dataset_configs/portuguese/mcv/config.yaml

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>

* Update dataset_configs/portuguese/mls/config.yaml

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>

* Update dataset_configs/portuguese/mtedx/config.yaml

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>

* Commiting config changes

Signed-off-by: nune-tadevosyan <[email protected]>

* Commiting changes for new processors

Signed-off-by: nune-tadevosyan <[email protected]>

* Changes for SplitByVttSentence class

Signed-off-by: nune-tadevosyan <[email protected]>

* Small docstring change

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding new lines between functions

Signed-off-by: nune-tadevosyan <[email protected]>

* Removing empty file

Signed-off-by: nune-tadevosyan <[email protected]>

* Adding needed space

Signed-off-by: nune-tadevosyan <[email protected]>

* Removing empty file

Signed-off-by: nune-tadevosyan <[email protected]>

* Removing repeated class

Signed-off-by: nune-tadevosyan <[email protected]>

* Some changes

Signed-off-by: nune-tadevosyan <[email protected]>

---------

Signed-off-by: nune-tadevosyan <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
nune-tadevosyan and erastorgueva-nv authored May 14, 2024
1 parent e63542f commit 0b59e8a
Showing 19 changed files with 960 additions and 42 deletions.
82 changes: 82 additions & 0 deletions dataset_configs/portuguese/coraa/config.yaml
@@ -0,0 +1,82 @@
documentation: |
  Coraa Portuguese
  ################

  The config performs the following data processing.

  1. Downloads and extracts all the data from "https://huggingface.co/datasets/gabrielrstan/CORAA-v1.1/tree/main".
  2. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
  3. Drops any data with an unusually high or low duration or character rate.
  4. Drops any data that contains symbols not in the supported alphabet.

  **Required arguments**.

  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
  * **data_split**: should be "train", "dev" or "test".

  **Output format**.

  This config dumps the final manifest at ``${workspace_dir}/${data_split}_manifest.json``.
  The output manifest contains the following fields:

  * **audio_filepath (str)**: relative path to the audio files.
  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
  * **duration (float)**: audio duration in seconds.

processors_to_run: all
workspace_dir: ???
data_split: ???
final_manifest: ???


processors:
  - _target_: sdp.processors.CreateInitialManifestCORAA
    raw_data_dir: ${workspace_dir}
    data_split: ${data_split}
    extract_archive_dir: ${workspace_dir}/extracted
    resampled_audio_dir: ${workspace_dir}/extracted/16k
    already_downloaded: false
    already_extracted: false
    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json

  - _target_: sdp.processors.SubRegex
    regex_params_list:
      - {"pattern": "(Aplausos)", "repl": " "}
      - {"pattern": "(Risos)", "repl": " "}
      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
      - {"pattern": "'", "repl": " "}
      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
      - {"pattern": "!", "repl": "."}
      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text
      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text

      # remove remaining repeated periods since most of the time they are unnecessary in this data
      - {"pattern": "\\.{2,20}", "repl": " "}

      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
      - {"pattern": " ([Ss])ra.", "repl" : ' \1enhora'}
      - {"pattern": " ([Ss])rta.", "repl": '\1enhorita'}
      - {"pattern": " ([Ss])r.", 'repl': '\1enhor' }
      - {"pattern": " ([Dd])r ", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])r.", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])ra ", "repl" : ' \1octora '}

      - {"pattern": " um km ", "repl" : " um quilômetro "}
      - {"pattern": " km ", "repl" : " quilômetros "}

  - _target_: sdp.processors.DropHighLowDuration
    high_duration_threshold: 20
    low_duration_threshold: 0.5

  - _target_: sdp.processors.DropHighLowCharrate
    high_charrate_threshold: 21
    low_charrate_threshold: 1

  - _target_: sdp.processors.DropNonAlphabet
    output_manifest_file: ${final_manifest}
    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
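
To make the effect of the `SubRegex` rules concrete, here is a minimal Python sketch that applies a handful of the patterns from the Coraa config with the standard `re` module. It is not SDP's implementation: the `apply_rules` helper, the space padding, and the final whitespace clean-up are assumptions based on the comments in the config.

```python
import re

# A few rules copied from the config, in the same order.
# YAML "\\." corresponds to the regex \. and YAML '\1' to a backreference.
RULES = [
    {"pattern": "!", "repl": "."},
    {"pattern": r"\.{2,20}\s$", "repl": "."},
    {"pattern": r"\.{2,20}", "repl": " "},
    {"pattern": " ([Pp])rofa ", "repl": r" \1rofessora "},
    {"pattern": " um km ", "repl": " um quilômetro "},
    {"pattern": " km ", "repl": " quilômetros "},
]


def apply_rules(text, rules=RULES):
    """Hypothetical helper: pad with spaces, apply each rule in order, tidy up."""
    # Assumption: the text is padded so that patterns such as " km " also
    # match at the very start or end of an utterance.
    padded = f" {text} "
    for rule in rules:
        padded = re.sub(rule["pattern"], rule["repl"], padded)
    # Assumption: repeated whitespace is collapsed and the padding removed.
    return re.sub(r"\s+", " ", padded).strip()


print(apply_rules("A profa Ana correu um km hoje!"))
print(apply_rules("Espere..."))
```

Rule order matters here: the end-of-text rule for repeated periods has to run before the generic repeated-period rule, otherwise a trailing `...` would be replaced by a space instead of a single period.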
87 changes: 87 additions & 0 deletions dataset_configs/portuguese/mcv/config.yaml
@@ -0,0 +1,87 @@
documentation: |
  MCV Portuguese
  ##############

  This config was originally designed for the
  `Mozilla Common Voice (MCV) <https://commonvoice.mozilla.org/>`_ dataset
  15.0 release, but should work for any subsequent releases as well.

  It performs the following data processing.

  1. Extracts and converts all data to the NeMo format.
  2. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
  3. Drops any data with an unusually high or low duration or character rate.
  4. Drops any data that contains symbols not in the supported alphabet.

  **Required arguments**.

  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
    You need to manually place the downloaded MCV Portuguese data inside
    the ``<workspace dir>/raw_data/`` subfolder.
  * **data_split**: should be "train", "dev" or "test".

  **Output format**.

  This config dumps the final manifest at ``${workspace_dir}/${data_split}_manifest.json``.
  The output manifest contains the following fields:

  * **audio_filepath (str)**: relative path to the audio files.
  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
  * **duration (float)**: audio duration in seconds.

processors_to_run: all
workspace_dir: ???
data_split: ???
final_manifest: ???


processors:
  - _target_: sdp.processors.CreateInitialManifestMCV
    raw_data_dir: ${workspace_dir}/raw_data
    extract_archive_dir: ${workspace_dir}/raw
    resampled_audio_dir: ${workspace_dir}/${data_split}/audio
    data_split: ${data_split}
    language_id: pt
    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json

  - _target_: sdp.processors.SubRegex
    regex_params_list:
      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
      - {"pattern": "'", "repl": " "}
      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
      - {"pattern": "!", "repl": "."}
      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text
      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text

      # remove remaining repeated periods since most of the time they are unnecessary in this data
      - {"pattern": "\\.{2,20}", "repl": " "}

      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
      - {"pattern": " ([Ss])ra.", "repl" : ' \1enhora'}
      - {"pattern": " ([Ss])rta.", "repl": '\1enhorita'}
      - {"pattern": " ([Ss])r.", 'repl': '\1enhor' }
      - {"pattern": " ([Dd])r ", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])r.", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])ra ", "repl" : ' \1octora '}

      - {"pattern": " um km ", "repl" : " um quilômetro "}
      - {"pattern": " km ", "repl" : " quilômetros "}

  - _target_: sdp.processors.DropHighLowCharrate
    high_charrate_threshold: 21
    low_charrate_threshold: 1

  - _target_: sdp.processors.DropHighLowDuration
    high_duration_threshold: 16
    low_duration_threshold: 1

  - _target_: sdp.processors.DropNonAlphabet
    output_manifest_file: ${final_manifest}
    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
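
The two threshold filters in the MCV config can be pictured with the short Python sketch below. It is an illustration only, not the `DropHighLowCharrate` or `DropHighLowDuration` code: the `keep_entry` helper is hypothetical, the character rate is assumed to be the number of characters in `text` divided by `duration`, and values exactly equal to a threshold are assumed to be kept. The default thresholds are the ones from this config.

```python
def keep_entry(entry, min_duration=1.0, max_duration=16.0,
               min_charrate=1.0, max_charrate=21.0):
    """Hypothetical filter: True if a manifest entry passes both threshold checks."""
    duration = entry["duration"]
    # Duration check: utterances that are too short or too long are dropped.
    if not (min_duration <= duration <= max_duration):
        return False
    # Character-rate check (assumed definition): characters per second of audio.
    charrate = len(entry["text"]) / duration
    return min_charrate <= charrate <= max_charrate


entries = [
    {"audio_filepath": "a.wav", "text": "a" * 30, "duration": 2.0},   # 15 chars/s
    {"audio_filepath": "b.wav", "text": "a" * 100, "duration": 2.0},  # 50 chars/s
    {"audio_filepath": "c.wav", "text": "a" * 30, "duration": 0.5},   # too short
]
kept = [e["audio_filepath"] for e in entries if keep_entry(e)]
print(kept)
```

A very high character rate usually points to a transcript that is longer than the audio it is attached to, and a very low one to audio that is mostly silence or untranscribed speech, which is why both ends are filtered.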

81 changes: 81 additions & 0 deletions dataset_configs/portuguese/mls/config.yaml
@@ -0,0 +1,81 @@
documentation: |
  MLS Portuguese
  ##############

  The config performs the following data processing.

  1. Downloads and extracts all the Portuguese data from "https://www.openslr.org/94/".
  2. Converts all flac audio files to wav format.
  3. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
  4. Drops any data with an unusually high or low duration or character rate.
  5. Drops any data that contains symbols not in the supported alphabet.

  **Required arguments**.

  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
  * **data_split**: should be "train", "dev" or "test".

  **Output format**.

  This config dumps the final manifest at ``${workspace_dir}/${data_split}_manifest.json``.
  The output manifest contains the following fields:

  * **audio_filepath (str)**: relative path to the audio files.
  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
  * **duration (float)**: audio duration in seconds.

processors_to_run: all
workspace_dir: ???
data_split: ???
final_manifest: ???

processors:
  - _target_: sdp.processors.CreateInitialManifestMLS
    output_manifest_file: ${workspace_dir}/mls_portuguese_processed/${data_split}_manifest.json
    raw_data_dir: ${workspace_dir}
    language: portuguese
    resampled_audio_dir: "" # left empty so that the audio is converted with ffmpeg in the next step
    data_split: ${data_split}

  - _target_: sdp.processors.FfmpegConvert
    resampled_audio_dir: ${workspace_dir}/resampled
    input_field: audio_filepath
    output_field: audio_filepath

  - _target_: sdp.processors.SubRegex
    regex_params_list:
      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
      - {"pattern": "'", "repl": " "}
      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
      - {"pattern": "!", "repl": "."}
      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text
      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text

      # remove remaining repeated periods since most of the time they are unnecessary in this data
      - {"pattern": "\\.{2,20}", "repl": " "}

      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
      - {"pattern": " ([Ss])ra.", "repl" : ' \1enhora'}
      - {"pattern": " ([Ss])rta.", "repl": '\1enhorita'}
      - {"pattern": " ([Ss])r.", 'repl': '\1enhor' }
      - {"pattern": " ([Dd])r ", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])r.", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])ra ", "repl" : ' \1octora '}

      - {"pattern": " um km ", "repl" : " um quilômetro "}
      - {"pattern": " km ", "repl" : " quilômetros "}

  - _target_: sdp.processors.DropHighLowCharrate
    high_charrate_threshold: 21
    low_charrate_threshold: 1

  - _target_: sdp.processors.DropHighLowDuration
    high_duration_threshold: 20
    low_duration_threshold: 1

  - _target_: sdp.processors.DropNonAlphabet
    output_manifest_file: ${final_manifest}
    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
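
The flac-to-wav step in the MLS config delegates the conversion to ffmpeg. The sketch below shows roughly what such a call could look like from Python. It is not the `FfmpegConvert` implementation: the `build_ffmpeg_command` helper is hypothetical, and the 16 kHz mono target is an assumption (the diff does not state which sample rate or channel count the processor uses). Only standard ffmpeg options are used (`-y`, `-i`, `-ac`, `-ar`).

```python
from pathlib import Path


def build_ffmpeg_command(src, out_dir, sample_rate=16000, channels=1):
    """Hypothetical helper: build an ffmpeg command that converts `src` to wav."""
    src = Path(src)
    # The output keeps the input file name and swaps the extension for .wav.
    dst = Path(out_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input file (e.g. a .flac recording)
        "-ac", str(channels),     # number of audio channels (assumed mono)
        "-ar", str(sample_rate),  # output sample rate in Hz (assumed 16 kHz)
        str(dst),
    ]


cmd = build_ffmpeg_command("data/audio/clip.flac", "resampled")
print(" ".join(cmd))
```

To actually run the conversion, the list can be passed to `subprocess.run(cmd, check=True)`, provided ffmpeg is installed and the output directory exists.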
101 changes: 101 additions & 0 deletions dataset_configs/portuguese/mtedx/config.yaml
@@ -0,0 +1,101 @@
documentation: |
  MTEDX Portuguese
  ################

  The config performs the following data processing.

  1. Downloads and extracts the Portuguese data from "https://www.openslr.org/100/".
  2. Converts all flac audio files to wav format.
  3. Splits audio by the given time steps in vtt files.
  4. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
  5. Drops any data with an unusually high or low duration or character rate.
  6. Drops any data that contains symbols not in the supported alphabet.

  **Required arguments**.

  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
  * **raw_data_dir**: specify the folder into which the data will be downloaded.
  * **data_split**: should be "train", "valid" or "test".

  **Output format**.

  This config dumps the final manifest at ``${workspace_dir}/${data_split}_manifest.json``.
  The output manifest contains the following fields:

  * **audio_filepath (str)**: relative path to the audio files.
  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
  * **duration (float)**: audio duration in seconds.

processors_to_run: all
workspace_dir: ???
data_split: ???
final_manifest: ???


processors:
  - _target_: sdp.processors.CreateInitialManifestMTEDX
    raw_data_dir: ${workspace_dir}/raw_data
    data_split: ${data_split}
    language_id: pt
    already_extracted: False
    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json

  - _target_: sdp.processors.FfmpegConvert
    resampled_audio_dir: ${workspace_dir}/resampled
    input_field: audio_filepath
    output_field: audio_filepath

  - _target_: sdp.processors.datasets.commoncrawl.SplitByVttSentence
    output_manifest_file: ${workspace_dir}/manifest_vtt.json
    input_manifest_file: ${workspace_dir}/${data_split}_manifest0.json
    splited_audio_dir: ${workspace_dir}/splited
    source_audio_field: audio_filepath
    target_audio_field: audio_filepath
    duration_field: duration
    text_field: text
    vtt_field: vtt_filepath
    proxy_fields: []
    duration_threshold: 20.0

  - _target_: sdp.processors.SubRegex
    regex_params_list:
      - {"pattern": "(Aplausos)", "repl": " "}
      - {"pattern": "(Risos)", "repl": " "}
      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
      - {"pattern": "'", "repl": " "}
      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
      - {"pattern": "!", "repl": "."}
      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text
      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the spaces SDP inserts at the start and end of the text

      # remove remaining repeated periods since most of the time they are unnecessary in this data
      - {"pattern": "\\.{2,20}", "repl": " "}

      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
      - {"pattern": " ([Ss])ra.", "repl" : ' \1enhora'}
      - {"pattern": " ([Ss])rta.", "repl": '\1enhorita'}
      - {"pattern": " ([Ss])r.", 'repl': '\1enhor' }
      - {"pattern": " ([Dd])r ", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])r.", "repl" : ' \1octor '}
      - {"pattern": " ([Dd])ra ", "repl" : ' \1octora '}

      - {"pattern": " um km ", "repl" : " um quilômetro "}
      - {"pattern": " km ", "repl" : " quilômetros "}

  - _target_: sdp.processors.DropHighLowDuration
    high_duration_threshold: 20
    low_duration_threshold: 1

  - _target_: sdp.processors.DropHighLowCharrate
    high_charrate_threshold: 21
    low_charrate_threshold: 1

  - _target_: sdp.processors.DropNonAlphabet
    output_manifest_file: ${final_manifest}
    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
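
The MTEDX config splits each talk using the timings stored in its vtt file. As a rough illustration of what reading those timings involves, the Python sketch below parses cue timings and text from a WebVTT string. It is not the `SplitByVttSentence` implementation, which may merge or split cues differently. The `parse_cues` and `to_seconds` helpers are hypothetical, and timestamps are assumed to always use the full `HH:MM:SS.mmm` form.

```python
import re

# Assumption: every cue timing line uses the full HH:MM:SS.mmm form.
CUE_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert timestamp parts (as strings) to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt_text):
    """Hypothetical helper: return a list of {start, end, text} dicts, one per cue."""
    cues = []
    # Cues are separated by blank lines; the header block has no timing line.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                parts = match.groups()
                text = " ".join(l.strip() for l in lines[idx + 1:]).strip()
                if text:
                    cues.append({
                        "start": to_seconds(*parts[:4]),
                        "end": to_seconds(*parts[4:]),
                        "text": text,
                    })
                break
    return cues


SAMPLE = "WEBVTT\n\n00:00:01.000 --> 00:00:04.500\nOlá a todos.\n\n00:00:05.000 --> 00:00:07.250\nObrigado.\n"
for cue in parse_cues(SAMPLE):
    print(cue)
```

Each cue's `start` and `end` could then be used to cut the corresponding audio segment, with segments longer than `duration_threshold` (20 seconds in this config) handled separately.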

12 changes: 12 additions & 0 deletions docs/src/sdp/api.rst
@@ -69,6 +69,18 @@ SLR83
.. autodata:: sdp.processors.CustomDataSplitSLR83
   :annotation:

MTEDx
'''''

.. autodata:: sdp.processors.CreateInitialManifestMTEDX
   :annotation:

Coraa
'''''

.. autodata:: sdp.processors.CreateInitialManifestCORAA
   :annotation:

.. TODO: Fisher config is not accessible - should we require moving everything to SDP
.. Probably need some policy on what lives in main folder vs configs.
.. To control the number of processors we support.