Portuguese (NVIDIA#38)

* Committing config.yaml
* New processor for MLS-Portuguese
* New processor for converting flac audio files to wav format using ffmpeg
* Config file for MLS-Portuguese
* Update on MLS Portuguese
* s
* Adding support for processing MTedX dataset. Provided corresponding processors and config file.
* Adding processor for creating initial manifest file for Coraa Portuguese dataset
* Adding rarfile package in requirements
* Adding huggingface_hub in requirements
* Adding huggingface_hub in requirements
* Committing config file for preprocessing Coraa dataset for Portuguese
* Adding changes to config files
* Removing .idea files
* Added tests for datasets and made some changes in the code
* Adding data preparation for mtedx and coraa
* Committing small bug fix
* Config update
* Removing .swp file
* Documentation update for new datasets
* Committing some doc changes
* Changing requirements
* Update configs
* Update docs/src/sdp/existing_configs.rst
* Update dataset_configs/portuguese/mcv/config.yaml
* Update dataset_configs/portuguese/mls/config.yaml
* Update dataset_configs/portuguese/mtedx/config.yaml
* Committing config changes
* Committing changes for new processors
* Changes for SplitByVttSentence class
* Small docstring change
* Adding new lines between functions
* Removing empty file
* Adding needed space
* Removing empty file
* Removing repeated class
* Some changes

---------

Signed-off-by: nune-tadevosyan <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
SmartDigitalNetworks · May 14, 2024 · 0b59e8a
1 parent e63542f
commit 0b59e8a
Showing 19 changed files with 960 additions and 42 deletions.
diff --git a/dataset_configs/portuguese/coraa/config.yaml b/dataset_configs/portuguese/coraa/config.yaml
@@ -0,0 +1,82 @@
+documentation: |
+  Coraa Portuguese
+  ################
+
+  This config performs the following data processing.
+
+  1. Downloads and extracts all the data from https://huggingface.co/datasets/gabrielrstan/CORAA-v1.1/tree/main.
+  2. Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.
+  3. Drops any data with an abnormally high or low duration or character rate.
+  4. Drops any data that contains symbols not in the supported alphabet.
+
+  **Required arguments**.
+
+  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
+  * **data_split**: should be "train", "dev" or "test".
+
+  **Output format**.
+
+  This config dumps the final manifest at the path given by ``final_manifest``, for example ``${workspace_dir}/${data_split}_manifest.json``.
+  The output manifest contains the following fields:
+
+  * **audio_filepath (str)**: relative path to the audio files.
+  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
+  * **duration (float)**: audio duration in seconds.
+  
+
+processors_to_run: all 
+workspace_dir: ???
+data_split: ???
+final_manifest: ???
+
+
+processors:
+  - _target_: sdp.processors.CreateInitialManifestCORAA
+    raw_data_dir: ${workspace_dir}
+    data_split: ${data_split}
+    extract_archive_dir: ${workspace_dir}/extracted
+    resampled_audio_dir:  ${workspace_dir}/extracted/16k
+    already_downloaded: false
+    already_extracted: false
+    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json
+
+  - _target_: sdp.processors.SubRegex
+    regex_params_list:
+      - {"pattern": "(Aplausos)", "repl": " "}
+      - {"pattern": "(Risos)", "repl": " "}
+      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
+      - {"pattern": "'", "repl": " "}
+      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
+      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
+      - {"pattern": "!", "repl": "."}
+      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+
+      # remove remaining repeated periods since most of the time they are unnecessary in this data
+      - {"pattern": "\\.{2,20}", "repl": " "}
+
+      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
+      - {"pattern": ' ([Ss])ra\.', "repl" : ' \1enhora'}
+      - {"pattern": ' ([Ss])rta\.', "repl": ' \1enhorita'}
+      - {"pattern": ' ([Ss])r\.', 'repl': ' \1enhor' }
+      - {"pattern": " ([Dd])r ", "repl" : ' \1outor '}
+      - {"pattern": ' ([Dd])r\.', "repl" : ' \1outor '}
+      - {"pattern": " ([Dd])ra ", "repl" : ' \1outora '}
+
+      - {"pattern": " um km ", "repl" : " um quilômetro "}
+      - {"pattern": " km ", "repl" : " quilômetros "}
+
+  - _target_: sdp.processors.DropHighLowDuration
+    high_duration_threshold: 20
+    low_duration_threshold: 0.5
+
+  - _target_: sdp.processors.DropHighLowCharrate
+    high_charrate_threshold: 21
+    low_charrate_threshold: 1
+
+  - _target_: sdp.processors.DropNonAlphabet
+    output_manifest_file: ${final_manifest}
+    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
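
The substitution rules in the `SubRegex` step can be tried outside SDP with plain `re.sub`. The sketch below is not SDP's implementation: `apply_rules` is a hypothetical helper, and the padding of the text with one space on each side (then stripping it again) is an assumption taken from the comment in the config. Only a handful of rules are copied over.

```python
import re

# A few rules copied from the config above. How SDP applies them internally
# is an assumption; this helper only imitates the documented padding.
RULES = [
    {"pattern": "!", "repl": "."},
    {"pattern": r"\.{2,20}\s$", "repl": "."},
    {"pattern": r"\.{2,20}", "repl": " "},
    {"pattern": r" ([Pp])rofa ", "repl": r" \1rofessora "},
    {"pattern": " um km ", "repl": " um quilômetro "},
    {"pattern": " km ", "repl": " quilômetros "},
]


def apply_rules(text, rules=RULES):
    """Apply the rules in order to a space-padded copy of `text`."""
    padded = f" {text} "  # assumed: SDP pads with spaces before matching
    for rule in rules:
        padded = re.sub(rule["pattern"], rule["repl"], padded)
    return padded.strip()


if __name__ == "__main__":
    for sample in ["Andei um km hoje!", "A profa chegou...", "dez km daqui"]:
        print(repr(sample), "->", repr(apply_rules(sample)))
```

Rule order matters: the end-of-line ellipsis rule has to run before the generic repeated-period rule, otherwise a trailing `...` would become a space instead of a single period.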
diff --git a/dataset_configs/portuguese/mcv/config.yaml b/dataset_configs/portuguese/mcv/config.yaml
@@ -0,0 +1,87 @@
+documentation: |
+  MCV Portuguese
+  ##############
+
+  This config was originally designed for the
+  `Mozilla Common Voice (MCV) <https://commonvoice.mozilla.org/>`_ dataset
+  15.0 release, but should work for any subsequent releases as well.
+
+  It performs the following data processing.
+
+  1. Extracts and converts all data to the NeMo format.
+  2. Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.
+  3. Drops any data with an abnormally high or low duration or character rate.
+  4. Drops any data that contains symbols not in the supported alphabet.
+
+  **Required arguments**.
+
+  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
+    You need to manually place the downloaded MCV Portuguese data inside
+    the ``<workspace_dir>/raw_data/`` subfolder.
+  * **data_split**: should be "train", "dev" or "test".
+
+  **Output format**.
+
+  This config dumps the final manifest at the path given by ``final_manifest``, for example ``${workspace_dir}/${data_split}_manifest.json``.
+  The output manifest contains the following fields:
+
+  * **audio_filepath (str)**: relative path to the audio files.
+  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
+  * **duration (float)**: audio duration in seconds.
+
+
+
+processors_to_run: all
+workspace_dir: ???
+data_split: ???
+final_manifest: ???
+
+
+processors:
+  - _target_: sdp.processors.CreateInitialManifestMCV
+    raw_data_dir: ${workspace_dir}/raw_data
+    extract_archive_dir: ${workspace_dir}/raw
+    resampled_audio_dir: ${workspace_dir}/${data_split}/audio
+    data_split: ${data_split}
+    language_id: pt
+    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json
+
+  - _target_: sdp.processors.SubRegex
+    regex_params_list:
+      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
+      - {"pattern": "'", "repl": " "}
+      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
+      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
+      - {"pattern": "!", "repl": "."}
+      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+
+      # remove remaining repeated periods since most of the time they are unnecessary in this data
+      - {"pattern": "\\.{2,20}", "repl": " "}
+
+      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
+      - {"pattern": ' ([Ss])ra\.', "repl" : ' \1enhora'}
+      - {"pattern": ' ([Ss])rta\.', "repl": ' \1enhorita'}
+      - {"pattern": ' ([Ss])r\.', 'repl': ' \1enhor' }
+      - {"pattern": " ([Dd])r ", "repl" : ' \1outor '}
+      - {"pattern": ' ([Dd])r\.', "repl" : ' \1outor '}
+      - {"pattern": " ([Dd])ra ", "repl" : ' \1outora '}
+
+      - {"pattern": " um km ", "repl" : " um quilômetro "}
+      - {"pattern": " km ", "repl" : " quilômetros "}
+
+  - _target_: sdp.processors.DropHighLowCharrate
+    high_charrate_threshold: 21
+    low_charrate_threshold: 1
+
+  - _target_: sdp.processors.DropHighLowDuration
+    high_duration_threshold: 16
+    low_duration_threshold: 1
+
+  - _target_: sdp.processors.DropNonAlphabet
+    output_manifest_file: ${final_manifest}
+    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
+
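
The two `DropHighLow*` steps can be pictured as a single predicate over a manifest entry. The sketch below is only an illustration with the thresholds from this MCV config: `keep_entry` is a hypothetical helper, the character rate is assumed to be the number of characters in `text` divided by `duration` in seconds, and treating the bounds as inclusive is also an assumption.

```python
def keep_entry(entry, low_dur=1.0, high_dur=16.0, low_rate=1.0, high_rate=21.0):
    """Return True if a manifest entry passes both filters.

    Assumptions (not confirmed by the config): bounds are inclusive and
    character rate = len(text) / duration.
    """
    duration = entry["duration"]
    if not (low_dur <= duration <= high_dur):
        return False
    char_rate = len(entry["text"]) / duration
    return low_rate <= char_rate <= high_rate


if __name__ == "__main__":
    entries = [
        {"text": "bom dia a todos", "duration": 3.0},  # 5 chars/s, kept
        {"text": "a", "duration": 3.0},                # too few chars/s
        {"text": "x" * 50, "duration": 2.0},           # too many chars/s
        {"text": "bom dia a todos", "duration": 20.0}, # too long
    ]
    for entry in entries:
        print(entry["duration"], keep_entry(entry))
```

A very low character rate usually points at long silences or truncated transcripts, and a very high one at transcripts that do not match the audio, which is why both tails are dropped.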
diff --git a/dataset_configs/portuguese/mls/config.yaml b/dataset_configs/portuguese/mls/config.yaml
@@ -0,0 +1,81 @@
+documentation: |
+  MLS Portuguese
+  ##############
+  This config performs the following data processing.
+
+  1. Downloads and extracts all the Portuguese data from https://www.openslr.org/94/.
+  2. Converts all flac audio files to wav format.
+  3. Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.
+  4. Drops any data with an abnormally high or low duration or character rate.
+  5. Drops any data that contains symbols not in the supported alphabet.
+
+  **Required arguments**.
+
+  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
+  * **data_split**: should be "train", "dev" or "test".
+
+  **Output format**.
+
+  This config dumps the final manifest at the path given by ``final_manifest``, for example ``${workspace_dir}/${data_split}_manifest.json``.
+  The output manifest contains the following fields:
+
+  * **audio_filepath (str)**: relative path to the audio files.
+  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
+  * **duration (float)**: audio duration in seconds.
+
+
+processors_to_run: all
+workspace_dir: ???
+data_split: ???
+final_manifest: ???
+
+processors:
+  - _target_: sdp.processors.CreateInitialManifestMLS
+    output_manifest_file: ${workspace_dir}/mls_portuguese_processed/${data_split}_manifest.json
+    raw_data_dir: ${workspace_dir}
+    language: portuguese
+    resampled_audio_dir: "" # left empty so that the audio is converted by the FfmpegConvert step below
+    data_split: ${data_split}
+
+  - _target_: sdp.processors.FfmpegConvert
+    resampled_audio_dir: ${workspace_dir}/resampled
+    input_field: audio_filepath
+    output_field: audio_filepath
+
+  - _target_: sdp.processors.SubRegex
+    regex_params_list:
+      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
+      - {"pattern": "'", "repl": " "}
+      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
+      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
+      - {"pattern": "!", "repl": "."}
+      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+
+      # remove remaining repeated periods since most of the time they are unnecessary in this data
+      - {"pattern": "\\.{2,20}", "repl": " "}
+
+      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
+      - {"pattern": ' ([Ss])ra\.', "repl" : ' \1enhora'}
+      - {"pattern": ' ([Ss])rta\.', "repl": ' \1enhorita'}
+      - {"pattern": ' ([Ss])r\.', 'repl': ' \1enhor' }
+      - {"pattern": " ([Dd])r ", "repl" : ' \1outor '}
+      - {"pattern": ' ([Dd])r\.', "repl" : ' \1outor '}
+      - {"pattern": " ([Dd])ra ", "repl" : ' \1outora '}
+
+      - {"pattern": " um km ", "repl" : " um quilômetro "}
+      - {"pattern": " km ", "repl" : " quilômetros "}
+  - _target_: sdp.processors.DropHighLowCharrate
+    high_charrate_threshold: 21
+    low_charrate_threshold: 1
+
+  - _target_: sdp.processors.DropHighLowDuration
+    high_duration_threshold: 20
+    low_duration_threshold: 1
+
+  - _target_: sdp.processors.DropNonAlphabet
+    output_manifest_file: ${final_manifest}
+    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
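
For readers unfamiliar with the flac-to-wav step, the conversion can be approximated with a direct `ffmpeg` call. This is a sketch under assumptions, not the `FfmpegConvert` processor itself: the helper names are hypothetical, and the 16 kHz mono target (`-ar 16000 -ac 1`) is a common ASR choice that this commit does not confirm.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, out_dir, sample_rate=16000, channels=1):
    """Build the ffmpeg command that converts `src` to a wav file in `out_dir`.

    The sample rate and channel count are assumed defaults, not values
    taken from the config.
    """
    dst = Path(out_dir) / (Path(src).stem + ".wav")
    cmd = [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", str(channels),     # number of output channels
        str(dst),
    ]
    return cmd, dst


def convert(src, out_dir):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    cmd, dst = build_ffmpeg_command(src, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
    return dst


if __name__ == "__main__":
    command, target = build_ffmpeg_command("data/a/clip1.flac", "resampled")
    print(" ".join(command))
    print(target)
```

Separating command construction from execution keeps the path logic testable on machines where `ffmpeg` is not installed.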
diff --git a/dataset_configs/portuguese/mtedx/config.yaml b/dataset_configs/portuguese/mtedx/config.yaml
@@ -0,0 +1,101 @@
+documentation: |
+  MTEDX Portuguese
+  ################
+  This config performs the following data processing.
+
+  1. Downloads and extracts the Portuguese data from https://www.openslr.org/100/.
+  2. Converts all flac audio files to wav format.
+  3. Splits the audio according to the timestamps in the VTT files.
+  4. Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.
+  5. Drops any data with an abnormally high or low duration or character rate.
+  6. Drops any data that contains symbols not in the supported alphabet.
+
+  **Required arguments**.
+
+  * **workspace_dir**: specify the workspace folder where all audio files will be stored.
+  * **raw_data_dir**: specify the folder into which the data will be downloaded.
+  * **data_split**: should be "train", "valid" or "test".
+
+  **Output format**.
+
+  This config dumps the final manifest at the path given by ``final_manifest``, for example ``${workspace_dir}/${data_split}_manifest.json``.
+  The output manifest contains the following fields:
+
+  * **audio_filepath (str)**: relative path to the audio files.
+  * **text (str)**: transcription, including punctuation ".,?" and capitalization.
+  * **duration (float)**: audio duration in seconds.
+
+
+
+processors_to_run: all
+workspace_dir: ???
+data_split: ???
+final_manifest: ???
+
+
+processors:
+  - _target_: sdp.processors.CreateInitialManifestMTEDX
+    raw_data_dir: ${workspace_dir}/raw_data
+    data_split: ${data_split}
+    language_id: pt
+    already_extracted: False
+    output_manifest_file: ${workspace_dir}/${data_split}_manifest0.json
+
+  - _target_: sdp.processors.FfmpegConvert
+    resampled_audio_dir: ${workspace_dir}/resampled
+    input_field: audio_filepath
+    output_field: audio_filepath
+
+  - _target_: sdp.processors.datasets.commoncrawl.SplitByVttSentence
+    output_manifest_file: ${workspace_dir}/manifest_vtt.json
+    input_manifest_file: ${workspace_dir}/${data_split}_manifest0.json
+    splited_audio_dir: ${workspace_dir}/splited
+    source_audio_field: audio_filepath
+    target_audio_field: audio_filepath
+    duration_field: duration
+    text_field: text
+    vtt_field: vtt_filepath
+    proxy_fields: []
+    duration_threshold: 20.0
+
+  - _target_: sdp.processors.SubRegex
+    regex_params_list:
+      - {"pattern": "(Aplausos)", "repl": " "}
+      - {"pattern": "(Risos)", "repl": " "}
+      - {"pattern": '[\-\‐\‑\–\—\―\"]', "repl": " "}
+      - {"pattern": "'", "repl": " "}
+      - {"pattern": '[\$\&\¡\(\)]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\«\°\´\·\»]', "repl": " "}
+      - {"pattern": '[\‘\’\“\”\„]', "repl": " "}
+      - {"pattern": '[\:\;\`\ʻ]', "repl": " "}
+      - {"pattern": "!", "repl": "."}
+      - {"pattern": "…\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+      - {"pattern": "\\.{2,20}\\s$", "repl": "."} # '\\s' accounts for the fact that SDP inserts spaces at start and end
+
+      # remove remaining repeated periods since most of the time they are unnecessary in this data
+      - {"pattern": "\\.{2,20}", "repl": " "}
+
+      - {"pattern": " ([Pp])rofa ", "repl" : ' \1rofessora '}
+      - {"pattern": ' ([Ss])ra\.', "repl" : ' \1enhora'}
+      - {"pattern": ' ([Ss])rta\.', "repl": ' \1enhorita'}
+      - {"pattern": ' ([Ss])r\.', 'repl': ' \1enhor' }
+      - {"pattern": " ([Dd])r ", "repl" : ' \1outor '}
+      - {"pattern": ' ([Dd])r\.', "repl" : ' \1outor '}
+      - {"pattern": " ([Dd])ra ", "repl" : ' \1outora '}
+
+      - {"pattern": " um km ", "repl" : " um quilômetro "}
+      - {"pattern": " km ", "repl" : " quilômetros "}
+
+  - _target_: sdp.processors.DropHighLowDuration
+    high_duration_threshold: 20
+    low_duration_threshold: 1
+
+  - _target_: sdp.processors.DropHighLowCharrate
+    high_charrate_threshold: 21
+    low_charrate_threshold: 1
+
+  - _target_: sdp.processors.DropNonAlphabet
+    output_manifest_file: ${final_manifest}
+    alphabet: " ÁÃÀÂÇÉÊÍÕÓÔÚÜáãàâçéêíõóôúüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.?"
+
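
Splitting by VTT timestamps depends on reading the cue timings out of each `.vtt` file. The parser below is a minimal sketch of that idea and makes no claim about how `SplitByVttSentence` is implemented: `to_seconds` and `parse_cues` are hypothetical helpers, and only the basic WebVTT cue layout (a `start --> end` line followed by text lines) is handled.

```python
import re

# WebVTT timestamps look like HH:MM:SS.mmm, with the hours part optional.
STAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")


def to_seconds(stamp):
    """Convert a WebVTT timestamp string to seconds."""
    match = STAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        if "-->" in lines[i]:
            start, rest = lines[i].split("-->", 1)
            end = rest.split()[0]  # ignore cue settings after the end time
            i += 1
            text = []
            while i < len(lines) and lines[i].strip():
                text.append(lines[i].strip())
                i += 1
            cues.append((to_seconds(start), to_seconds(end), " ".join(text)))
        else:
            i += 1
    return cues


if __name__ == "__main__":
    sample = (
        "WEBVTT\n\n"
        "00:00:01.000 --> 00:00:03.500\nBom dia a todos.\n\n"
        "00:01:02.250 --> 00:01:04.000 align:start\nObrigado.\n"
    )
    for cue in parse_cues(sample):
        print(cue)
```

Each `(start, end, text)` tuple would then map to one audio segment and one manifest entry, with segments longer than `duration_threshold` handled separately.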
diff --git a/docs/src/sdp/api.rst b/docs/src/sdp/api.rst
@@ -69,6 +69,18 @@ SLR83
 .. autodata:: sdp.processors.CustomDataSplitSLR83
    :annotation:
 
+MTEDx
+'''''
+
+.. autodata:: sdp.processors.CreateInitialManifestMTEDX
+   :annotation:
+
+Coraa
+'''''
+
+.. autodata:: sdp.processors.CreateInitialManifestCORAA
+   :annotation:
+
 .. TODO: Fisher config is not accessible - should we require moving everything to SDP
 ..       Probably need some policy on shat lives in main folder vs configs.
 ..       To control the number of processors we support.