Karpnv/cc #35

Open
wants to merge 137 commits into
base: main
137 commits
40d271e
commoncrawl
karpnv Sep 12, 2023
c4ea5e4
batch
karpnv Sep 12, 2023
4ebc195
rm filter
karpnv Sep 12, 2023
d9b3473
add caption
karpnv Sep 25, 2023
9a74b30
proxy_fields
karpnv Sep 25, 2023
f2c8f2b
duration_threshold
karpnv Sep 25, 2023
199bc22
big
karpnv Sep 25, 2023
f450f42
small
karpnv Sep 25, 2023
1952828
yaml
karpnv Sep 29, 2023
c9614f8
FfmpegConvert
karpnv Sep 29, 2023
e911070
ASR_HF
karpnv Oct 9, 2023
1097672
args
karpnv Oct 9, 2023
d90cd61
duration_key
karpnv Oct 20, 2023
d3973c8
nfa
karpnv Oct 20, 2023
6170682
source_audio
karpnv Oct 20, 2023
bf5ada0
dsalign
karpnv Oct 23, 2023
075a08a
audio_duration
karpnv Oct 26, 2023
e065608
EvalBandwidth and AlignerSubprocess
karpnv Oct 26, 2023
dd9f260
split CreateInitialManifestCC
karpnv Nov 2, 2023
1282ffb
split CreateInitialManifestCC
karpnv Nov 2, 2023
c1396ad
key_field
karpnv Nov 2, 2023
fbee380
offline_diar_infer
karpnv Nov 8, 2023
7fc4c1e
arm
karpnv Nov 8, 2023
7661d8e
duplicates
karpnv Nov 9, 2023
718a812
drop_text_duplicates
karpnv Nov 9, 2023
8bfdfc9
mcv
karpnv Nov 9, 2023
ad51e5d
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Nov 10, 2023
6b4a9a6
split
karpnv Nov 10, 2023
860ed6a
it nl eu
karpnv Nov 10, 2023
17953c4
TrainDevTestSplitCC
karpnv Nov 13, 2023
b6f16c0
merge
karpnv Nov 13, 2023
b69bfc1
en split
karpnv Nov 15, 2023
6942e61
Merge branch 'karpnv/cc' of github.com:NVIDIA/NeMo-speech-data-proces…
karpnv Nov 15, 2023
3813285
rm pandas
karpnv Nov 16, 2023
5979d5d
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Nov 16, 2023
5d30d6a
text processing for MCV PR
karpnv Nov 17, 2023
aa21b87
path
karpnv Nov 17, 2023
9d5e195
RandomPart
karpnv Nov 21, 2023
ad10221
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Nov 22, 2023
5b7700f
random_state
karpnv Nov 24, 2023
c724601
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Nov 24, 2023
2473855
docstring
karpnv Nov 24, 2023
96dfaed
split common processors
karpnv Nov 28, 2023
424edf7
langs
karpnv Nov 28, 2023
0e2ca51
audio_books
karpnv Nov 28, 2023
293648b
mv
karpnv Nov 28, 2023
970b9e7
mv todata_to_data
karpnv Dec 1, 2023
f8c5961
mv torch
karpnv Dec 1, 2023
a412180
PR comments
karpnv Dec 5, 2023
2208d03
paths
karpnv Dec 5, 2023
460cbbb
rename
karpnv Dec 5, 2023
f3cebd2
import
karpnv Dec 5, 2023
9a8d4f2
docs
karpnv Dec 5, 2023
c3ba8c9
subprocess
karpnv Dec 6, 2023
21005a2
Subprocess
karpnv Dec 11, 2023
9995909
fix docs
karpnv Dec 11, 2023
8cfdf39
CreateInitialManifestByExt doc
karpnv Dec 11, 2023
af7ca03
drop_abs_path
karpnv Dec 11, 2023
289a52f
Merge branch 'karpnv/cc' of github.com:NVIDIA/NeMo-speech-data-proces…
karpnv Dec 11, 2023
c12b732
add lang
karpnv Dec 12, 2023
f310b01
deps
karpnv Dec 14, 2023
7b7df73
PreserveByValue
karpnv Dec 14, 2023
a688b8a
GetSourceFolder
karpnv Dec 14, 2023
c196b50
drop Attributes
karpnv Dec 14, 2023
4f22ff2
args
karpnv Dec 14, 2023
9a9831c
rm methods
karpnv Dec 14, 2023
f869773
rm Note
karpnv Dec 14, 2023
4224052
more fixes
karpnv Dec 15, 2023
8289f82
header
karpnv Dec 18, 2023
bd42a6c
ASRWhisper
karpnv Dec 18, 2023
b2c1f0d
AudioLid args
karpnv Jan 9, 2024
a4d1173
merge
karpnv Jan 9, 2024
8537977
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Jan 9, 2024
7935e44
GetSpecificFiles CopyFiles
karpnv Jan 16, 2024
cdf330a
Merge branch 'karpnv/cc' of github.com:NVIDIA/NeMo-speech-data-proces…
karpnv Jan 16, 2024
39ec1a3
separate dev test
karpnv Jan 18, 2024
980eeb7
rm
karpnv Jan 22, 2024
f39d1d3
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Jan 22, 2024
165c295
black
karpnv Jan 22, 2024
8cd5896
self.torch_dtype
karpnv Jan 22, 2024
43ff82d
mv to cv
karpnv Jan 22, 2024
a24e6c9
mv configs
karpnv Jan 22, 2024
157de3a
rename
karpnv Jan 23, 2024
af69829
ManifestToUtf8
karpnv Jan 26, 2024
4c8a210
black
karpnv Feb 8, 2024
b1b45bc
not in
karpnv Feb 8, 2024
fc30b34
black
karpnv Feb 9, 2024
ee1c52e
add ASRWhisper
karpnv Feb 9, 2024
3dcc2f7
requirements
karpnv Feb 9, 2024
61c8fe7
test audio_books.yaml
karpnv Feb 9, 2024
f7182f2
add docs
karpnv Feb 26, 2024
2d5ee5b
black
karpnv Feb 26, 2024
81716ba
lanID
karpnv Feb 27, 2024
33b4f62
srt
karpnv Feb 27, 2024
e4ebfa7
load_manifest
karpnv Feb 27, 2024
dadf773
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Feb 28, 2024
ab7e1d9
docs
karpnv Feb 29, 2024
e63f250
black
karpnv Feb 29, 2024
714a7d1
key
karpnv Mar 1, 2024
fcee183
nemo file
karpnv Mar 15, 2024
c076199
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Mar 15, 2024
464c44b
black
karpnv Mar 15, 2024
5efdbcd
key style
karpnv Mar 15, 2024
981d5bd
rm PreserveByValue
karpnv Mar 15, 2024
a1e3fab
black
karpnv Mar 15, 2024
ab8c685
rm operator
karpnv Mar 19, 2024
f31f7d1
batch_size > 1
karpnv Mar 19, 2024
b734db4
Merge branch 'armenian' of github.com:NVIDIA/NeMo-speech-data-process…
karpnv Mar 19, 2024
02b35a8
German Youtube with new processors (#49)
ssh-meister Mar 19, 2024
f862b2a
black
karpnv Mar 19, 2024
c1ee056
merge
karpnv Mar 19, 2024
e2fe178
black
karpnv Mar 20, 2024
8f99da0
proxy
karpnv Mar 20, 2024
df15c33
New processors for calculating metrics (#50)
ssh-meister Mar 21, 2024
3434b7c
beamsearch
karpnv May 9, 2024
082d168
yaml
karpnv May 9, 2024
c6fe2a5
chunk_manifest
karpnv May 14, 2024
e68d3fe
get_capitalisation_from_target
karpnv May 16, 2024
0b34b9f
ConcatManifests
karpnv May 17, 2024
4aeb88c
utf8
karpnv May 21, 2024
421bad6
shell bool
karpnv May 27, 2024
a3e56d7
LangIdWhisper
karpnv May 29, 2024
52c8552
black
karpnv May 31, 2024
3614723
Merge branch 'karpnv/cc' of github.com:NVIDIA/NeMo-speech-data-proces…
karpnv May 31, 2024
dc64941
Updated LangIDWhisper processor (#62)
ssh-meister Jun 1, 2024
d16ff14
resolve conflicts
karpnv Jun 13, 2024
47a6875
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Jun 27, 2024
bb28efc
kenlm_path fix
karpnv Sep 4, 2024
7163a88
merge main
karpnv Sep 4, 2024
c71f558
add ApplyLlama3 and pnc pipeline
karpnv Sep 27, 2024
4441a2a
Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…
karpnv Oct 14, 2024
1b8c189
black
karpnv Nov 24, 2024
3810d3b
rm yt'
karpnv Nov 24, 2024
f02f37a
rm sdp/processors/datasets/yt
karpnv Nov 24, 2024
9d492e9
whitespace
karpnv Nov 24, 2024
82f58e0
rm llm
karpnv Nov 24, 2024
e9bb5db
rm extra langs
karpnv Nov 24, 2024
it nl eu
Signed-off-by: Nikolay Karpov <[email protected]>
karpnv committed Nov 10, 2023
commit 860ed6a6f0c60bbb1daa5860ac9215454ef73531
113 changes: 113 additions & 0 deletions dataset_configs/commoncrawl/big_eu.yaml
@@ -0,0 +1,113 @@
processors_to_run: "0:"
workspace_dir: /mnt/md0/common_crawl/cc_sdp/eu

processors:
  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    input_manifest_file: /mnt/md0/common_crawl/cc_sdp/manifest8.json
    output_manifest_file: ${workspace_dir}/manifest0.json
    input_field: audio_lang
    target_value: eu

  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    output_manifest_file: ${workspace_dir}/manifest1.json
    input_field: text_lang
    target_value: eu

  - _target_: sdp.processors.datasets.commoncrawl.ASR_HF
    output_manifest_file: ${workspace_dir}/manifest2.json
    pretrained_model: cahya/wav2vec2-large-xlsr-basque
    output_text_field: pred_text
    batch_size: 16

  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest3.json
    duplicate_fields: {"text":"orig_text"}

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest4.json
    text_key: text
    regex_params_list:
      - {"pattern": '\[(.*?)\]', "repl": ' '}
      - {"pattern": "^[\\s]*\\*(.*?)\\*[\\s]*$", "repl": "\\1"}
      - {"pattern": '‚', "repl": ","}
      - {"pattern": "’", "repl": "'"}
      - {"pattern": "[-–—]", "repl": " "}
      - {"pattern": '―', "repl": "-"}
      - {"pattern": '—', "repl": "-"}
      - {"pattern": '⁺', "repl": "+"}
      - {"pattern": '“', "repl": '"'}
      - {"pattern": '”', "repl": '"'}
      - {"pattern": '…', "repl": '.'}
      - {"pattern": '‘', "repl": "'"}
      - {"pattern": '′', "repl": "'"}
      - {"pattern": '`', "repl": "'"}
      - {"pattern": '⁻', "repl": "-"}
      - {"pattern": '‑', "repl": "-"}
      - {"pattern": '¶', "repl": ' '}
      - {"pattern": '«', "repl": '"'}
      - {"pattern": '»', "repl": '"'}
      - {"pattern": '„', "repl": '"'}
      - {"pattern": '®', "repl": ' '}
      - {"pattern": '@', "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropHighLowWordrate
    output_manifest_file: ${workspace_dir}/manifest5.json
    high_wordrate_threshold: 100
    low_wordrate_threshold: 0.01

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest6.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest7.json
    text_key: text
    regex_params_list:
      - {"pattern": "^\\s*'+\\s(.*?)\\s*'+\\s*$", "repl": "\\1"}
      - {"pattern": "^\\s*'*\\s*", "repl": ""}
      - {"pattern": "'{2,}", "repl": "'"}
      - {"pattern": '!', "repl": '.'}
      - {"pattern": '\s(\\x[a-h][0-9]){1,}\s', "repl": ' '}
      - {"pattern": '(\\x[a-h][0-9]){1,}', "repl": ''}
      - {"pattern": '\.{3}', "repl": '.'}
      - {"pattern": '\$', "repl": ""}
      - {"pattern": "[^a-zA-ZóÓáÁéÉíÍñÑúÚüÜçÇ'.,?]", "repl": " "}
      - {"pattern": '  ', "repl": " "}

  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest8.json
    duplicate_fields: {"text":"text_pc"}

  - _target_: sdp.processors.SubMakeLowercase
    output_manifest_file: ${workspace_dir}/manifest9.json
    text_key: text

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest10.json
    text_key: text
    regex_params_list:
      - {"pattern": "[\\?\\.]", "repl": " "}
      - {"pattern": ",", "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest11.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.DropHighWER
    output_manifest_file: ${workspace_dir}/manifest12.json
    text_key: text
    pred_text_key: pred_text
    wer_threshold: 75

  - _target_: sdp.processors.DropHighCER
    output_manifest_file: ${workspace_dir}/manifest13.json
    text_key: text
    pred_text_key: pred_text
    cer_threshold: 30

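Review note: the `regex_params_list` entries in `big_eu.yaml` read as an ordered chain of substitutions. The sketch below illustrates that reading in plain Python. It assumes each rule is applied in list order with `re.sub`, which this diff does not confirm; `apply_rules` and the shortened `RULES` list are hypothetical names for illustration, not part of SDP.

```python
import re

# A shortened copy of some rules from the config, for illustration only.
RULES = [
    {"pattern": r"\[(.*?)\]", "repl": " "},  # drop bracketed annotations
    {"pattern": "’", "repl": "'"},           # curly apostrophe -> ASCII
    {"pattern": "[-–—]", "repl": " "},       # hyphen, en dash, em dash -> space
    {"pattern": "—", "repl": "-"},           # em dash -> hyphen (runs after the rule above)
    {"pattern": "…", "repl": "."},           # ellipsis character -> period
    {"pattern": "  ", "repl": " "},          # two spaces -> one space
]


def apply_rules(text, rules):
    """Apply each substitution in list order (assumed behaviour)."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["repl"], text)
    return text


print(apply_rules("Bai… etxe–kalea", RULES))  # prints "Bai. etxe kalea"
```

Under this assumption, order matters: because `[-–—]` already turns every em dash into a space, the later `'—'` rule would never find a match. The same would hold for the identical rule pairs in the other two configs.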
150 changes: 150 additions & 0 deletions dataset_configs/commoncrawl/big_it.yaml
@@ -0,0 +1,150 @@
processors_to_run: "0:"
workspace_dir: /mnt/md0/common_crawl/cc_sdp/it

processors:
  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    input_manifest_file: /mnt/md0/common_crawl/cc_sdp/manifest8.json
    output_manifest_file: ${workspace_dir}/manifest0.json
    input_field: audio_lang
    target_value: it

  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    output_manifest_file: ${workspace_dir}/manifest1.json
    input_field: text_lang
    target_value: it

  - _target_: sdp.processors.ASRInference
    output_manifest_file: ${workspace_dir}/manifest2.json
    pretrained_model: nvidia/stt_it_fastconformer_hybrid_large_pc
    batch_size: 64

  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest3.json
    duplicate_fields: {"text":"orig_text"}

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest4.json
    text_key: text
    regex_params_list:
      - {"pattern": '\[(.*?)\]', "repl": ' '}
      - {"pattern": "^[\\s]*\\*(.*?)\\*[\\s]*$", "repl": "\\1"}
      - {"pattern": '‚', "repl": ","}
      - {"pattern": "’", "repl": "'"}
      - {"pattern": "[-–—]", "repl": " "}
      - {"pattern": '―', "repl": "-"}
      - {"pattern": '—', "repl": "-"}
      - {"pattern": '⁺', "repl": "+"}
      - {"pattern": '“', "repl": '"'}
      - {"pattern": '”', "repl": '"'}
      - {"pattern": '…', "repl": '.'}
      - {"pattern": '‘', "repl": "'"}
      - {"pattern": '′', "repl": "'"}
      - {"pattern": '`', "repl": "'"}
      - {"pattern": '⁻', "repl": "-"}
      - {"pattern": '‑', "repl": "-"}
      - {"pattern": '¶', "repl": ' '}
      - {"pattern": '«', "repl": '"'}
      - {"pattern": '»', "repl": '"'}
      - {"pattern": '„', "repl": '"'}
      - {"pattern": '®', "repl": ' '}
      - {"pattern": '•', "repl": " "}
      - {"pattern": '●', "repl": " "}
      - {"pattern": '@', "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropHighLowWordrate
    output_manifest_file: ${workspace_dir}/manifest5.json
    text_key: text
    high_wordrate_threshold: 100
    low_wordrate_threshold: 0.01

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest6.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.datasets.commoncrawl.Subprocess
    output_manifest_file: ${workspace_dir}/manifest7.json
    input_manifest_arg: "--manifest"
    output_manifest_arg: "--output_filename"
    arg_separator: "="
    cmd: "python /home/nkarpov/workspace/NeMo-text-processing/nemo_text_processing/text_normalization/normalize_with_audio.py \
      --language=es --n_jobs=-1 --batch_size=600 --manifest_text_field=text --cache_dir=${workspace_dir}/cache \
      --whitelist=/home/nkarpov/workspace/NeMo-text-processing/nemo_text_processing/text_normalization/it/data/whitelist.tsv"
    # --overwrite_cache

  - _target_: sdp.processors.RenameFields
    output_manifest_file: ${workspace_dir}/manifest8.json
    rename_fields: {"normalized":"text"}

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest9.json
    text_key: text
    regex_params_list:
      - {"pattern": "^\\s*'+\\s(.*?)\\s*'+\\s*$", "repl": "\\1"}
      - {"pattern": "^\\s*'*\\s*", "repl": ""}
      - {"pattern": "'{2,}", "repl": "'"}
      - {"pattern": '!', "repl": '.'}
      - {"pattern": '\s(\\x[a-h][0-9]){1,}\s', "repl": ' '}
      - {"pattern": '(\\x[a-h][0-9]){1,}', "repl": ''}
      - {"pattern": '\.{3}', "repl": '.'}
      - {"pattern": '\$', "repl": ""}
      - {"pattern": "[^a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ'.,?]", "repl": " "}
      - {"pattern": '  ', "repl": " "}
    test_cases:
      - {input: {text: "' jupiter and venus both shining in the golden rosy sky"}, output: {text: "jupiter and venus both shining in the golden rosy sky"}}
      - {input: {text: "' may all the gold i have ever dreamed of be yours '"}, output: {text: "may all the gold i have ever dreamed of be yours"}}
      - {input: {text: "''cause it''s an adult novel versus ya"}, output: {text: "cause it's an adult novel versus ya"}}


  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest10.json
    duplicate_fields: {"text":"text_pc"}

  - _target_: sdp.processors.SubMakeLowercase
    output_manifest_file: ${workspace_dir}/manifest11.json
    text_key: text

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest12.json
    text_key: text
    regex_params_list:
      - {"pattern": "[\\?\\.]", "repl": " "}
      - {"pattern": ",", "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest13.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest14.json
    duplicate_fields: {"pred_text":"pred_text_pc"}

  - _target_: sdp.processors.SubMakeLowercase
    output_manifest_file: ${workspace_dir}/manifest15.json
    text_key: pred_text

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest16.json
    text_key: pred_text
    regex_params_list:
      - {"pattern": "[\\?\\.]", "repl": " "}
      - {"pattern": ",", "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropHighWER
    output_manifest_file: ${workspace_dir}/manifest17.json
    text_key: text
    pred_text_key: pred_text
    wer_threshold: 75

  - _target_: sdp.processors.DropHighCER
    output_manifest_file: ${workspace_dir}/manifest18.json
    text_key: text
    pred_text_key: pred_text
    cer_threshold: 30

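Review note: each config ends with `DropHighWER` (`wer_threshold: 75`) and `DropHighCER` (`cer_threshold: 30`), comparing `text` against `pred_text`. The sketch below shows one way such a filter could work. It assumes the thresholds are percentages and that an entry is kept when its rate does not exceed the threshold; neither assumption is confirmed by this diff. All function names are hypothetical and do not reflect SDP internals.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_percent(text, pred_text):
    """Word error rate in percent, relative to the reference length."""
    ref = text.split()
    return 100.0 * edit_distance(ref, pred_text.split()) / max(len(ref), 1)


def cer_percent(text, pred_text):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(text, pred_text) / max(len(text), 1)


def keep(entry, wer_threshold=75, cer_threshold=30):
    """Keep an entry only if both rates are within their thresholds (assumed rule)."""
    return (wer_percent(entry["text"], entry["pred_text"]) <= wer_threshold
            and cer_percent(entry["text"], entry["pred_text"]) <= cer_threshold)
```

With these definitions, an entry whose transcript and prediction agree is kept, while an entry whose prediction shares no words with the transcript has a WER at or above 100 and is dropped.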
128 changes: 128 additions & 0 deletions dataset_configs/commoncrawl/big_nl.yaml
@@ -0,0 +1,128 @@
processors_to_run: "0:"
workspace_dir: /mnt/md0/common_crawl/cc_sdp/nl

processors:
  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    input_manifest_file: /mnt/md0/common_crawl/cc_sdp/manifest8.json
    output_manifest_file: ${workspace_dir}/manifest0.json
    input_field: audio_lang
    target_value: nl

  - _target_: sdp.processors.datasets.commoncrawl.PreserveByValue
    output_manifest_file: ${workspace_dir}/manifest1.json
    input_field: text_lang
    target_value: nl

  - _target_: sdp.processors.datasets.commoncrawl.ASR_HF
    output_manifest_file: ${workspace_dir}/manifest2.json
    pretrained_model: jonatasgrosman/wav2vec2-large-xlsr-53-dutch
    output_text_field: pred_text
    batch_size: 16

  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest3.json
    duplicate_fields: {"text":"orig_text"}

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest4.json
    text_key: text
    regex_params_list:
      - {"pattern": '\[(.*?)\]', "repl": ' '}
      - {"pattern": "^[\\s]*\\*(.*?)\\*[\\s]*$", "repl": "\\1"}
      - {"pattern": 'î', "repl": "i"}
      - {"pattern": 'ì', "repl": "i"}
      - {"pattern": 'è', "repl": "e"}
      - {"pattern": 'È', "repl": "E"}
      - {"pattern": 'ù', "repl": "u"}
      - {"pattern": 'ò', "repl": "o"}
      - {"pattern": 'à', "repl": "a"}
      - {"pattern": '‚', "repl": ","}
      - {"pattern": "’", "repl": "'"}
      - {"pattern": "[-–—]", "repl": " "}
      - {"pattern": '―', "repl": "-"}
      - {"pattern": '—', "repl": "-"}
      - {"pattern": '⁺', "repl": "+"}
      - {"pattern": '“', "repl": '"'}
      - {"pattern": '”', "repl": '"'}
      - {"pattern": '…', "repl": '.'}
      - {"pattern": '‘', "repl": "'"}
      - {"pattern": '′', "repl": "'"}
      - {"pattern": '`', "repl": "'"}
      - {"pattern": '⁻', "repl": "-"}
      - {"pattern": '‑', "repl": "-"}
      - {"pattern": '¶', "repl": ' '}
      - {"pattern": '«', "repl": '"'}
      - {"pattern": '»', "repl": '"'}
      - {"pattern": '„', "repl": '"'}
      - {"pattern": '®', "repl": ' '}
      - {"pattern": '•', "repl": " "}
      - {"pattern": '●', "repl": " "}
      - {"pattern": '@', "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropHighLowWordrate
    output_manifest_file: ${workspace_dir}/manifest5.json
    text_key: text
    high_wordrate_threshold: 100
    low_wordrate_threshold: 0.01

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest6.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest7.json
    text_key: text
    regex_params_list:
      - {"pattern": "^\\s*'+\\s(.*?)\\s*'+\\s*$", "repl": "\\1"}
      - {"pattern": "^\\s*'*\\s*", "repl": ""}
      - {"pattern": "'{2,}", "repl": "'"}
      - {"pattern": '!', "repl": '.'}
      - {"pattern": '\s(\\x[a-h][0-9]){1,}\s', "repl": ' '}
      - {"pattern": '(\\x[a-h][0-9]){1,}', "repl": ''}
      - {"pattern": '\.{3}', "repl": '.'}
      - {"pattern": '\$', "repl": ""}
      - {"pattern": "[^a-zA-ZóÓáÁéÉíÍúÚöÖäÄëËïÏüÜ'.,?]", "repl": " "}
      - {"pattern": '  ', "repl": " "}
    test_cases:
      - {input: {text: "' jupiter and venus both shining in the golden rosy sky"}, output: {text: "jupiter and venus both shining in the golden rosy sky"}}
      - {input: {text: "' may all the gold i have ever dreamed of be yours '"}, output: {text: "may all the gold i have ever dreamed of be yours"}}
      - {input: {text: "''cause it''s an adult novel versus ya"}, output: {text: "cause it's an adult novel versus ya"}}


  - _target_: sdp.processors.DuplicateFields
    output_manifest_file: ${workspace_dir}/manifest8.json
    duplicate_fields: {"text":"text_pc"}

  - _target_: sdp.processors.SubMakeLowercase
    output_manifest_file: ${workspace_dir}/manifest9.json
    text_key: text

  - _target_: sdp.processors.SubRegex
    output_manifest_file: ${workspace_dir}/manifest10.json
    text_key: text
    regex_params_list:
      - {"pattern": "[\\?\\.]", "repl": " "}
      - {"pattern": ",", "repl": " "}
      - {"pattern": "  ", "repl": " "}

  - _target_: sdp.processors.DropIfRegexMatch
    output_manifest_file: ${workspace_dir}/manifest11.json
    text_key: text
    regex_patterns:
      - "^\\s*$"

  - _target_: sdp.processors.DropHighWER
    output_manifest_file: ${workspace_dir}/manifest12.json
    text_key: text
    pred_text_key: pred_text
    wer_threshold: 75

  - _target_: sdp.processors.DropHighCER
    output_manifest_file: ${workspace_dir}/manifest13.json
    text_key: text
    pred_text_key: pred_text
    cer_threshold: 30

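Review note: all three configs keep a punctuated, cased copy of the transcript in `text_pc` before lowercasing `text` and stripping punctuation for the WER/CER comparison. The sketch below mirrors the `DuplicateFields`, `SubMakeLowercase` and `SubRegex` sequence in plain Python. It assumes the rules run in the listed order and that the final whitespace rule collapses two spaces into one; `split_pc` is a hypothetical name, not an SDP function.

```python
import re


def split_pc(entry):
    """Keep a punctuated/cased copy in text_pc, then normalise text."""
    out = dict(entry)
    out["text_pc"] = out["text"]         # DuplicateFields: {"text": "text_pc"}
    text = out["text"].lower()           # SubMakeLowercase on text
    text = re.sub(r"[\?\.]", " ", text)  # SubRegex: drop ? and .
    text = re.sub(",", " ", text)        # SubRegex: drop commas
    text = re.sub("  ", " ", text)       # SubRegex: two spaces -> one
    out["text"] = text
    return out


result = split_pc({"text": "Hallo, wereld."})
print(repr(result["text"]), repr(result["text_pc"]))
```

Note that, under these assumptions, a sentence-final period leaves a trailing space in `text`, since none of the listed rules trims the ends of the string.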