Fix French subtitles + refactor conversion script (#431)

* Fix subtitles and scripts
* Fix subtitle
huggingface · Dec 28, 2022 · af0c221
1 parent 78a3576

Showing 9 changed files with 80 additions and 71 deletions.
diff --git a/subtitles/README.md b/subtitles/README.md
@@ -28,15 +28,29 @@ python utils/generate_subtitles.py --language zh-CN --youtube_language_code zh-H
 
 Once you have the `.srt` files you can manually fix any translation errors and then open a pull request with the new files.
 
-# How to convert bilingual subtitle to monolingual subtitle
+# Convert bilingual subtitles to monolingual subtitles
 
-# Logic
+In some SRT files, the English caption line is conventionally placed at the last line of each subtitle block to enable easier comparison when correcting the machine translation.
 
-The english caption line is conventionally placed at the last line of each subtitle block in srt files. So removing the last line of each subtitle block would make the bilingual subtitle a monolingual subtitle. 
+For example, in the `zh-CN` subtitles, each block has the following format:
 
-# Usage
-> python3 convert_bilingual_monolingual.py -i \<input_file\> -o \<output_file\>
+```
+1
+00:00:05,850 --> 00:00:07,713
+- 欢迎来到 Hugging Face 课程。
+- Welcome to the Hugging Face Course.
+```
+
+To upload the SRT file to YouTube, we need the subtitle in monolingual format, i.e. the above block should read:
+
+```
+1
+00:00:05,850 --> 00:00:07,713
+- 欢迎来到 Hugging Face 课程。
+```
 
-**Example**
-* For instance, the input file name is "test.cn.en.srt", and you name your output file as "output_test.cn.srt" *
-> python3 convert_bilingual_monolingual.py -i test.cn.en.srt -o output_test.cn.srt
+To handle this, we provide a script that converts the bilingual SRT files to monolingual ones. To perform the conversion, run:
+
+```bash
+python utils/convert_bilingual_monolingual.py --input_language_folder subtitles/LANG_ID --output_language_folder tmp-subtitles
+```
diff --git a/subtitles/en/00_welcome-to-the-hugging-face-course.srt b/subtitles/en/00_welcome-to-the-hugging-face-course.srt
@@ -1,6 +1,6 @@
 1
 00:00:05,850 --> 00:00:07,713
-- Welcome to the Hugging Face Course.
+Welcome to the Hugging Face Course.
 
 2
 00:00:08,550 --> 00:00:10,320

diff --git a/subtitles/fr/00_welcome-to-the-hugging-face-course.srt b/subtitles/fr/00_welcome-to-the-hugging-face-course.srt
@@ -7,7 +7,7 @@ Bienvenue au cours d'Hugging Face.
 Ce cours a été conçu pour vous enseigner tout ce qu'il faut savoir à propos de l'écosystème d'Hugging Face.
 
 3
-0:00:12.559,0:00:18.080
+0:00:12.559 --> 0:00:18.080
 Comment utiliser le Hub de jeux de données et de modèles ainsi que toutes nos bibliothèques open source.
 
 4
@@ -27,7 +27,7 @@ La première vous apprendra les bases sur comment utiliser un transformer finetu
 La deuxième est une plongée plus profonde dans nos bibliothèques et vous apprendra à aborder n'importe quelle tâche de NLP.
 
 8
-0:00:42.079,0:00:48.320
+0:00:42.079 --> 0:00:48.320
 Nous travaillons activement sur la dernière partie et nous espérons qu'elle sera prête pour le printemps 2022.
 
 9
@@ -39,7 +39,7 @@ Le premier chapitre ne requiert aucune connaissance et constitue une bonne intro
 Les chapitres suivants nécessitent une bonne connaissance de Python et quelques notions de base de l'apprentissage automatique et de l'apprentissage profond.
 
 11
-0:01:04.159,0:01:09.840
+0:01:04.159 --> 0:01:09.840
 Si vous ne savez pas ce qu'un entraînement et une validation sont ou encore ce qu'une descente de gradient signifie,
 
 12
@@ -59,15 +59,15 @@ Chaque partie abordée dans ce cours a une version dans ces deux frameworks. Vou
 Voici l'équipe qui a développé ce cours. Je vais maintenant laisser chacun des intervenants se présenter brièvement.
 
 16
-0:01:37.119,0:01:41.000
+0:01:37.119 --> 0:01:41.000
 Bonjour, je m'appelle Matthew et je suis ingénieur en apprentissage machine chez Hugging Face.
 
 17
 0:01:41.000 --> 0:01:47.119
 Je travaille dans l'équipe open source et je suis responsable de la maintenance en particulier des codes en TensorFlow.
 
 18
-0:01:47.119,0:01:52.960
+0:01:47.119 --> 0:01:52.960
 Auparavant, j'étais ingénieur en apprentissage automatique chez Parse.ly qui a récemment été acquis par Automattic.
 
 19
@@ -79,15 +79,15 @@ Avant cela j'étais chercheur en post-doc au Trinity College Dublin en Irlande,
 Bonjour, je suis Lysandre, je suis ingénieur en apprentissage automatique chez Hugging Face et je fais spécifiquement partie de l'équipe open source.
 
 21
-0:02:08.479,0:02:18.080
+0:02:08.479 --> 0:02:18.080
 Je suis à Hugging Face depuis quelques années maintenant et aux côtés des membres de mon équipe j'ai travaillé sur la plupart des outils que vous verrez dans ce cours.
 
 22
 0:02:18.080 --> 0:02:25.599
 Bonjour, je m'appelle Sylvain, je suis ingénieur de recherche chez Hugging Face et l'un des principaux mainteneurs de la bibliothèque Transformers.
 
 23
-0:02:25.599,0:02:32.000
+0:02:25.599 --> 0:02:32.000
 Auparavant, j'ai travaillé chez Fast.ai où j'ai aidé à développer la bibliothèque Fastai ainsi que le livre en ligne.
 
 24
@@ -151,5 +151,5 @@ Dans une vie antérieure, j'étais physicien théoricien et je faisais des reche
 Je m'appelle Leandro et je suis ingénieur en apprentissage automatique dans le domaine de l'équipe open source d'Hugging Face.
 
 39
-0:04:20.799,0:04:28.680
+0:04:20.799 --> 0:04:28.680
 Avant de rejoindre Hugging Face, j'ai travaillé comme data scientist en Suisse et j'ai enseigné la science des données à l'université.
diff --git a/subtitles/fr/03_what-is-transfer-learning.srt b/subtitles/fr/03_what-is-transfer-learning.srt
@@ -124,7 +124,7 @@ OpenAI a également étudié le biais de prédiction de son modèle GPT-3
 0:03:35.840,0:03:39.519
 qui a été pré-entrainé en utilisant l'objectif de deviner le mot suivant.
 
-0:03:39.5190:03:50.000
+0:03:39.519,0:03:50.000
 En changeant le genre du prompt de « Il était très » à « Elle était très », les prédictions majoritairement neutres sont devenues presque uniquement axées sur le physique.
 
 0:03:50.000,0:03:59.640

diff --git a/subtitles/fr/68_data-collators-a-tour.srt b/subtitles/fr/68_data-collators-a-tour.srt
@@ -226,8 +226,10 @@ Mais l'assembleur de données pour la modélisation du langage le fera pour vous
 00:05:57.680 --> 00:05:59.280
 Et c'est tout.
 
+60
 00:05:59.280 --> 00:06:02.560
 Ceci couvre donc les assembleurs de données les plus couramment utilisés et les tâches pour lesquelles ils sont utilisés.
 
+61
 00:06:02.560 --> 00:06:08.720
 Nous espérons que vous savez maintenant quand utiliser les assembleurs de données et lequel choisir pour votre tâche spécifique.
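
The French fixes in this commit fall into two groups: timing lines where the ` --> ` separator was missing or malformed, and cues that lacked an index line. Problems of this kind can be caught before a pull request with a small check. The following is a minimal sketch, not part of this commit: `find_srt_problems` is a hypothetical helper, and it assumes the target layout is an index line followed by a `start --> end` timing line, with either `,` or `.` before the milliseconds.

```python
import re
from typing import List

# Assumed timestamp shape: H:MM:SS.mmm or HH:MM:SS,mmm (hypothetical, chosen to cover both styles in the repo).
TIMESTAMP = r"\d{1,2}:\d{2}:\d{2}[.,]\d{3}"
TIMING_START = re.compile(rf"^{TIMESTAMP}")
TIMING_OK = re.compile(rf"^{TIMESTAMP} --> {TIMESTAMP}\s*$")


def find_srt_problems(text: str) -> List[str]:
    """Return a description of every timing line that looks malformed or lacks a cue index."""
    problems = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        # Only lines that begin with a timestamp are treated as timing lines.
        if not TIMING_START.match(line):
            continue
        if not TIMING_OK.match(line):
            problems.append(f"line {number}: malformed timing line {line!r}")
        # The line directly above a timing line should be the numeric cue index.
        previous = lines[number - 2].strip() if number > 1 else ""
        if not previous.isdigit():
            problems.append(f"line {number}: timing line is not preceded by a cue index")
    return problems
```

Running it over the text of each `.srt` file and printing the returned list would flag lines such as `0:00:12.559,0:00:18.080`, both for the separator and, where applicable, for the missing index.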
diff --git a/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt b/subtitles/zh-CN/00_welcome-to-the-hugging-face-course.srt
@@ -1,7 +1,7 @@
 1
 00:00:05,850 --> 00:00:07,713
-- 欢迎来到 Hugging Face 课程。
-- Welcome to the Hugging Face Course.
+欢迎来到 Hugging Face 课程。
+Welcome to the Hugging Face Course.
 
 2
 00:00:08,550 --> 00:00:10,320

diff --git a/utils/convert_bilingual_monolingual.py b/utils/convert_bilingual_monolingual.py
@@ -1,61 +1,53 @@
-#!/usr/bin/python3
-import getopt
 import re
-import sys
+import argparse
+from pathlib import Path
 
-PATTERN_TIMESTAMP = re.compile('^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]')
-PATTERN_NUM = re.compile('\\d+')
+PATTERN_TIMESTAMP = re.compile(
+    "^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]"
+)
+PATTERN_NUM = re.compile("\\d+")
 
 
-def main(argv):
-    inputfile = ''
-    outputfile = ''
-    try:
-        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
-    except getopt.GetoptError:
-        print('srt_worker.py -i <inputfile> -o <outputfile>')
-        sys.exit(2)
-    for opt, arg in opts:
-          if opt == '-h':
-             print( 'Usage: convert_bilingual_monolingual.py -i <inputfile> -o <outputfile>')
-             sys.exit(-2)
-          elif opt in ("-i", "--ifile"):
-             inputfile = arg
-          elif opt in ("-o", "--ofile"):
-             outputfile = arg
-
-    if not inputfile:
-        print('no input file is specified.\nUsage: convert_bilingual_monolingual.py -i <inputfile> -o <outputfile>')
-    elif not outputfile:
-        print('no output file is specified.\nUsage: convert_bilingual_monolingual.py -i <inputfile> -o <outputfile>')
-    else:
-        process(inputfile, outputfile)
-
-
-def process(input_file, output):
+def convert(input_file, output_file):
     """
-    Convert bilingual caption file to monolingual caption, supported caption file type is srt.
+    Convert bilingual caption file to monolingual caption. Supported caption file type is SRT.
     """
     line_count = 0
     with open(input_file) as file:
-        with open(output, 'a') as output:
+        with open(output_file, "w") as output_file:
             for line in file:
                 if line_count == 0:
                     line_count += 1
-                    output.write(line)
+                    output_file.write(line)
                 elif PATTERN_TIMESTAMP.match(line):
                     line_count += 1
-                    output.write(line)
-                elif line == '\n':
+                    output_file.write(line)
+                elif line == "\n":
                     line_count = 0
-                    output.write(line)
+                    output_file.write(line)
                 else:
                     if line_count == 2:
-                        output.write(line)
+                        output_file.write(line)
                     line_count += 1
-        output.close()
-        print('conversion completed!')
+        output_file.close()
 
 
 if __name__ == "__main__":
-    main(sys.argv[1:])
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--input_language_folder", type=str, help="Folder with input bilingual SRT files to be converted"
+    )
+    parser.add_argument(
+        "--output_language_folder",
+        type=str,
+        default="tmp-subtitles",
+        help="Folder to store converted monolingual SRT files",
+    )
+    args = parser.parse_args()
+
+    output_path = Path(args.output_language_folder)
+    output_path.mkdir(parents=True, exist_ok=True)
+    input_files = Path(args.input_language_folder).glob("*.srt")
+    for input_file in input_files:
+        convert(input_file, output_path / input_file.name)
+    print(f"Succesfully converted {len(list(input_files))} files to {args.output_language_folder} folder")
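
One detail in the new `__main__` block is worth noting: `Path.glob` returns a generator, so by the time the final `print` evaluates `len(list(input_files))` the `for` loop has already consumed it and the reported count is 0. A minimal sketch of one way to avoid this is to materialise the glob into a list before looping. The function name `convert_folder` and its `convert` parameter are hypothetical and used only for illustration; they are not part of this commit.

```python
from pathlib import Path
from typing import Callable


def convert_folder(input_folder: str, output_folder: str, convert: Callable[[Path, Path], None]) -> int:
    """Apply `convert` to every .srt file in input_folder and return how many were processed."""
    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)
    # Materialise the generator so it can be both iterated and counted.
    input_files = sorted(Path(input_folder).glob("*.srt"))
    for input_file in input_files:
        convert(input_file, output_path / input_file.name)
    return len(input_files)
```

The returned count can then be used in the success message in place of `len(list(input_files))`.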
diff --git a/utils/generate_subtitles.py b/utils/generate_subtitles.py
@@ -7,14 +7,13 @@
 import argparse
 import sys
 
-def generate_subtitles(language: str, youtube_language_code: str=None):
+
+def generate_subtitles(language: str, youtube_language_code: str = None):
     metadata = []
     formatter = SRTFormatter()
     path = Path(f"subtitles/{language}")
     path.mkdir(parents=True, exist_ok=True)
-    playlist_videos = Playlist.getVideos(
-        "https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o"
-    )
+    playlist_videos = Playlist.getVideos("https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o")
 
     for idx, video in enumerate(playlist_videos["videos"]):
         video_id = video["id"]
@@ -34,7 +33,9 @@ def generate_subtitles(language: str, youtube_language_code: str=None):
         # Map mismatched language codes
         if language not in languages:
             if youtube_language_code is None:
-                raise ValueError(f"Language code {language} not found in YouTube's list of supported language: {languages}. Please provide a value for `youtube_language_code` and try again.")
+                raise ValueError(
+                    f"Language code {language} not found in YouTube's list of supported language: {languages}. Please provide a value for `youtube_language_code` and try again."
+                )
             language_code = youtube_language_code
         else:
             language_code = language
@@ -55,10 +56,11 @@ def generate_subtitles(language: str, youtube_language_code: str=None):
     df = pd.DataFrame(metadata)
     df.to_csv(f"subtitles/{language}/metadata.csv", index=False)
 
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--language", type=str, help="Language to generate subtitles for")
     parser.add_argument("--youtube_language_code", type=str, help="YouTube language code")
     args = parser.parse_args()
     generate_subtitles(args.language, args.youtube_language_code)
-    print(f"All done! Subtitles stored at subtitles/{args.language}")
+    print(f"All done! Subtitles stored at subtitles/{args.language}")
diff --git a/utils/validate_translation.py b/utils/validate_translation.py
@@ -6,10 +6,9 @@
 
 PATH_TO_COURSE = Path("chapters/")
 
+
 def load_sections(language: str):
-    toc = yaml.safe_load(
-        open(os.path.join(PATH_TO_COURSE / language, "_toctree.yml"), "r")
-    )
+    toc = yaml.safe_load(open(os.path.join(PATH_TO_COURSE / language, "_toctree.yml"), "r"))
     sections = []
     for chapter in toc:
         for section in chapter["sections"]:
@@ -35,4 +34,4 @@ def load_sections(language: str):
         for section in missing_sections:
             print(section)
     else:
-        print("✅ No missing sections - translation complete!")
+        print("✅ No missing sections - translation complete!")