-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to ICU tokenizer #939
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm requesting changes the performance issue in OpusTrainer.
Also, we should do a performance comparison to understand the speed impact of running the new segmenter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?
We can always switch between different tokenizers in the config generator, and do some per-language rules. |
I can try adding some of those examples but tokenization serves a limited purpose here, it's only for inline noise augmentation and guided alignments, so maybe even if it works slightly worse it will not really affect quality. Also, I see in your spreadsheet that it's doing ok. The problem with tests will be that if it doesn't pass we can't really do anything about it except file an issue for ICU. |
I see that it's super fast now (300K sentences per second), so the whole thing took 10 min for 200M sentences. Test run. I can't find any old tasks because they expired but it took a while for Moses fast C++ tokenizer. |
# Conflicts: # poetry.lock
I copied all test cases from sacremoses but I just used output tokens as expected values as the goal here is more to observe its behavior rather than fix it. |
# Conflicts: # pipeline/alignments/align.py # pipeline/train/requirements/train.in # pipeline/train/requirements/train.txt
* Add ICU tokenizer * Use ICU tokenizer in alignments * Update to OpusTrainer with ICU detokenization support * Update docs * Add pyicu pypi package * Use ICU system package * Strip new lines * Refactor abstract classes * Fix typo * Add todo with issue link for OpusTrainer package * Add test cases from sacremoses * Update to the latest commit * Relock poetry
closes #860
Related OpusTrainer fixes: hplt-project/OpusTrainer#61