Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to ICU tokenizer #939

Merged
merged 15 commits into from
Dec 21, 2024
Merged

Switch to ICU tokenizer #939

merged 15 commits into from
Dec 21, 2024

Conversation

eu9ene
Copy link
Collaborator

@eu9ene eu9ene commented Nov 22, 2024

closes #860

Related OpusTrainer fixes: hplt-project/OpusTrainer#61

@eu9ene eu9ene marked this pull request as ready for review November 23, 2024 00:48
@eu9ene eu9ene requested review from a team as code owners November 23, 2024 00:48
@eu9ene eu9ene requested review from hneiva, gregtatum and ZJaume and removed request for a team and hneiva November 23, 2024 00:48
Copy link
Member

@gregtatum gregtatum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm requesting changes the performance issue in OpusTrainer.

@gregtatum
Copy link
Member

Also, we should do a performance comparison to understand the speed impact of running the new segmenter.

Copy link
Collaborator

@ZJaume ZJaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

@gregtatum
Copy link
Member

We can always switch between different tokenizers in the config generator, and do some per-language rules.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Dec 17, 2024

Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

I can try adding some of those examples but tokenization serves a limited purpose here, it's only for inline noise augmentation and guided alignments, so maybe even if it works slightly worse it will not really affect quality. Also, I see in your spreadsheet that it's doing ok. The problem with tests will be that if it doesn't pass we can't really do anything about it except file an issue for ICU.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Dec 18, 2024

Also, we should do a performance comparison to understand the speed impact of running the new segmenter.

I see that it's super fast now (300K sentences per second), so the whole thing took 10 min for 200M sentences. Test run. I can't find any old tasks because they expired but it took a while for Moses fast C++ tokenizer.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Dec 19, 2024

Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

I copied all test cases from sacremoses but I just used output tokens as expected values as the goal here is more to observe its behavior rather than fix it.

@eu9ene eu9ene requested review from ZJaume and gregtatum December 19, 2024 00:52
# Conflicts:
#	pipeline/alignments/align.py
#	pipeline/train/requirements/train.in
#	pipeline/train/requirements/train.txt
@eu9ene eu9ene merged commit 7d45b3a into main Dec 21, 2024
39 checks passed
ZJaume pushed a commit that referenced this pull request Jan 16, 2025
* Add ICU tokenizer

* Use ICU tokenizer in alignments

* Update to OpusTrainer with ICU detokenization support

* Update docs

* Add pyicu pypi package

* Use ICU system package

* Strip new lines

* Refactor abstract classes

* Fix typo

* Add todo with issue link for OpusTrainer package

* Add test cases from sacremoses

* Update to the latest commit

* Relock poetry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate switching to ICU segmenter
3 participants