Switch to ICU tokenizer #939

eu9ene · 2024-11-22T21:58:22Z

closes #860

Related OpusTrainer fixes: hplt-project/OpusTrainer#61

gregtatum

I'm requesting changes because of the performance issue in OpusTrainer.

pipeline/alignments/tokenizer.py

docs/opus-trainer.md

pipeline/data/requirements/data.in

gregtatum · 2024-12-04T14:23:47Z

Also, we should do a performance comparison to understand the speed impact of running the new segmenter.

ZJaume

Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

gregtatum · 2024-12-06T19:37:09Z

We can always switch between different tokenizers in the config generator, and do some per-language rules.
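
A minimal sketch of what such per-language switching could look like. The names `DEFAULT_TOKENIZER`, `PER_LANGUAGE_TOKENIZER`, `choose_tokenizer` and `tokenizer_config`, and the placeholder language code `xx`, are assumptions for illustration; they are not taken from the config generator.

```python
from typing import Dict

# Assumed default after this PR; purely illustrative.
DEFAULT_TOKENIZER = "icu"

# Hypothetical per-language exceptions. "xx" is a placeholder code,
# not a claim that any real language needs a different tokenizer.
PER_LANGUAGE_TOKENIZER: Dict[str, str] = {"xx": "moses"}


def choose_tokenizer(lang: str) -> str:
    """Return the tokenizer name for a language, falling back to the default."""
    return PER_LANGUAGE_TOKENIZER.get(lang, DEFAULT_TOKENIZER)


def tokenizer_config(src: str, trg: str) -> Dict[str, str]:
    """Build the tokenizer part of a (hypothetical) generated config."""
    return {"src": choose_tokenizer(src), "trg": choose_tokenizer(trg)}
```

Keeping the exceptions in one table means a language can be moved back to another tokenizer without touching the rest of the pipeline.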

eu9ene · 2024-12-17T20:02:03Z

> Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

I can try adding some of those examples, but tokenization serves a limited purpose here: it's only used for inline noise augmentation and guided alignments, so even if it works slightly worse it probably won't affect quality. Also, I see in your spreadsheet that it's doing OK. The problem with tests is that if one doesn't pass, we can't really do anything about it except file an issue for ICU.

eu9ene · 2024-12-18T23:47:13Z

> Also, we should do a performance comparison to understand the speed impact of running the new segmenter.

I see that it's super fast now (300K sentences per second), so the whole thing took about 10 min for 200M sentences. Test run. I can't find any old tasks because they expired, but it took a while with the fast C++ Moses tokenizer.
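
A rough sketch of how such a throughput figure could be measured and turned into a wall-clock estimate. The helpers `sentences_per_second` and `estimated_minutes` are hypothetical, and `str.split` stands in for the real tokenizer, so the measured rate here says nothing about ICU or Moses.

```python
import time
from typing import Callable, Iterable, List


def sentences_per_second(
    tokenize: Callable[[str], List[str]], sentences: Iterable[str]
) -> float:
    """Time a tokenizer over some sentences and return the throughput."""
    start = time.perf_counter()
    count = 0
    for sentence in sentences:
        tokenize(sentence)
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    return count / elapsed if elapsed > 0 else float("inf")


def estimated_minutes(total_sentences: int, rate: float) -> float:
    """Convert a corpus size and a sentences-per-second rate into minutes."""
    return total_sentences / rate / 60


# Stand-in tokenizer and data, for illustration only.
sample = ["This is a test sentence ."] * 10_000
rate = sentences_per_second(str.split, sample)

# With the numbers quoted in the comment: 200M sentences at 300K per second.
minutes = estimated_minutes(200_000_000, 300_000)
```

At the quoted rate the estimate comes out at roughly 11 minutes, which is consistent with the reported run time.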

eu9ene · 2024-12-19T00:50:00Z

> Overall seems good for me. The only concern I have is how good it would be compared to sacremoses for all the other languages. Maybe we could test or add as tests some of the examples that sacremoses has?

I copied all the test cases from sacremoses, but I used the tokenizer's actual output tokens as the expected values, since the goal here is to observe its behavior rather than to fix it.
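
The approach described above, recording current output as the expected value, is often called a characterization (or snapshot) test. A minimal sketch follows, assuming a regex-based stand-in tokenizer; `tokenize`, `record_snapshot` and `find_changes` are hypothetical names and do not reflect the tests added in this PR.

```python
import re
from typing import Callable, List, Tuple

Snapshot = List[Tuple[str, List[str]]]


def tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def record_snapshot(tokenizer: Callable[[str], List[str]], inputs: List[str]) -> Snapshot:
    """Store whatever the tokenizer produces today as the expected output."""
    return [(text, tokenizer(text)) for text in inputs]


def find_changes(tokenizer: Callable[[str], List[str]], snapshot: Snapshot) -> List[str]:
    """Return the inputs whose tokenization no longer matches the snapshot."""
    return [text for text, expected in snapshot if tokenizer(text) != expected]


inputs = ["Hello, world!", "It's fine."]
snapshot = record_snapshot(tokenize, inputs)
changed = find_changes(tokenize, snapshot)  # empty while behavior is unchanged
```

Tests of this kind do not assert that the tokenization is correct; they only flag when it changes, for example after a library upgrade.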

pipeline/alignments/tokenizer.py

* Add ICU tokenizer
* Use ICU tokenizer in alignments
* Update to OpusTrainer with ICU detokenization support
* Update docs
* Add pyicu pypi package
* Use ICU system package
* Strip new lines
* Refactor abstract classes
* Fix typo
* Add todo with issue link for OpusTrainer package
* Add test cases from sacremoses
* Update to the latest commit
* Relock poetry

eu9ene added 6 commits November 22, 2024 13:56

Add ICU tokenizer

4041e04

Use ICU tokenizer in alignments

070c0d3

Update to OpusTrainer with ICU detokenization support

4000853

Update docs

13a80bf

Add pyicu pypi package

bb7f523

Use ICU system package

d585a63

eu9ene marked this pull request as ready for review November 23, 2024 00:48

eu9ene requested review from a team as code owners November 23, 2024 00:48

eu9ene requested review from hneiva, gregtatum and ZJaume and removed request for a team and hneiva November 23, 2024 00:48

gregtatum requested changes Dec 3, 2024

pipeline/alignments/tokenizer.py (outdated, resolved)

docs/opus-trainer.md (outdated, resolved)

pipeline/data/requirements/data.in (outdated, resolved)

pipeline/data/requirements/data.in (outdated, resolved)

ZJaume requested changes Dec 4, 2024

Strip new lines

c419105

eu9ene added 2 commits December 17, 2024 12:33

Refactor abstract classes

71b6b10

Fix typo

95c5362

eu9ene mentioned this pull request Dec 18, 2024

Add support for ICU tokenizer hplt-project/OpusTrainer#61

Merged

Add todo with issue link for OpusTrainer package

d0f175e

eu9ene added 2 commits December 18, 2024 16:43

Add test cases from sacremoses

2923e72

Update to the latest commit

7176053

eu9ene added 2 commits December 18, 2024 16:47

Merge branch 'refs/heads/main' into icu_tokenizer

8c84582

# Conflicts:
#   poetry.lock

Relock poetry

469e0b9

eu9ene requested review from ZJaume and gregtatum December 19, 2024 00:52

ZJaume approved these changes Dec 19, 2024

gregtatum approved these changes Dec 20, 2024

pipeline/alignments/tokenizer.py (resolved)

Merge branch 'refs/heads/main' into icu_tokenizer

92325fc

# Conflicts:
#   pipeline/alignments/align.py
#   pipeline/train/requirements/train.in
#   pipeline/train/requirements/train.txt

eu9ene merged commit 7d45b3a into main Dec 21, 2024
39 checks passed
