# Calmux Text Segmenter

Calmux Text Segmenter is an unsupervised text tokenizer for machine translation. It factors out shared attributes of words, such as casing or spacing, so that they are represented separately from the words themselves. The segmenter is directly compatible with the Marian Neural Machine Translation Toolkit. To use it with other toolkits, you would need to implement a parser for its output format, modify the embedding lookup, and, to handle factors on the target side, modify the beam decoder.

In addition to segmenting words into subwords or word pieces, the Calmux Text Segmenter represents spacing and capitalization as separate factors.

### Key Characteristics of the Calmux Text Segmenter

- Words and tokens are represented as tuples of factors, which makes parameter sharing possible.
- Infrequent words are represented by subwords, using the SentencePiece library.
- Special treatment of numerals: each digit is its own token, in any writing system.
- Support for phrase fixing, that is, translating specific phrases in specific ways.
- Unusual characters, such as rare emojis, are encoded by their Unicode character code.
- Round-trippable: the source sentence can be fully reconstructed from the factored (sub)word representation.
- Support for continuous scripts, which have different rules for spacing, and for combining marks.

On Linux, first install the following dependencies:

```
sudo apt-get install dotnet-sdk-3.1
sudo apt-get install dotnet-runtime-3.1
```

You also need to install SentencePiece from source. The segmenter can be used both from C# and as a Linux command-line tool.

### How to Contribute?

We welcome contributions and suggestions. Before contributing, you need to agree to a Contributor License Agreement (CLA), which establishes the legal basis for accepting your contribution. When you open a pull request, a CLA bot automatically determines whether you need to provide a CLA and gives you the appropriate instructions. You only need to do this once across all repos that use our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ, or contact Microsoft with any additional questions or comments.
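
### Illustrative Sketch of a Factored Representation

The ideas of representing tokens as tuples of factors, giving each digit its own token, and being able to reconstruct the source sentence can be illustrated with a small Python toy. This is only a sketch under stated assumptions, not the segmenter's actual implementation, API, or output format: the names `FactoredToken`, `factorize`, and `reconstruct`, and the factor names `cap` and `glue_left`, are hypothetical. The toy does no SentencePiece subword splitting and assumes the input uses single spaces with no trailing whitespace.

```python
import re
from typing import List, NamedTuple, Tuple


class FactoredToken(NamedTuple):
    """Hypothetical factor tuple; not the segmenter's real format."""
    lemma: str       # case-normalized form, shared by all casing variants
    cap: str         # "lower", "initial", "all", "mixed", or "none"
    glue_left: bool  # True if no space precedes this token


# A run of letters, a single digit (any script), or one other non-space char.
_TOKEN = re.compile(r"[^\W\d_]+|\d|\S")


def _restore(lemma: str, cap: str) -> str:
    """Apply the casing factor back onto the lemma."""
    if cap == "initial":
        return lemma[:1].upper() + lemma[1:]
    if cap == "all":
        return lemma.upper()
    return lemma  # "lower", "mixed", and "none" keep the lemma unchanged


def _factor_case(word: str) -> Tuple[str, str]:
    """Split a surface form into (lemma, casing factor)."""
    if word.lower() == word.upper():
        return word, "none"  # digits, punctuation, caseless scripts
    lemma = word.lower()
    for cap in ("lower", "initial", "all"):
        if _restore(lemma, cap) == word:
            return lemma, cap
    return word, "mixed"  # e.g. "iPhone": keep the surface form as-is


def factorize(text: str) -> List[FactoredToken]:
    """Turn text into factor tuples; every digit becomes its own token."""
    tokens = []
    prev_end = 0
    for match in _TOKEN.finditer(text):
        lemma, cap = _factor_case(match.group())
        tokens.append(FactoredToken(lemma, cap, match.start() == prev_end))
        prev_end = match.end()
    return tokens


def reconstruct(tokens: List[FactoredToken]) -> str:
    """Rebuild the original text from the factor tuples."""
    parts = []
    for tok in tokens:
        if not tok.glue_left:
            parts.append(" ")
        parts.append(_restore(tok.lemma, tok.cap))
    return "".join(parts)


if __name__ == "__main__":
    sentence = "Hello WORLD, it costs 42 dollars!"
    for tok in factorize(sentence):
        print(tok)
    assert reconstruct(factorize(sentence)) == sentence
```

In this toy, `Word`, `word`, and `WORD` all share the lemma `word` and differ only in the `cap` factor, which is the kind of parameter sharing that a factored representation makes possible.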
