choppa

Partial python port of java SRX segmenter, originally written by Jarek Lipski.

In a nutshell it allows you to tokenize texts into the sentences (but generally it's rule-based, so you can chop anything textual).

Shipped with segment.srx set of segmentation rules for different languages, crafted by the great team of languagetool.

Current status and plans

That port currently covers:

All structures (structures.py) necessary for parser to operate (Rule, LanguageRule, LanguageMap)
Abstract and Accurate legacy iterator (iterators.py) which basically segments text into the chunks according to the SRX rules
SAX based parser (srx_parser.py) to read SRX rules from xml files (SRX2.0 only)
SrxDocument (again srx_parser.py) class which allows you to manage rules and cache regexes
A partial implementation of Java Matcher class, which is absent in python.
Tests for everything above (and beyond)
Additional tests from LanguageTool for Ukrainian language
Type hints

I also pythonized the code to the some extent (by removing some of setters/getters, snake_casing methods and variables and adapting data structures).

Important notes

First and foremost, I would like to thank Jarek for his work and the quality of his code. My project is not original, it just brings the power of srx segmenter to python world. And it relies completely on the work done by Jarek.

Please pay attention to the fact that only Accurate iterator is currently implemented (and I don't have immediate plans to implement the rest). Accurate Iterator should work well on a relatively small documents (i.e do not use it on multi GB plaintext corpora!). Other iterators from original library allows to parse large documents efficiently while sacrificing some accuracy (limiting look-behind patterns, etc). If you really need it — I'm always open for the pull requests. Similary, I've only implemented SAX reader for rules and using xmlschema package for schema validation. Last but not least, various readers aren't ported from the original library too, in my opinion, python already has a lot of built-in tools for that behaviour.

Also, I don't have any plan of porting UI at all. You can simply reuse some of UI's available.

Copyrights and kudos

Python port: Dmytro Chaplynskyi
Original Java implementation: Jarek Lipski
Segmentation rules: Daniel Naber, Jaume Ortolà et al (153 contributors!)
Special thanks to Andriy Rysin, driving force behind Ukrainian language in LanguageTool

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
choppa		choppa
data		data
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

choppa

Current status and plans

Important notes

Copyrights and kudos

About

Releases

Packages

Languages

License

proger/choppa

Folders and files

Latest commit

History

Repository files navigation

choppa

Current status and plans

Important notes

Copyrights and kudos

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages