forked from lang-uk/choppa

proger/choppa

 
 


choppa

A partial Python port of the Java SRX segmenter originally written by Jarek Lipski.

In a nutshell, it lets you split text into sentences. Because segmentation is rule-based, you can chop up anything textual, given a suitable set of rules.

It ships with segment.srx, a set of segmentation rules for different languages crafted by the great LanguageTool team.

Current status and plans

The port currently covers:

  • All structures (structures.py) the parser needs to operate (Rule, LanguageRule, LanguageMap)
  • The Abstract and Accurate legacy iterators (iterators.py), which segment text into chunks according to the SRX rules
  • A SAX-based parser (srx_parser.py) that reads SRX rules from XML files (SRX 2.0 only)
  • The SrxDocument class (also in srx_parser.py), which lets you manage rules and cache regexes
  • A partial implementation of the Java Matcher class, which has no direct equivalent in Python
  • Tests for everything above (and beyond)
  • Additional tests from LanguageTool for the Ukrainian language
  • Type hints
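As a rough illustration of what rule-based segmentation in the SRX style means, here is a minimal standard-library sketch. It is not the code from iterators.py: the `Rule` fields, the `segment` function and `EXAMPLE_RULES` are hypothetical names made up for this example. It assumes the usual SRX convention that rules are tried in order at every position and the first rule that matches decides whether the text breaks there.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape, for illustration only."""
    is_break: bool   # True: break here; False: exception that forbids a break
    before: str      # regex that must match the text ending at the position
    after: str = ""  # regex that must match the text starting at the position


def segment(text: str, rules: List[Rule]) -> Iterator[str]:
    """Yield segments of `text`; the first rule matching at a position wins."""
    compiled = [
        (r.is_break, re.compile(rf"(?:{r.before})\Z"), re.compile(r.after))
        for r in rules
    ]
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # `before` is anchored to `pos` via endpos and \Z,
            # `after` is anchored to `pos` via match().
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    yield text[start:pos]
                    start = pos
                break  # later rules are ignored at this position
    if start < len(text):
        yield text[start:]


EXAMPLE_RULES = [
    Rule(is_break=False, before=r"\b(?:Mr|Dr)\.\s"),  # exception: abbreviations
    Rule(is_break=True, before=r"[.?!]\s"),           # break after end punctuation
]

print(list(segment("Dr. Smith arrived. He sat down.", EXAMPLE_RULES)))
# → ['Dr. Smith arrived. ', 'He sat down.']
```

This naive version rescans the text before each position, so it is only meant for short strings. Putting the exception rule first is what keeps "Dr." from ending a sentence.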

I also pythonized the code to some extent (removing some of the setters/getters, snake_casing methods and variables, and adapting data structures).
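To give a feel for this kind of change, the sketch below contrasts a getter-based class with a dataclass that uses plain snake_case attributes. Both classes and all field names are hypothetical and chosen for illustration; they are not copied from structures.py or from the Java original.

```python
from dataclasses import dataclass


class JavaStyleRule:
    """Getter-based style, as a direct transliteration from Java might look."""

    def __init__(self, is_break: bool, before_pattern: str, after_pattern: str):
        self._is_break = is_break
        self._before_pattern = before_pattern
        self._after_pattern = after_pattern

    def is_break(self) -> bool:
        return self._is_break

    def get_before_pattern(self) -> str:
        return self._before_pattern

    def get_after_pattern(self) -> str:
        return self._after_pattern


@dataclass(frozen=True)
class Rule:
    """Pythonized style: plain attributes, generated __init__/__eq__/__repr__."""
    is_break: bool
    before_pattern: str
    after_pattern: str


rule = Rule(is_break=True, before_pattern=r"[.?!]+", after_pattern=r"\s")
print(rule.before_pattern)  # attribute access replaces get_before_pattern()
```

Making the dataclass frozen keeps instances immutable and hashable, which is convenient when rules are used as cache keys. Whether the port does this is an assumption of the sketch.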

Important notes

First and foremost, I would like to thank Jarek for his work and the quality of his code. My project is not original: it just brings the power of the SRX segmenter to the Python world, and it relies completely on the work done by Jarek.

Please note that only the Accurate iterator is currently implemented, and I have no immediate plans to implement the rest. The Accurate iterator should work well on relatively small documents (i.e. do not use it on multi-GB plaintext corpora!). The other iterators from the original library can parse large documents efficiently at the cost of some accuracy (limiting look-behind patterns, etc.). If you really need them, I'm always open to pull requests. Similarly, I've only implemented the SAX reader for rules, and I use the xmlschema package for schema validation. Last but not least, the various readers from the original library aren't ported either: in my opinion, Python already has plenty of built-in tools for that.
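For readers unfamiliar with SAX, the sketch below shows how SRX 2.0 rules can be collected with the standard library's `xml.sax`. It is not the code from srx_parser.py: `SrxHandler`, `parse_srx` and `SRX_SAMPLE` are hypothetical names, the sample is a trimmed fragment that is not validated against the SRX schema, and schema validation with xmlschema is left out.

```python
import xml.sax
from typing import Dict, List, Optional, Tuple

# Trimmed sample using SRX 2.0 element names; not schema-validated.
SRX_SAMPLE = r"""<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\b(?:Mr|Dr)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
"""


class SrxHandler(xml.sax.ContentHandler):
    """Collects (is_break, before, after) tuples per language rule name."""

    def __init__(self) -> None:
        super().__init__()
        self.rules: Dict[str, List[Tuple[bool, str, str]]] = {}
        self.language_maps: List[Tuple[str, str]] = []
        self._language = ""
        self._is_break = True
        self._parts: Dict[str, str] = {}
        self._current: Optional[str] = None

    def startElement(self, name, attrs):
        if name == "languagerule":
            self._language = attrs["languagerulename"]
            self.rules[self._language] = []
        elif name == "rule":
            # Assumption: a rule breaks unless it says break="no".
            self._is_break = attrs.get("break", "yes") != "no"
            self._parts = {"beforebreak": "", "afterbreak": ""}
        elif name in ("beforebreak", "afterbreak"):
            self._current = name
        elif name == "languagemap":
            self.language_maps.append(
                (attrs["languagepattern"], attrs["languagerulename"])
            )

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        if self._current is not None:
            self._parts[self._current] += content

    def endElement(self, name):
        if name in ("beforebreak", "afterbreak"):
            self._current = None
        elif name == "rule":
            self.rules[self._language].append(
                (self._is_break, self._parts["beforebreak"], self._parts["afterbreak"])
            )


def parse_srx(source: str) -> SrxHandler:
    handler = SrxHandler()
    xml.sax.parseString(source.encode("utf-8"), handler)
    return handler


parsed = parse_srx(SRX_SAMPLE)
print(parsed.rules["English"])
print(parsed.language_maps)
```

Because SAX streams events instead of building a tree, the handler only keeps the state of the element it is currently inside, which is why the rule is assembled in `endElement`.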

Also, I have no plans to port the UI at all. You can simply reuse one of the existing UIs.

Copyrights and kudos

  • Python port: Dmytro Chaplynskyi
  • Original Java implementation: Jarek Lipski
  • Segmentation rules: Daniel Naber, Jaume Ortolà et al. (153 contributors!)
  • Special thanks to Andriy Rysin, the driving force behind the Ukrainian language in LanguageTool
