Skip to content

Latest commit

 

History

History

python

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 

cutters

A rule based sentence segmentation library.
Python bindings for the cutters library written in Rust.

Release License Downloads

🚧 This library is experimental. 🚧

Features

  • Full UTF-8 support.
  • Robust parsing.
  • Language specific rules (each defined by its own PEG).
  • Fast and memory efficient parsing via the pest library.
  • Sentences can contain quotes which can contain subsentences.

Supported languages

  • Croatian (standard)
  • English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by UTF-8. Its intended use is for benchmarking.

Example

After installing the cutters package with pip, usage is simple (note that the language is defined via ISO 639-1 two letter language codes).

import cutters

text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
""";

sentences = cutters.cut(text, "hr");

print(sentences);

This results in the following output (note that the str struct fields are &str).

[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]