cutters

A rule-based sentence segmentation library.
Python bindings for the cutters library written in Rust.

🚧 This library is experimental. 🚧

Features

Full UTF-8 support.
Robust parsing.
Language-specific rules (each defined by its own PEG).
Fast and memory-efficient parsing via the pest library.
Sentences can contain quotes, which can in turn contain subsentences.

Supported languages

Croatian (standard)
English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
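To illustrate why a terminal-only splitter is useful mainly as a benchmark reference, here is a minimal sketch of that idea in plain Python. It is not the Baseline implementation from cutters: `naive_split` and `TERMINALS` are hypothetical names, and only a small ASCII subset of sentence terminals is handled.

```python
import re

# Hypothetical stand-in for a terminal-only splitter; NOT the cutters Baseline
# implementation. Only a small ASCII subset of sentence terminals is used.
TERMINALS = ".!?"


def naive_split(text):
    """Cut after every run of terminal characters, ignoring all context."""
    escaped = re.escape(TERMINALS)
    pattern = rf"[^{escaped}]+[{escaped}]*"
    chunks = (match.group().strip() for match in re.finditer(pattern, text))
    return [chunk for chunk in chunks if chunk]


pieces = naive_split("Petar Krešimir IV. je vladao od 1058. do 1074.")
print(pieces)
# ['Petar Krešimir IV.', 'je vladao od 1058.', 'do 1074.']
```

The single sentence is cut into three pieces because the ordinals "IV." and "1058." end in a period. Handling such cases is what the language-specific rules are for.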

Example

After installing the cutters package with pip, usage is simple. Note that the language is specified by its two-letter ISO 639-1 code.

import cutters

# Croatian sample text containing ordinals, abbreviations, and a quotation.
text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
"""

# The second argument is the ISO 639-1 language code ("hr" is Croatian).
sentences = cutters.cut(text, "hr")

print(sentences)

This results in the following output (note that in the underlying Rust library the str struct fields are of type &str).

[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]
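The nested structure above can be walked to list sentences together with the subsentences inside their quotes. The sketch below is a hypothetical helper, not part of cutters. It assumes the returned objects expose `str`, `quotes`, and `sentences` as attributes mirroring the printed fields, which has not been verified against the bindings, so it runs on stand-in data instead of a real `cutters.cut` result.

```python
from types import SimpleNamespace


def flatten(sentences):
    """Yield (depth, text): depth 0 for sentences, 1 for quoted subsentences."""
    for sentence in sentences:
        yield 0, sentence.str
        for quote in sentence.quotes:
            for subsentence in quote.sentences:
                yield 1, subsentence


# Stand-in data shaped like the printed output (attribute names are assumed).
quote_text = (
    "Sve sretne obitelji nalik su jedna na drugu. "
    "Svaka nesretna obitelj nesretna je na svoj način."
)
demo = [
    SimpleNamespace(str="Volim rock, punk, funk, pop itd.", quotes=[]),
    SimpleNamespace(
        str=f'Tolstoj je napisao: "{quote_text}"',
        quotes=[
            SimpleNamespace(
                str=quote_text,
                sentences=[
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            )
        ],
    ),
]

rows = list(flatten(demo))
for depth, row_text in rows:
    # Indent quoted subsentences under the sentence that contains them.
    print("    " * depth + row_text)
```

If the attribute assumption holds, `demo` could be replaced with the result of `cutters.cut(text, "hr")`.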
