GitHub - williamvoor/langid: Language Detection


Language detection pipeline

A modular language detection pipeline that uses the chain-of-responsibility pattern to process a detection request in stages. The current implementation has two stages: the first trains on a set of known texts and builds dictionaries, and the second computes a simple score by counting the matches between tokens in the target text and the training dictionaries.
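The two-stage chain described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: every class and method name (`DetectionRequest`, `Stage`, `TrainingStage`, `ScoringStage`, `tokenize`) is hypothetical, the tokenizer is an assumed lowercase split on non-letter characters, and a plain `HashSet` stands in for the dictionary.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names are taken from the repository.

/** Carries the input and the intermediate results from stage to stage. */
class DetectionRequest {
    final String targetText;
    final Map<String, String> trainingTexts; // language name -> known text
    final Map<String, Set<String>> dictionaries = new LinkedHashMap<>();
    final Map<String, Integer> scores = new LinkedHashMap<>();

    DetectionRequest(String targetText, Map<String, String> trainingTexts) {
        this.targetText = targetText;
        this.trainingTexts = trainingTexts;
    }

    /** Returns the language with the highest score, or null if nothing was scored. */
    String bestLanguage() {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}

/** Base handler: does its own work, then forwards the request to the next stage. */
abstract class Stage {
    private Stage next;

    /** Links the following stage and returns it so calls can be chained. */
    Stage then(Stage next) {
        this.next = next;
        return next;
    }

    /** Runs this stage, then passes the request down the chain. */
    void handle(DetectionRequest request) {
        process(request);
        if (next != null) {
            next.handle(request);
        }
    }

    protected abstract void process(DetectionRequest request);

    /** Assumed tokenizer: lowercases the text and splits on runs of non-letters. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }
}

/** Stage 1: builds one dictionary of tokens per known language. */
class TrainingStage extends Stage {
    @Override
    protected void process(DetectionRequest request) {
        for (Map.Entry<String, String> e : request.trainingTexts.entrySet()) {
            Set<String> dictionary = new HashSet<>();
            for (String token : tokenize(e.getValue())) {
                if (!token.isEmpty()) {
                    dictionary.add(token);
                }
            }
            request.dictionaries.put(e.getKey(), dictionary);
        }
    }
}

/** Stage 2: scores each language by the number of target tokens found in its dictionary. */
class ScoringStage extends Stage {
    @Override
    protected void process(DetectionRequest request) {
        String[] tokens = tokenize(request.targetText);
        for (Map.Entry<String, Set<String>> e : request.dictionaries.entrySet()) {
            int matches = 0;
            for (String token : tokens) {
                if (e.getValue().contains(token)) {
                    matches++;
                }
            }
            request.scores.put(e.getKey(), matches);
        }
    }
}
```

In this sketch a chain would be assembled with `Stage chain = new TrainingStage(); chain.then(new ScoringStage()); chain.handle(request);`, after which `request.bestLanguage()` returns the highest-scoring language. Adding a further stage would only require another `then(...)` call, which is the main appeal of the pattern here.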

The dictionaries are implemented with a Bloom filter, a space-efficient probabilistic data structure. See https://github.com/magnuss/java-bloomfilter.
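To show why a Bloom filter suits a token dictionary, here is a small self-contained sketch. It does not use or reproduce the API of the linked library. `SimpleBloomFilter`, its constructor parameters, and the choice of double hashing with `String.hashCode()` plus FNV-1a are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical minimal Bloom filter for string tokens (not the linked library's API). */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per token

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets hashCount bit positions derived from the token. */
    void add(String token) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(token, i));
        }
    }

    /**
     * Returns false if the token was definitely never added.
     * Returns true if it was added, or, occasionally, as a false positive.
     */
    boolean mightContain(String token) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(token, i))) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing: combines two base hashes to derive the i-th bit position. */
    private int index(String token, int i) {
        int h1 = token.hashCode();
        int h2 = fnv1a(token);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** 32-bit FNV-1a hash over the token's UTF-8 bytes. */
    private static int fnv1a(String token) {
        int hash = 0x811C9DC5;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

The filter stores only bits, never the tokens themselves, so memory use stays fixed no matter how long the training texts are. The trade-off is that `mightContain` can report a token that was never added, which for a match-count score means a language may occasionally be credited with a spurious match.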

Compiling

mvn package

Running examples

./run.sh ENGLISH.TEXT

or

java -jar target/langid-0.0.1-SNAPSHOT-jar-with-dependencies.jar ENGLISH.TEXT

Training data

Training data files (e.g. ENGLISH.1, ENGLISH.2) are located in src/main/resources. Copy new files to that directory. If a new file is in a language that is not yet listed in the Language enumeration, add a new constant for it.
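As a rough illustration of that last step, the sketch below adds a constant for a new language. The constant list, the `SPANISH` example, and the `fromFileName` helper are assumptions; the actual Language enumeration in the repository may be declared differently and may map files to languages in another way.

```java
import java.util.Locale;

/** Hypothetical version of the Language enumeration. */
enum Language {
    ENGLISH,
    GERMAN,
    ITALIAN,
    POLISH,
    PORTUGUESE,
    SPANISH; // new constant, added for hypothetical files SPANISH.1, SPANISH.2, ...

    /**
     * Assumed helper: maps a training file name such as "ENGLISH.1" to its constant
     * by taking the part before the first dot. Throws IllegalArgumentException
     * when no constant matches.
     */
    static Language fromFileName(String fileName) {
        int dot = fileName.indexOf('.');
        String prefix = dot < 0 ? fileName : fileName.substring(0, dot);
        return Language.valueOf(prefix.toUpperCase(Locale.ROOT));
    }
}
```

Under this assumed naming scheme, a file whose prefix has no matching constant fails fast with an `IllegalArgumentException`, which makes a forgotten enumeration entry easy to spot.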
