textmining

Tiny text mining package for Python.

See textmining-dataprep repository for details on data preparation.

Setup

pip install textmining

Usage

Tokenization

import tm

text = u"Les chaussettes de l'archiduchesse sont-elles sèches, archi-sèches ?"
for token in tm.tokenize(text):
    print token

Output:

Les
chaussettes
de
l
archiduchesse
sont
elles
sèches
archi
sèches

import tm

text = u"Les chaussettes de l'archiduchesse sont-elles sèches, archi-sèches ?"
for token in tm.tokenize(text, split_on_dash=False):
    print token

Output:

Les
chaussettes
de
l
archiduchesse
sont-elles
sèches
archi-sèches

Applying successive transformations

import tm

print tm.transform(u"Archi-sèches", [tm.to_lower, tm.substitute_accents])

Output:

archi-seches

Tokenization and transformation

import tm

transformations = [tm.clean, \
    tm.min_len(1), \
    tm.to_lower, \
    tm.remove_stopwords, \
    tm.substitute_accents]

text = u"Les chaussettes de l'archiduchesse sont-elles sèches, archi-sèches ?"
tokens = tm.tokenize_transform(text, transformations)

print " ".join(tokens)

Output:

chaussettes archiduchesse seches archi seches

First name detection

import tm

for firstname in [u"Mathilde", u"Gaëlle", u"Grégory", u"Rennes", u"Yann"]:
    print "%s is %sa first name." % (firstname, "" if tm.is_firstname(firstname) else "not ")

Output:

Mathilde is a first name.
Gaëlle is a first name.
Grégory is a first name.
Rennes is not a first name.
Yann is a first name.

Gender detection

from firsname import FirstName

for firstname in [u"Mathilde", u"Gaëlle", u"Grégory", u"Yann"]:
	print "%s: %s" % (firstname, FirstName.get_singleton().get_gender_confidence(firstname))

Output:

Mathilde: (2, 0.9991225231056928)
Gaëlle: (2, 1.0)
Grégory: (1, 1.0)
Yann: (1, 0.9993874425727411)

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
textmining		textmining
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
README.rst		README.rst
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

textmining

Setup

Usage

Tokenization

Applying successive transformations

Tokenization and transformation

First name detection

Gender detection

About

Releases

Packages

Languages

License

michelcaradec/textmining

Folders and files

Latest commit

History

Repository files navigation

textmining

Setup

Usage

Tokenization

Applying successive transformations

Tokenization and transformation

First name detection

Gender detection

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages