Preprocessor is a preprocessing library for tweet data written in Python.
When building Machine Learning systems based on tweet data, a preprocessing is required. This library makes it easy to clean, parse or tokenize the tweets.
Currently supports cleaning, tokenizing and parsing:
- URLs
- Hashtags
- Mentions
- Reserved words (RT, FAV)
- Emojis
- Smileys
- Number
JSON
and.txt
file support
Preprocessor v0.6.0
supports Python 3.4+ on Linux, macOS and Windows
. Tests run on following setups:
Linux Xenial with Python 3.4.8, 3.5.6, 3.6.7, 3.7.1, 3.8.0, 3.8.3+
macOS with Python 3.7.5, 3.8.0
Windows with Python 3.5.4, 3.6.8
>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'
>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
Preprocessor will go through all of the options by default unless you specify some options.
Preprocessor currently supports processing .json
and .txt
formats. Please see below examples for the correct input format.
[
"Preprocessor now supports files. https://github.com/s/preprocessor",
"#preprocessing is a cruical part of @ML projects.",
"@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl"
]
Preprocessor now supports files. https://github.com/s/preprocessor
#preprocessing is a cruical part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl
# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(file_name, options=[p.OPT.URL, p.OPT.MENTION])
# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(file_name, options=[p.OPT.URL, p.OPT.MENTION])
Option Name | Option Short Code |
---|---|
URL | p.OPT.URL |
Mention | p.OPT.MENTION |
Hashtag | p.OPT.HASHTAG |
Reserved Words | p.OPT.RESERVED |
Emoji | p.OPT.EMOJI |
Smiley | p.OPT.SMILEY |
Number | p.OPT.NUMBER |
using pip:
$ pip install tweet-preprocessor
using manual installation:
$ python setup.py build
$ python setup.py install
Are you willing to contribute to preprocessor? That's great! Please follow below steps to contribute to this project:
-
Create a bug report or a feature idea using the templates on Issues page.
-
Fork the repository and make your changes.
-
Open a PR and make sure your PR has tests and all the checks pass.
-
And that's all!