The tweets preprocessor module was developed by the AUTH team as part of the PlasticTwist Crowdsourcing module.
The tweets-preprocessor module is not yet available through PyPI, so it has to be imported manually.
$ pip install -r requirements.txt
Then run utils/requirements_installer.py to install the additional dependencies automatically.
The module is written in a functional style and exposes a fluent API, so pre-processing methods can be chained. Users can either call individual pre-processing methods or use the fully_preprocess method to apply all of the pre-processing methods to their text.
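To illustrate what a fluent API means in practice, here is a minimal sketch of the pattern. The class `FluentTextCleaner` and its `apply_all` method are hypothetical stand-ins written for this example and are not the module's actual code; the only idea being shown is that each method updates the text and returns the object itself, which is what makes chaining possible.

```python
import re


class FluentTextCleaner:
    """Hypothetical stand-in used to show the fluent pattern only."""

    def __init__(self, text):
        self.text = text

    def remove_mentions(self):
        # Drop '@' followed by word characters (simplified assumption).
        self.text = re.sub(r"@\w+", "", self.text)
        return self  # returning self is what enables chaining

    def remove_blank_spaces(self):
        # Collapse runs of whitespace and trim both ends.
        self.text = re.sub(r"\s+", " ", self.text).strip()
        return self

    def apply_all(self):
        # Convenience method that applies every step in a fixed order.
        return self.remove_mentions().remove_blank_spaces()


cleaner = FluentTextCleaner("Hello @ptwist   world")
cleaner.remove_mentions().remove_blank_spaces()
print(cleaner.text)
```

Because every step returns the same object, the caller can pick only the steps they need or call the single convenience method, and the result is always read from the `text` attribute.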
The following methods are currently available:
- remove_urls: Removes all URLs (e.g. 'https://ptwist.eu')
- remove_mentions: Removes all mentions (e.g. '@PlasticTwistBot')
- remove_hashtags: Removes all hashtags (e.g. '#plastictwist')
- remove_twitter_reserved_words: Removes Twitter reserved words (e.g. 'RT', 'via')
- remove_punctuation: Removes punctuation (e.g. '.', '!')
- remove_single_letter_words: Removes single-letter words (e.g. 'b', 'f')
- remove_blank_spaces: Removes blank spaces
- remove_stopwords: Removes stopwords (e.g. 'a', 'at', 'here')
  - has an extra_stopwords parameter (list) that allows users to add extra stopwords
- remove_profane_words: Removes profane words
- remove_numbers: Removes numbers (e.g. '2', '999')
  - has a preserve_years parameter (boolean) that allows users to choose whether or not years should be removed
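The two optional parameters above can be sketched with standalone functions. This is an illustration under stated assumptions, not the module's implementation: the stopword set is a tiny placeholder, and a "year" is assumed to be a four-digit number starting with 19 or 20, which may differ from what the module actually does.

```python
import re

# Placeholder stopword set; the module's real list is not reproduced here.
BASE_STOPWORDS = {"a", "at", "here"}


def remove_stopwords(text, extra_stopwords=None):
    """Drop words in the base set plus any caller-supplied extras."""
    stopwords = BASE_STOPWORDS | {w.lower() for w in (extra_stopwords or [])}
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


def remove_numbers(text, preserve_years=False):
    """Drop standalone numbers, optionally keeping four-digit years."""

    def replace(match):
        token = match.group()
        # Assumption: a year is a four-digit number starting with 19 or 20.
        if preserve_years and re.fullmatch(r"(19|20)\d{2}", token):
            return token
        return ""

    return re.sub(r"\b\d+\b", replace, text)


print(remove_stopwords("here is a cat at home", extra_stopwords=["is"]))
print(remove_numbers("Best 2 texts of 2019", preserve_years=True))
```

Removing a number leaves the surrounding spaces in place, so in this sketch a whitespace clean-up step would normally follow.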
from twitter_preprocessor import TwitterPreprocessor
p = TwitterPreprocessor('Some @ptwist text to be preprocessed. It contains 2 sentences. Best text 2019!')
p.remove_mentions().remove_punctuation().remove_numbers(preserve_years=True).remove_blank_spaces()
print(p.text)
# 'Some text to be preprocessed It contains sentences Best text 2019'
from twitter_preprocessor import TwitterPreprocessor
p = TwitterPreprocessor('RT @ptwist This text contains mentions, urls, some Twitter words and some stopwords to be preprocessed via https://example.com.')
p.fully_preprocess()
print(p.text)
# 'This text contains mentions urls Twitter words stopwords preprocessed'
This project is licensed under the GPL 3.0 license.
Developed and maintained by: vasisouv, alextsil, idimitriadis