neattext

NeatText: a simple NLP package for cleaning textual data and text preprocessing

Problem

  • Cleaning unstructured text data
  • Reducing noise (special characters, stopwords)
  • Avoiding repetition of the same text preprocessing code across projects

Solution

  • Convert the well-known steps for cleaning text into a reusable package

Installation

pip install neattext
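Once installed, a quick sanity check from the Python shell, using the clean_text function documented below (the sample string here is only an illustration; output omitted):

>>> from neattext.functions import clean_text
>>> clean_text("Hello!!, visit our site https://example.com 😊")   # returns a cleaned, lowercased string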

Usage

  • The OOP Way (Object-Oriented Way)
  • NeatText offers 4 main classes for working with text data (a quick import sketch follows this list)
    • TextFrame : a frame-like object for cleaning text
    • TextCleaner : remove or replace specific parts of a text
    • TextExtractor : extract emails, urls, numbers, emojis, etc. from text
    • TextMetrics : word stats and metrics
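A minimal sketch of how these classes are created, mirroring the usage shown in the sections below (mytext is just a sample string):

>>> from neattext import TextFrame, TextCleaner, TextExtractor, TextMetrics
>>> mytext = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx = TextFrame(text=mytext)    # cleaning + metrics + frame-like helpers
>>> cleaner = TextCleaner(mytext)    # remove or replace noisy parts of the text
>>> extractor = TextExtractor()      # extract emails, urls, emojis, etc.
>>> extractor.text = mytext
>>> metrics = TextMetrics()          # vowel/consonant/stopword counts and word stats
>>> metrics.text = mytext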

Overall Components of NeatText

Using TextFrame

  • Keeps the text as a TextFrame object. This allows us to do more with our text.
  • It inherits the benefits of TextCleaner and TextMetrics out of the box, plus some additional features for handling text data.
  • This is the simplest way to preprocess text with this library; alternatively, you can use the other classes directly.
>>> import neattext as nt 
>> mytext = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊."
>>> docx = nt.TextFrame(text=mytext)
>>> docx.text 
"This is the mail [email protected] ,our WEBSITE is https://example.com 😊."
>>>
>>> docx.describe()
Key                  Value
Length             : 73
vowels             : 21
consonants         : 34
stopwords          : 4
punctuations       : 8
special_char       : 8
tokens(whitespace) : 10
tokens(words)      : 14
>>> 
>>> docx.length
73
>>> # Scan Percentage of Noise(Unclean data) in text
>>> docx.noise_scan()
{'text_noise': 19.17808219178082, 'text_length': 73, 'noise_count': 14}
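>>> # i.e. text_noise = noise_count / text_length * 100 = 14 / 73 * 100 ≈ 19.18%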
>>> 
>>> docx.head(16)
'This is the mail'
>>> docx.tail()
>>> docx.count_vowels()
>>> docx.count_stopwords()
>>> docx.count_consonants()
>>> docx.nlongest()
>>> docx.nshortest()
>>> docx.readability()

Basic NLP Tasks (Tokenization, Ngrams, Text Generation)

>>> docx.word_tokens()
>>>
>>> docx.sent_tokens()
>>>
>>> docx.term_freq()
>>>
>>> docx.bow()
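The tokens can also feed directly into standard Python tooling. A minimal sketch, assuming word_tokens() returns a plain list of strings (output omitted):

>>> from collections import Counter
>>> tokens = docx.word_tokens()        # assumed: a plain list of word strings
>>> Counter(tokens).most_common(3)     # three most frequent tokens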

Basic Text Preprocessing

>>> docx.normalize()
'this is the mail example@gmail.com ,our website is https://example.com 😊.'
>>> docx.normalize(level='deep')
'this is the mail examplegmailcom our website is httpsexamplecom '

>>> docx.remove_puncts()
>>> docx.remove_stopwords()
>>> docx.remove_html_tags()
>>> docx.remove_special_characters()
>>> docx.remove_emojis()
>>> docx.fix_contractions()

Handling Files with NeatText

  • Read a txt file directly into a TextFrame
>>> import neattext as nt 
>>> docx_df = nt.read_txt('file.txt')
  • Alternatively, you can instantiate a TextFrame and read a text file into it
>>> import neattext as nt 
>>> docx_df = nt.TextFrame().read_txt('file.txt')
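Since read_txt loads the file into a TextFrame, the same cleaning and analysis methods shown above apply to it. A minimal sketch ('file.txt' is a placeholder path):

>>> import neattext as nt 
>>> docx_df = nt.read_txt('file.txt')
>>> cleaned = docx_df.remove_stopwords().remove_puncts()   # chain the documented cleaning methods
>>> cleaned.word_tokens()                                  # then tokenize, count, etc.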

Chaining Methods on TextFrame

>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = nt.TextFrame(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail  ,our WEBSITE is   and it will cost $100 to subscribe.'

Clean Text

  • Clean text by removing emails, numbers, stopwords, emojis, etc.
  • A simplified function for cleaning text by specifying, as True/False, what to remove from it
>>> from neattext.functions import clean_text
>>> 
>>> mytext = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊."
>>> 
>>> clean_text(mytext)
'mail [email protected] ,our website https://example.com .'
  • You can remove punctuations, stopwords, urls, emojis, multiple whitespaces, etc. by setting the corresponding options to True (a sketch of the remaining options follows at the end of this section).

  • You can choose to remove or keep punctuations by setting puncts to True or False respectively

>>> clean_text(mytext,puncts=True)
'mail example@gmailcom website https://examplecom '
>>> 
>>> clean_text(mytext,puncts=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,puncts=False,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>> 
  • You can also remove the other unwanted items accordingly
>>> clean_text(mytext,stopwords=False)
'this is the mail example@gmail.com ,our website is https://example.com .'
>>>
>>> clean_text(mytext,urls=False)
'mail example@gmail.com ,our website https://example.com .'
>>> 
>>> clean_text(mytext,urls=True)
'mail example@gmail.com ,our website .'
>>> 
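The remaining options mentioned above can be toggled the same way. A minimal sketch, assuming the keyword names follow the list above (emojis and multiple_whitespaces are assumptions here); check help(clean_text) for the exact names and defaults:

>>> clean_text(mytext,emojis=False)               # assumed keyword: keep the emoji in the text
>>> clean_text(mytext,multiple_whitespaces=True)  # assumed keyword: collapse repeated whitespace
>>> help(clean_text)                              # lists every supported option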

Removing Punctuations [A Very Common Text Preprocessing Step]

  • You can remove the most common punctuation marks, such as the full stop, comma, exclamation mark and question mark, by setting most_common=True, which is the default
  • Alternatively, you can remove all known punctuation marks from a text by setting most_common=False
>>> import neattext as nt 
>>> mytext = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_puncts()
TextFrame(text="This is the mail example@gmailcom our WEBSITE is https://examplecom 😊 Please dont forget the email when you enter ")

>>> docx.remove_puncts(most_common=False)
TextFrame(text="This is the mail examplegmailcom our WEBSITE is httpsexamplecom 😊 Please dont forget the email when you enter ")

Removing Stopwords [A Very Common Text Preprocessing Step]

  • You can remove stopwords from a text by specifying the language. The default language is English
  • Supported languages include English (en), Spanish (es), French (fr), Russian (ru), Yoruba (yo) and German (de); a sketch for another language follows the example below
>>> import neattext as nt 
>>> mytext = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊. Please don't forget the email when you enter !!!!!"
>>> docx = nt.TextFrame(mytext)
>>> docx.remove_stopwords(lang='en')
TextFrame(text="mail [email protected] ,our WEBSITE https://example.com 😊. forget email enter !!!!!")

Remove Emails, Numbers, Phone Numbers, Dates, Btc Addresses, VisaCard Numbers, etc.

>>> print(docx.remove_emails())
'This is the mail  ,our WEBSITE is https://example.com 😊.'
>>>
>>> print(docx.remove_stopwords())
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> print(docx.remove_numbers())
>>> docx.remove_phone_numbers()
>>> docx.remove_btc_address()

Remove Special Characters

>>> docx.remove_special_characters()

Remove Emojis

>>> print(docx.remove_emojis())
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'

Remove Custom Pattern

  • You can also specify your own custom regex pattern, in case you cannot find what you need among the provided functions, using the remove_custom_pattern() function
>>> import neattext.functions as nfx 
>>> ex = "Last !RT tweeter multiple &#7777"
>>> 
>>> nfx.remove_custom_pattern(ex,r'&#\d+')
'Last !RT tweeter multiple  '
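The same function works with any regex. For instance, a sketch that targets the !RT marker instead (output omitted):

>>> nfx.remove_custom_pattern(ex,r'!RT')   # strips the "!RT" retweet marker from the text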

Replace Emails, Numbers, Phone Numbers

>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()

Chain Multiple Methods

>>> t1 = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊 and it will cost $100 to subscribe."
>>> docx = TextCleaner(t1)
>>> result = docx.remove_emails().remove_urls().remove_emojis()
>>> print(result)
'This is the mail  ,our WEBSITE is   and it will cost $100 to subscribe.'

Using TextExtractor

  • To extract emails, phone numbers, numbers, urls and emojis from text (a sketch for urls follows the example below)
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>>
>>> docx.extract_emojis()
['😊']
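Following the naming pattern above, the other extractors listed in the bullet should work analogously. A sketch (extract_urls is assumed from that list rather than shown verbatim elsewhere in this README):

>>> docx.extract_urls()     # assumed method name; expected to return ['https://example.com']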

Using TextMetrics

  • To find word stats such as the counts of vowels, consonants, stopwords and overall word statistics
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail [email protected] ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()

Usage

  • The MOP Way (method/function-oriented way)
>>> from neattext.functions import clean_text,extract_emails
>>> t1 = "This is the mail [email protected] ,our WEBSITE is https://example.com ."
>>> clean_text(t1,puncts=True,stopwords=True)
>>>'this mail examplegmailcom website httpsexamplecom'
>>> extract_emails(t1)
>>> ['[email protected]']
  • Alternatively, you can import the functions module and use it as follows
>>> import neattext.functions as nfx 
>>> t1 = "This is the mail [email protected] ,our WEBSITE is https://example.com ."
>>> nfx.clean_text(t1,puncts=True,stopwords=True)
>>>'this mail examplegmailcom website httpsexamplecom'
>>> nfx.extract_emails(t1)
>>> ['[email protected]']

Explainer

  • Explain an emoji or the unicode codepoint for an emoji
    • emoji_explainer()
    • emojify()
    • unicode_2_emoji()
>>> from neattext.explainer import emojify
>>> emojify('Smiley')
'😃'
>>> from neattext.explainer import emoji_explainer
>>> emoji_explainer('😃')
'SMILING FACE WITH OPEN MOUTH'
>>> from neattext.explainer import unicode_2_emoji
>>> unicode_2_emoji('0x1f49b')
'YELLOW HEART'

Documentation

Please read the documentation for more information on what neattext does and how to use it for your needs.

More Features To Add

  • basic NLP tasks
  • currency normalizer

Acknowledgements

  • Inspired by packages like clean-text by Johannes Filter and textify by JCharisTech

NB

  • Contributions are welcome
  • Noticed a bug? Please let us know.
  • Thanks a lot

By

  • Jesse E. Agbe (JCharis)
  • Jesus Saves @JCharisTech
