Skip to content
/ dobbi Public

An open-source NLP library: fast text cleaning and preprocessing

License

Notifications You must be signed in to change notification settings

iaramer/dobbi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

PyPI - Python Version Version GitHub

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides a quick and ready-to-use text preprocessing tools for text cleaning and normalization. You can simply remove hashtags, nicknames, emoji, url addresses, punctuation, whitespace and whatever.

Installation

To download dobbi, either fork this GitHub repo or simply use Pypi via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

import pandas as pd

d = {'text': ['#fun #lol   Why  @Alex33 is so funny here: https://some-url.com',
              '#looool     =)      😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)

cln_func = dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .function()
df['text'] = df['text'].map(cln_func)

repl_func = dobbi.replace() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
df['text'] = df['text'].map(repl_func)

Result:

print(df['text'][0])  # 'Why is so funny here'
print(df['text'][1])  # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain patterns in the needed order
  3. Terminal methods: choose if you need a function or a result

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and "<...>" type markups
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - any type of whitespaces
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and urls with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

3) Get the text cleanup function

func = dobbi.clean() \
    .url() \
    .hashtag() \
    .punctuation() \
    .whitespace() \
    .html() \
    .function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'
  1. Chain regexp methods
dobbi.clean() \
    .regexp('#\w+') \
    .regexp('@\w+') \
    .regexp('https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'
  1. Remove emoji and emoticons
em_func = dobbi.clean() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
em_func('Great! =) :D  😍 😋such lovely!?*!!!%&')

Result:

'Great such lovely'

Additional

Please pay attention that the functions are applied in the order you've specified them. So, you're better to chain .punctuation() as one of the last functions.

Call for collaboration 🤗

If you enjoyed the project I would be grateful if you supported it :)

Below is the list of useful features I would be happy to share with you:

  • Finding bugs
  • Making code optimizations
  • Writing tests
  • Help with new features development

About

An open-source NLP library: fast text cleaning and preprocessing

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages