EFord36/normalise

A module for normalising text.

Introduction

This module takes a text as input and returns it in fully normalised form, i.e. it expands everything that is not in a standard, readable format. Non-standard words (NSWs) are detected, classified, and expanded. Examples of NSWs that are normalised include:

  • Numbers - percentages, dates, currency amounts, ranges, telephone numbers.
  • Abbreviations and acronyms.
  • Web addresses and hashtags.
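
The detect, classify, and expand stages described above can be pictured with a deliberately tiny toy. This is not the module's implementation: the names `TOY_ABBREVIATIONS`, `toy_classify`, `toy_expand`, and `toy_normalise` are hypothetical, the lookup table is hand-written, and numbers are simply read out digit by digit, which is much cruder than what the module does.

```python
# Toy illustration only: NOT how the normalise module works internally.
# All names and tables below are hypothetical.
TOY_ABBREVIATIONS = {"Dr.": "Doctor", "Apr.": "April"}
TOY_DIGITS = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]


def toy_classify(token):
    """Detect and classify: return a coarse tag, or None for a standard word."""
    if any(ch in "0123456789" for ch in token):
        return "NUMB"
    if token in TOY_ABBREVIATIONS:
        return "ALPHA"
    return None


def toy_expand(token, tag):
    """Expand a token according to its tag; standard words pass through."""
    if tag == "ALPHA":
        return TOY_ABBREVIATIONS[token]
    if tag == "NUMB":
        # Crude expansion: read out each digit, ignore other characters.
        return " ".join(TOY_DIGITS[int(ch)] for ch in token if ch in "0123456789")
    return token


def toy_normalise(tokens):
    """Run detect/classify/expand over a list of tokens."""
    return [toy_expand(tok, toy_classify(tok)) for tok in tokens]


result = toy_normalise(["Dr.", "Banks", "paid", "35"])
```

The point of the sketch is only the shape of the pipeline: each token is tagged first, and the tag then decides which expansion rule applies.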

Installation

To install the module (on Windows, Mac OS X, Linux, etc.), first make sure that you have the latest versions of pip and setuptools, then install normalise:

$ pip install --upgrade pip setuptools

$ pip install normalise

If pip installation fails, you can try easy_install normalise.

Usage

Your input text can be a list of words, or a string.

To normalise your text, use the normalise function. This will return the text with NSWs replaced by their expansions:

# Assumes the package exposes normalise at the top level.
from normalise import normalise

text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]

normalise(text, verbose=True)

Out: 
['On',
 'the',
 'twenty-eighth of',
 'April',
 'twenty ten',
 ',',
 'Doctor',
 'Banks',
 'bought',
 'a',
 'chair',
 'for',
 'thirty five pounds',
 '.']

verbose=True displays the stages of the normalisation process, so you can monitor its progress. To turn this off, use verbose=False.

If your input is a string, you can use our basic tokenizer, tokenize_basic. For best results, pass your own custom tokenizer through the tokenizer argument instead.

normalise(text, tokenizer=tokenize_basic, verbose=True)
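
As a rough sketch of what a custom tokenizer might look like, the function below splits a string into words, numbers (keeping a leading currency sign and a trailing percent sign attached), and single punctuation marks. The name `simple_tokenize` is hypothetical, and it is an assumption here that the `tokenizer` argument accepts any callable that takes a string and returns a list of tokens.

```python
import re

# Hypothetical custom tokenizer; not part of the normalise package.
# Alternatives, tried in order:
#   1. a number with optional currency sign, internal . or , and trailing %
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
_TOKEN_PATTERN = re.compile(r"[£$€]?\d+(?:[.,]\d+)*%?|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split a string into a flat list of tokens."""
    return _TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Dr. Banks bought a chair for £35.")

# Assumed usage, if the tokenizer argument accepts such a callable:
# normalise("Dr. Banks bought a chair for £35.", tokenizer=simple_tokenize)
```

Note that this sketch separates "Dr." into "Dr" and ".", which may make abbreviations harder to detect; a tokenizer suited to your data would likely keep such periods attached.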

In order to see a list of all NSWs in your text, along with their index, tags, and expansion, use the list_NSWs function:

list_NSWs(text)

Out:
({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
  6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
 {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
  4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
  12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})
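
To illustrate how this output relates to the normalised text, the sketch below takes the values shown above as literal data (no call to the library is made) and substitutes each expansion at its index. It assumes the result is a tuple of two dictionaries keyed by token index, as the example output suggests; `apply_expansions` is a hypothetical helper, not part of the package.

```python
# Data copied from the example above; nothing here calls the library.
tokens = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks",
          "bought", "a", "chair", "for", "£35", "."]

nsws = (
    {3: ("Apr.", "ALPHA", "EXPN", "April"),
     6: ("Dr.", "ALPHA", "EXPN", "Doctor")},
    {2: ("28", "NUMB", "NORD", "twenty-eighth of"),
     4: ("2010", "NUMB", "NYER", "twenty ten"),
     12: ("£35", "NUMB", "MONEY", "thirty five pounds")},
)


def apply_expansions(tokens, nsw_dicts):
    """Replace each NSW in tokens with its expansion (hypothetical helper)."""
    expansions = {}
    for nsw_dict in nsw_dicts:
        # Each value is (original token, tag, subtag, expansion).
        for index, (_token, _tag, _subtag, expansion) in nsw_dict.items():
            expansions[index] = expansion
    return [expansions.get(i, tok) for i, tok in enumerate(tokens)]


normalised = apply_expansions(tokens, nsws)
```

The resulting list matches the output of the normalise example earlier in this section, which shows that the index keys refer to positions in the input token list.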

Authors

License

This project is licensed under the terms of the GNU General Public License.

Please see LICENSE.txt for more information.

Acknowledgements

This project builds on the work described in Sproat et al. (2001).

We would like to thank Andrew Caines and Paula Buttery for supervising us during this project.