GitHub - EFord36/normalise at 3dbb8b53866fd014228dc3e83fb230ffc5d988b5

A module for normalising text.

Introduction

This module takes a text as input and returns it in a fully normalised form, i.e. it expands everything that is not in a standard, readable format. Non-standard words (NSWs) are detected, classified and expanded. Examples of NSWs that are normalised include:

Numbers - percentages, dates, currency amounts, ranges, telephone numbers.
Abbreviations and acronyms.
Web addresses and hashtags.

Table of Contents

1. Installation
2. Usage
   i. Customise to your variety
   ii. Input your own abbreviation dictionary
3. Example
4. Authors
5. License
6. Acknowledgements

1. Installation

To install the module (on Windows, Mac OS X, Linux, etc.), first make sure that you have the latest versions of pip and setuptools, then install the package:

$ pip install --upgrade pip setuptools

$ pip install normalise

If pip installation fails, you can try easy_install normalise.
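
To confirm that the installation worked, one option is to check from Python whether the package can be found. The snippet below is a minimal sketch that uses only the standard library; is_installed is a hypothetical helper name, not part of this module.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located on the import path."""
    # find_spec returns None when no top-level package of that name is found.
    return importlib.util.find_spec(package_name) is not None


print("normalise installed:", is_installed("normalise"))
```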

2. Usage

Your input text can be a list of words, or a string.

To normalise your text, use the normalise function. This will return the text with NSWs replaced by their expansions:

from normalise import normalise

text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]

normalise(text, verbose=True)

Out:
['On',
 'the',
 'twenty-eighth of',
 'April',
 'twenty ten',
 ',',
 'Doctor',
 'Banks',
 'bought',
 'a',
 'chair',
 'for',
 'thirty five pounds',
 '.']

verbose=True displays the stages of the normalisation process, so you can monitor its progress. To turn this off, use verbose=False.

If your input is a string, you can use our basic tokenizer, tokenize_basic. For best results, pass your own custom tokenizer through the tokenizer argument instead.

normalise(text, tokenizer=tokenize_basic, verbose=True)
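
A custom tokenizer could look something like the following minimal sketch. It assumes that the tokenizer argument accepts any function that takes a string and returns a list of tokens; simple_tokenize is a hypothetical name and is not part of this module. Full stops are left attached to words so that abbreviations such as "Apr." and "Dr." stay in one piece.

```python
import re

# Trailing punctuation that is split off into its own tokens. The full stop is
# deliberately excluded so abbreviations such as "Apr." or "Dr." stay intact.
_TRAILING = re.compile(r"^(.*?)([,;:!?]+)$")


def simple_tokenize(text):
    """Hypothetical tokenizer: whitespace split plus trailing punctuation."""
    tokens = []
    for chunk in text.split():
        match = _TRAILING.match(chunk)
        if match and match.group(1):
            tokens.append(match.group(1))
            # Each trailing punctuation character becomes a separate token.
            tokens.extend(match.group(2))
        else:
            tokens.append(chunk)
    # Treat a full stop at the very end of the text as sentence punctuation.
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens


sentence = "On the 28 Apr. 2010, Dr. Banks bought a chair for £35."
tokens = simple_tokenize(sentence)
print(tokens)

# Assumed usage, based on the tokenizer keyword shown above:
# normalise(sentence, tokenizer=simple_tokenize, verbose=False)
```

This sketch only treats a full stop as sentence punctuation at the very end of the input, so multi-sentence text would need a more careful rule.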

In order to see a list of all NSWs in your text, along with their index, tags, and expansion, use the list_NSWs function:

list_NSWs(text)

Out:
({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
  6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
 {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
  4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
  12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})
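
The result can be flattened into a single list ordered by token index, which is often easier to inspect. The sketch below assumes the return value is a tuple of dictionaries shaped like the output shown above; the data is copied from that output rather than produced by calling the module, and flatten_nsws is a hypothetical helper name.

```python
# Data copied from the example output above (assumed shape: a tuple of dicts
# mapping token index -> (token, tag, subtag, expansion)).
result = (
    {3: ("Apr.", "ALPHA", "EXPN", "April"),
     6: ("Dr.", "ALPHA", "EXPN", "Doctor")},
    {2: ("28", "NUMB", "NORD", "twenty-eighth of"),
     4: ("2010", "NUMB", "NYER", "twenty ten"),
     12: ("£35", "NUMB", "MONEY", "thirty five pounds")},
)


def flatten_nsws(nsw_groups):
    """Merge the per-category dicts and return rows sorted by token index."""
    merged = {}
    for group in nsw_groups:
        merged.update(group)
    return [(index,) + merged[index] for index in sorted(merged)]


rows = flatten_nsws(result)
for index, token, tag, subtag, expansion in rows:
    print(index, token, tag, subtag, expansion)
```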

i. Customise to your variety

In order to customise normalisation to your variety of English, use variety="BrE" for British English, or variety="AmE" for American English:

text = ["On", "10/04", ",", "he", "went", "to", "the", "seaside", "."]

normalise(text, variety="BrE")
Out: ['On', 'the tenth of April', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

normalise(text, variety="AmE")
Out: ['On', 'the fourth of October', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

If a variety is not specified, our default is British English.
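
To illustrate why the variety matters, the sketch below reads a short date day-first for British English and month-first for American English. It is a hypothetical, simplified illustration that reproduces the two outputs above; it is not the module's actual implementation, and expand_short_date is an invented name.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Ordinal words for the days 1 to 31.
_BASE = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh",
         "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
         "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
         "nineteenth", "twentieth"]
ORDINALS = (_BASE + ["twenty-" + word for word in _BASE[:9]]
            + ["thirtieth", "thirty-first"])


def expand_short_date(token, variety="BrE"):
    """Hypothetical helper: expand 'NN/NN' according to the variety."""
    first, second = (int(part) for part in token.split("/"))
    if variety == "BrE":
        day, month = first, second      # day/month
    elif variety == "AmE":
        month, day = first, second      # month/day
    else:
        raise ValueError("variety must be 'BrE' or 'AmE'")
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("not a valid date for this variety: " + token)
    return "the {} of {}".format(ORDINALS[day - 1], MONTHS[month - 1])


print(expand_short_date("10/04", variety="BrE"))
print(expand_short_date("10/04", variety="AmE"))
```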

ii. Input your own abbreviation dictionary

Although our system aims to be domain-general, users can input their own dictionary of abbreviations in order to tailor the system to a specific domain. This can be done by passing a dictionary that maps each abbreviation to its expansion through the keyword argument user_abbrevs:

my_abbreviations = {"bdrm": "bedroom",
                    "KT": "kitchen",
                    "wndw": "window",
                    "ONO": "or near offer"}

text = ["4bdrm", "house", "for", "sale", ",", "£459k", "ONO"]

normalise(text, user_abbrevs=my_abbreviations)

Out:
['four bedroom',
 'house',
 'for',
 'sale',
 ',',
 'four hundred and fifty nine thousand pounds',
 'or near offer']
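
For larger domain vocabularies it may be more convenient to keep the abbreviations in a file and build the dictionary from it. The following is a minimal sketch that assumes a two-column CSV layout (abbreviation, expansion); both the layout and the load_abbreviations helper are assumptions for illustration, not features of this module.

```python
import csv
import io


def load_abbreviations(lines):
    """Build an abbreviation dict from 'abbreviation,expansion' CSV rows."""
    abbrevs = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        short, expansion = (field.strip() for field in row)
        if short and expansion:
            abbrevs[short] = expansion
    return abbrevs


# An in-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "bdrm,bedroom\n"
    "KT,kitchen\n"
    "\n"
    "wndw,window\n"
    "ONO,or near offer\n"
)
my_abbreviations = load_abbreviations(sample)
print(my_abbreviations)

# The resulting dict can then be passed on as shown above:
# normalise(text, user_abbrevs=my_abbreviations)
```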

3. Example

A further example demonstrating the expansion of more types of NSW (including abbreviations, spelling mistakes, percentage ranges, currency):

text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May", "MP", "announced",
"on", "ITV", "News", "that", "the", "rate", "of", "childhod", "obesity", "had", "risen",
"from", "7.3-9.6%", "in", "just", "3", "years", ",", "costing", "the", "Gov.", "£20m", "."]

normalise(text, verbose=True)

Out:
['On',
 'the',
 'thirteenth of',
 'February',
 'two thousand and seven',
 ',',
 'Theresa',
 'May',
 'M P',
 'announced',
 'on',
 'I T V',
 'News',
 'that',
 'the',
 'rate',
 'of',
 'childhood',
 'obesity',
 'had',
 'risen',
 'from',
 'seven point three to nine point six percent',
 'in',
 'just',
 'three',
 'years',
 ',',
 'costing',
 'the',
 'government',
 'twenty million pounds',
 '.']
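
Because the output is a list of tokens, it usually needs to be joined back into a string before it is displayed or passed to another tool. The sketch below shows one simple way to do that; join_tokens is a hypothetical helper written for this illustration, and it only handles common punctuation marks.

```python
# Punctuation tokens that attach to the preceding token without a space.
NO_SPACE_BEFORE = {",", ".", ";", ":", "!", "?"}


def join_tokens(tokens):
    """Join normalised tokens into a string, attaching punctuation."""
    text = ""
    for token in tokens:
        if text and token not in NO_SPACE_BEFORE:
            text += " "
        text += token
    return text


# The final tokens of the example output above.
tail = ["in", "just", "three", "years", ",", "costing", "the",
        "government", "twenty million pounds", "."]
print(join_tokens(tail))
```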

4. Authors

Elliot Ford - EFord36
Emma Flint - emmaflint27

5. License

This project is licensed under the terms of the GNU General Public License version 3.0 or later.

Please see LICENSE.txt for more information.

6. Acknowledgements

This project builds on the work described in Sproat et al. (2001).

We would like to thank Andrew Caines and Paula Buttery for supervising us during this project.

The font used for the logo was Anita Semi-Square by Gustavo Paz. License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
