Commit 1d76c10 (1 parent: b08a5a0), "For PyPI readme", committed by EFord36 on Sep 29, 2016.

A module for normalising text.

Introduction
------------

This module takes a text as input and returns it in a normalised form,
*i.e.* it expands all word tokens deemed not to be of a standard type.
Non-standard words (NSWs) are detected, classified and expanded.
Examples of NSWs that are normalised include:

- Numbers: percentages, dates, currency amounts, ranges, telephone
  numbers.
- Abbreviations and acronyms.
- Web addresses and hashtags.

Table of Contents
-----------------

#. `Installation <#installation>`__
#. `Usage <#usage>`__

   i. `Customise to your variety`
   ii. `Input your own abbreviation dictionary`
   iii. `Execute normalise from the command line`

#. `Example <#example>`__
#. `Authors <#authors>`__
#. `License <#license>`__
#. `Acknowledgements <#acknowledgements>`__

1. Installation
---------------

normalise requires Python 3.

To install the module (on Windows, Mac OS X, Linux, etc.), first make
sure that you have the latest versions of pip and setuptools, then
install ``normalise``:

::

    $ pip install --upgrade pip setuptools

    $ pip install normalise

If ``pip`` installation fails, you can try ``easy_install normalise``.

2. Usage
--------

Your input text can be a list of words, or a string.

To normalise your text, use the ``normalise`` function. This will return
the text with NSWs replaced by their expansions:

.. code:: python

    text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]
    normalise(text, verbose=True)

    # Out:
    ['On',
     'the',
     'twenty-eighth of',
     'April',
     'twenty ten',
     ',',
     'Doctor',
     'Banks',
     'bought',
     'a',
     'chair',
     'for',
     'thirty five pounds',
     '.']

``verbose=True`` displays the stages of the normalisation process, so
you can monitor its progress. To turn this off, use ``verbose=False``.

If your input is a string, you can use our basic tokenizer. For best
results, supply your own custom tokenizer.

.. code:: python

    normalise(text, tokenizer=tokenize_basic, verbose=True)

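As a rough illustration of what a custom tokenizer might look like, the
sketch below assumes that the ``tokenizer`` argument accepts any callable
that takes a string and returns a list of tokens. ``simple_tokenize`` is
a hypothetical helper written for this example; it is not part of
normalise.

.. code:: python

    import re


    def simple_tokenize(text):
        """Split on whitespace, then detach trailing , ; : ! ? from each chunk.

        Full stops are left attached so that abbreviations such as "Dr."
        stay in one piece. Hypothetical helper, not part of normalise.
        """
        tokens = []
        for chunk in text.split():
            # Lazy first group: everything up to the trailing punctuation run.
            match = re.fullmatch(r"(.*?)([,;:!?]*)", chunk)
            word, trailing = match.group(1), match.group(2)
            if word:
                tokens.append(word)
            # Each trailing punctuation mark becomes its own token.
            tokens.extend(trailing)
        return tokens


    tokens = simple_tokenize("On the 28 Apr. 2010, Dr. Banks bought a chair")
    print(tokens)

Under the assumption above, such a function could then be passed as
``normalise(text, tokenizer=simple_tokenize)``.
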
In order to see a list of all NSWs in your text, along with their index,
tags, and expansion, use the ``list_NSWs`` function:

.. code:: python

    list_NSWs(text)

    # Out:
    ({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
      6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
     {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
      4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
      12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})

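The output above suggests a pair of dictionaries, each mapping a token
index to a ``(token, tag, subtag, expansion)`` tuple. The sketch below
works on a literal copy of that output rather than a live call, and the
field names ``tag`` and ``subtag`` are labels chosen for this example.
It shows one way to merge the two dictionaries and list the NSWs in text
order.

.. code:: python

    # Literal copy of the list_NSWs output shown above (not a live call).
    abbreviations = {
        3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
        6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor'),
    }
    numbers = {
        2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
        4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
        12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds'),
    }


    def merge_by_index(*nsw_dicts):
        """Return (index, token, expansion) rows sorted by token index."""
        merged = {}
        for nsw_dict in nsw_dicts:
            merged.update(nsw_dict)
        return [
            (index, token, expansion)
            for index, (token, _tag, _subtag, expansion) in sorted(merged.items())
        ]


    rows = merge_by_index(abbreviations, numbers)
    for index, token, expansion in rows:
        print(f"{index:>2}  {token:<6} -> {expansion}")
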
i. Customise to your variety
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to customise normalisation to your variety of English, use
``variety="BrE"`` for British English, or ``variety="AmE"`` for American
English:

.. code:: python

    text = ["On", "10/04", ",", "he", "went", "to", "the", "seaside", "."]
    normalise(text, variety="BrE")

    # Out: ['On', 'the tenth of April', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

    normalise(text, variety="AmE")

    # Out: ['On', 'the fourth of October', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

If a variety is not specified, our default is British English.

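The difference between the two varieties comes down to field order in
numeric dates: day first for British English, month first for American
English. The following is a minimal sketch of that idea only;
``read_numeric_date`` is a hypothetical function and does not reflect
how normalise implements date expansion internally.

.. code:: python

    MONTHS = ("January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December")


    def read_numeric_date(token, variety="BrE"):
        """Return (day, month_name) for an 'NN/NN' token.

        Hypothetical helper: BrE reads day/month, AmE reads month/day.
        """
        first, second = (int(part) for part in token.split("/"))
        if variety == "BrE":
            day, month = first, second
        elif variety == "AmE":
            day, month = second, first
        else:
            raise ValueError(f"unknown variety: {variety!r}")
        return day, MONTHS[month - 1]


    print(read_numeric_date("10/04", variety="BrE"))  # (10, 'April')
    print(read_numeric_date("10/04", variety="AmE"))  # (4, 'October')
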
ii. Input your own abbreviation dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although our system aims to be domain-general, users can supply their own
dictionary of abbreviations in order to tailor it to a specific domain.
This can be done using the keyword argument ``user_abbrevs={}``:

.. code:: python

    my_abbreviations = {"bdrm": "bedroom",
                        "KT": "kitchen",
                        "wndw": "window",
                        "ONO": "or near offer"}

    text = ["4bdrm", "house", "for", "sale", ",", "£459k", "ONO"]
    normalise(text, user_abbrevs=my_abbreviations)

    # Out:
    ['four bedroom',
     'house',
     'for',
     'sale',
     ',',
     'four hundred and fifty nine thousand pounds',
     'or near offer']

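To illustrate the general idea of a user dictionary, here is a
deliberately simplified sketch: a whole-token lookup in which user
entries are layered on top of a small built-in table. Both
``DEFAULT_ABBREVS`` and ``expand_abbreviations`` are hypothetical, and
the assumption that user entries take priority is ours. Unlike
normalise, this sketch does not split mixed tokens such as ``4bdrm``.

.. code:: python

    # Hypothetical built-in entries, for illustration only.
    DEFAULT_ABBREVS = {"Dr.": "Doctor", "Apr.": "April"}


    def expand_abbreviations(tokens, user_abbrevs=None):
        """Replace whole tokens found in the combined dictionary.

        User entries override built-in ones (an assumption of this sketch).
        """
        lookup = {**DEFAULT_ABBREVS, **(user_abbrevs or {})}
        return [lookup.get(token, token) for token in tokens]


    my_abbreviations = {"bdrm": "bedroom", "ONO": "or near offer"}
    expanded = expand_abbreviations(["Dr.", "Banks", "ONO"],
                                    user_abbrevs=my_abbreviations)
    print(expanded)  # ['Doctor', 'Banks', 'or near offer']
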
iii. Execute normalise from the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From the command line, you can normalise text from a given .txt file.
Use the command ``normalise /path/to/your-file.txt``. This will print
the normalised output, as well as save it to a separate file
"your-file_normalised.txt" in the same directory as the original text.

To specify the variety as American English, use ``--AmE`` (default is
British English). For a verbose output, use ``--V``:

``$ normalise /path/to/your_file.txt --AmE --V``

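The naming pattern for the saved file can be reproduced with
``pathlib``. This is only a sketch of the pattern described above
("your-file.txt" becomes "your-file_normalised.txt" in the same
directory); ``normalised_path`` is a hypothetical helper, and how the
command-line tool derives the name internally is not shown here.

.. code:: python

    from pathlib import Path


    def normalised_path(input_path):
        """Return the sibling path with '_normalised' appended to the stem."""
        path = Path(input_path)
        return path.with_name(f"{path.stem}_normalised{path.suffix}")


    output_path = normalised_path("/path/to/your-file.txt")
    print(output_path.name)  # your-file_normalised.txt
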
3. Example
----------

A further example demonstrating the expansion of more types of NSW
(including abbreviations, spelling mistakes, percentage ranges,
currency):

.. code:: python

    text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May",
            "MP", "announced", "on", "ITV", "News", "that", "the", "rate",
            "of", "childhod", "obesity", "had", "risen", "from", "7.3-9.6%",
            "in", "just", "3", "years", ",", "costing", "the", "Gov.",
            "£20m", "."]
    normalise(text, verbose=True)

    # Out:
    ['On',
     'the',
     'thirteenth of',
     'February',
     'two thousand and seven',
     ',',
     'Theresa',
     'May',
     'M P',
     'announced',
     'on',
     'I T V',
     'News',
     'that',
     'the',
     'rate',
     'of',
     'childhood',
     'obesity',
     'had',
     'risen',
     'from',
     'seven point three to nine point six percent',
     'in',
     'just',
     'three',
     'years',
     ',',
     'costing',
     'the',
     'government',
     'twenty million pounds',
     '.']

4. Authors
----------

- **Elliot Ford** - `EFord36 <https://github.com/EFord36>`__
- **Emma Flint** - `emmaflint27 <https://github.com/emmaflint27>`__

Our system is described in detail in Emma Flint, Elliot Ford, Olivia
Thomas, Andrew Caines & Paula Buttery (2016), "A Text Normalisation
System for Non-Standard Words".

5. License
----------

This project is licensed under the terms of the GNU General Public
License version 3.0 or later.

Please see
`LICENSE.txt <https://github.com/EFord36/normalise/blob/master/LICENSE.txt>`__
for more information.

6. Acknowledgements
-------------------

This project builds on the work described in `Sproat et al
(2001) <http://www.cs.toronto.edu/~gpenn/csc2518/sproatetal01.pdf>`__.

We would like to thank Andrew Caines and Paula Buttery for supervising
us during this project.

| The font used for the logo was Anita Semi-Square by Gustavo Paz.
| License: `Attribution-ShareAlike 4.0 International (CC BY-SA
  4.0) <http://creativecommons.org/licenses/by-sa/4.0/deed.en_US>`__

.. |Title Logo| image:: logo.png