Skip to content

Commit

Permalink
Added README.rst
Browse files Browse the repository at this point in the history
For PyPI readme
  • Loading branch information
EFord36 committed Sep 29, 2016
1 parent b08a5a0 commit 1d76c10
Showing 1 changed file with 242 additions and 0 deletions.
242 changes: 242 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
A module for normalising text.

Introduction
------------

This module takes a text as input, and returns it in a normalised form,
*ie.* expands all word tokens deemed not to be of a standard type.
Non-standard words (NSWs) are detected, classified and expanded.
Examples of NSWs that are normalised include:

- Numbers - percentages, dates, currency amounts, ranges, telephone
numbers.
- Abbreviations and acronyms.
- Web addresses and hashtags.

Table of Contents
-----------------

#. `Installation <#installation>`__
#. `Usage <#usage>`__
i. `Customise to your variety`
ii. `Input your own abbreviation dictionary`
iii. `Execute normalise from the command line`
#. `Example <#example>`__
#. `Authors <#authors>`__
#. `License <#license>`__
#. `Acknowledgements <#acknowledgements>`__

1. Installation
---------------

normalise requires Python 3.

To install the module (on Windows, Mac OS X, Linux, etc.) and to ensure
that you have the latest version of pip and setuptools:

::

$ pip install --upgrade pip setuptools

$ pip install normalise

If ``pip`` installation fails, you can try ``easy_install normalise``.

2. Usage
--------

Your input text can be a list of words, or a string.

To normalise your text, use the ``normalise`` function. This will return
the text with NSWs replaced by their expansions:

.. code:: python
text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]
normalise(text, verbose=True)
Out:
['On',
'the',
'twenty-eighth of',
'April',
'twenty ten',
',',
'Doctor',
'Banks',
'bought',
'a',
'chair',
'for',
'thirty five pounds',
'.']
``verbose=True`` displays the stages of the normalisation process, so
you can monitor its progress. To turn this off, use ``verbose=False``.

If your input is a string, you can use our basic tokenizer. For best
results, input your own custom tokenizer.

.. code:: python
normalise(text, tokenizer=tokenize_basic, verbose=True)
In order to see a list of all NSWs in your text, along with their index,
tags, and expansion, use the ``list_NSWs`` function:

.. code:: python
list_NSWs(text)
Out:
({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
{2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')}
i. Customise to your variety
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to customise normalisation to your variety of English, use
``variety="BrE"`` for British English, or ``variety="AmE"`` for American
English:
.. code:: python
text = ["On", "10/04", ",", "he", "went", "to", "the", "seaside", "."]
normalise(text, variety="BrE")
Out: ['On', 'the tenth of April', ',', 'he', 'went', 'to', 'the', 'seaside', '.']
normalise(text, variety="AmE")
Out: ['On', 'the fourth of October', ',', 'he', 'went', 'to', 'the', 'seaside', '.']
If a variety is not specified, our default is British English.
ii. Input your own abbreviation dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Although our system aims to be domain-general, users can input their own
dictionary of abbreviations in order to tailor to a specific domain.
This can be done using the keyword argument ``user_abbrevs={}``:
.. code:: python
my_abbreviations = {"bdrm": "bedroom",
"KT": "kitchen",
"wndw": "window",
"ONO": "or near offer"}
text = ["4bdrm", "house", "for", "sale", ",", "£459k", "ONO"]
normalise(text, user_abbrevs=my_abbreviations)
Out:
['four bedroom',
'house',
'for',
'sale',
',',
'four hundred and fifty nine thousand pounds',
'or near offer']
```
iii. Execute normalise from the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From the command line, you can normalise text from a given .txt file. Use the command `normalise /path/to/your-file.txt`. This will print the normalised output, as well as save it to a separate file "your-file_normalised.txt" in the same directory as the original text.
To specify the variety as American English, use `--AmE` (default is British English). For a verbose output, use `--V`:
``$ normalise /path/to/your\_file.txt --AmE --V``
3. Example
----------
A further example demonstrating the expansion of more types of NSW
(including abbreviations, spelling mistakes, percentage ranges,
currency):
| \`\`\`python
| text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May",
"MP", "announced",
| "on", "ITV", "News", "that", "the", "rate", "of", "childhod",
"obesity", "had", "risen",
| "from", "7.3-9.6%", "in", "just", "3", "years", ",", "costing", "the",
"Gov.", "£20m", "."]
normalise(text, verbose=True)
| Out:
| ['On',
| 'the',
| 'thirteenth of',
| 'February',
| 'two thousand and seven',
| 'Theresa',
| 'May',
| 'M P',
| 'announced',
| 'on',
| 'I T V',
| 'News',
| 'that',
| 'the',
| 'rate',
| 'of',
| 'childhood',
| 'obesity',
| 'had',
| 'risen',
| 'from',
| 'seven point three to nine point six percent',
| 'in',
| 'just',
| 'three',
| 'years',
| ',',
| 'costing',
| 'the',
| 'government',
| 'twenty million pounds',
| '.']
| \`\`\`
4. Authors
----------
- **Elliot Ford** - `EFord36 <https://github.com/EFord36>`__
- **Emma Flint** - `emmaflint27 <https://github.com/emmaflint27>`__
Our system is described in detail in Emma Flint, Elliot Ford, Olivia
Thomas, Andrew Caines & Paula Buttery (2016) - A Text Normalisation
System for Non-Standard Words.
5. License
----------
This project is licensed under the terms of the GNU General Public
License version 3.0 or later.
Please see
`LICENSE.txt <https://github.com/EFord36/normalise/blob/master/LICENSE.txt>`__
for more information.
6. Acknowledgements
-------------------
This project builds on the work described in `Sproat et al
(2001) <http://www.cs.toronto.edu/~gpenn/csc2518/sproatetal01.pdf>`__.
We would like to thank Andrew Caines and Paula Buttery for supervising
us during this project.
| The font used for the logo was Anita Semi-Square by Gustavo Paz.
| License: `Attribution-ShareAlike 4.0 International (CC BY-SA
4.0) <http://creativecommons.org/licenses/by-sa/4.0/deed.en_US>`__
.. |Title Logo| image:: logo.png

0 comments on commit 1d76c10

Please sign in to comment.