From 1d76c107fd1960ac69081bf86e52655e0d6402d3 Mon Sep 17 00:00:00 2001
From: EFord36
Date: Thu, 29 Sep 2016 16:21:29 +0100
Subject: [PATCH] Added README.rst

For PyPI readme
---
 README.rst | 242 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 242 insertions(+)
 create mode 100644 README.rst

diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..eef6bcd
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,242 @@
+A module for normalising text.
+
+Introduction
+------------
+
+This module takes a text as input, and returns it in a normalised form,
+*i.e.* expands all word tokens deemed not to be of a standard type.
+Non-standard words (NSWs) are detected, classified and expanded.
+Examples of NSWs that are normalised include:
+
+- Numbers - percentages, dates, currency amounts, ranges, telephone
+  numbers.
+- Abbreviations and acronyms.
+- Web addresses and hashtags.
+
+Table of Contents
+-----------------
+
+#. `Installation <#installation>`__
+#. `Usage <#usage>`__
+   i. `Customise to your variety <#i-customise-to-your-variety>`__
+   ii. `Input your own abbreviation dictionary <#ii-input-your-own-abbreviation-dictionary>`__
+   iii. `Execute normalise from the command line <#iii-execute-normalise-from-the-command-line>`__
+#. `Example <#example>`__
+#. `Authors <#authors>`__
+#. `License <#license>`__
+#. `Acknowledgements <#acknowledgements>`__
+
+1. Installation
+---------------
+
+normalise requires Python 3.
+
+To install the module (on Windows, Mac OS X, Linux, etc.), first make sure
+that you have the latest versions of pip and setuptools:
+
+::
+
+    $ pip install --upgrade pip setuptools
+
+    $ pip install normalise
+
+If ``pip`` installation fails, you can try ``easy_install normalise``.
+
+2. Usage
+--------
+
+Your input text can be a list of words, or a string.
+
+To normalise your text, use the ``normalise`` function. This will return
+the text with NSWs replaced by their expansions:
+
+.. code:: python
+
+    text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]
+
+    normalise(text, verbose=True)
+
+    Out:
+    ['On',
+     'the',
+     'twenty-eighth of',
+     'April',
+     'twenty ten',
+     ',',
+     'Doctor',
+     'Banks',
+     'bought',
+     'a',
+     'chair',
+     'for',
+     'thirty five pounds',
+     '.']
+
+``verbose=True`` displays the stages of the normalisation process, so
+you can monitor its progress. To turn this off, use ``verbose=False``.
+
+If your input is a string, you can use our basic tokenizer. For best
+results, supply your own custom tokenizer.
+
+.. code:: python
+
+    normalise(text, tokenizer=tokenize_basic, verbose=True)
+
+To see a list of all NSWs in your text, along with their index,
+tags, and expansion, use the ``list_NSWs`` function:
+
+.. code:: python
+
+    list_NSWs(text)
+
+    Out:
+    ({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
+      6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
+     {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
+      4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
+      12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})
+
+
+i. Customise to your variety
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To customise normalisation to your variety of English, use
+``variety="BrE"`` for British English, or ``variety="AmE"`` for American
+English:
+
+.. code:: python
+
+    text = ["On", "10/04", ",", "he", "went", "to", "the", "seaside", "."]
+
+    normalise(text, variety="BrE")
+    Out: ['On', 'the tenth of April', ',', 'he', 'went', 'to', 'the', 'seaside', '.']
+
+    normalise(text, variety="AmE")
+    Out: ['On', 'the fourth of October', ',', 'he', 'went', 'to', 'the', 'seaside', '.']
+
+If a variety is not specified, our default is British English.
+
+ii. Input your own abbreviation dictionary
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Although our system aims to be domain-general, users can input their own
+dictionary of abbreviations in order to tailor to a specific domain.
+This can be done using the keyword argument ``user_abbrevs={}``:
+
+.. code:: python
+
+    my_abbreviations = {"bdrm": "bedroom",
+                        "KT": "kitchen",
+                        "wndw": "window",
+                        "ONO": "or near offer"}
+
+    text = ["4bdrm", "house", "for", "sale", ",", "£459k", "ONO"]
+
+    normalise(text, user_abbrevs=my_abbreviations)
+
+    Out:
+    ['four bedroom',
+     'house',
+     'for',
+     'sale',
+     ',',
+     'four hundred and fifty nine thousand pounds',
+     'or near offer']
+
+iii. Execute normalise from the command line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+From the command line, you can normalise the text of a given .txt file with the command ``normalise /path/to/your-file.txt``. This prints the normalised output and also saves it to a separate file, ``your-file_normalised.txt``, in the same directory as the original text.
+
+To specify the variety as American English, use ``--AmE`` (default is British English). For a verbose output, use ``--V``:
+
+``$ normalise /path/to/your_file.txt --AmE --V``
+
+
+3. Example
+----------
+
+A further example demonstrating the expansion of more types of NSW
+(including abbreviations, spelling mistakes, percentage ranges,
+currency):
+
+.. code:: python
+
+    text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May",
+            "MP", "announced",
+            "on", "ITV", "News", "that", "the", "rate", "of", "childhod",
+            "obesity", "had", "risen",
+            "from", "7.3-9.6%", "in", "just", "3", "years", ",", "costing", "the",
+            "Gov.", "£20m", "."]
+
+    normalise(text, verbose=True)
+
+    Out:
+    ['On',
+     'the',
+     'thirteenth of',
+     'February',
+     'two thousand and seven',
+     ',',
+     'Theresa',
+     'May',
+     'M P',
+     'announced',
+     'on',
+     'I T V',
+     'News',
+     'that',
+     'the',
+     'rate',
+     'of',
+     'childhood',
+     'obesity',
+     'had',
+     'risen',
+     'from',
+     'seven point three to nine point six percent',
+     'in',
+     'just',
+     'three',
+     'years',
+     ',',
+     'costing',
+     'the',
+     'government',
+     'twenty million pounds',
+     '.']
+
+4. Authors
+----------
+
+- **Elliot Ford** - `EFord36 `__
+- **Emma Flint** - `emmaflint27 `__
+
+Our system is described in detail in Emma Flint, Elliot Ford, Olivia
+Thomas, Andrew Caines & Paula Buttery (2016) - A Text Normalisation
+System for Non-Standard Words.
+
+5. License
+----------
+
+This project is licensed under the terms of the GNU General Public
+License version 3.0 or later.
+
+Please see
+`LICENSE.txt `__
+for more information.
+
+6. Acknowledgements
+-------------------
+
+This project builds on the work described in `Sproat et al
+(2001) `__.
+
+We would like to thank Andrew Caines and Paula Buttery for supervising
+us during this project.
+
+| The font used for the logo was Anita Semi-Square by Gustavo Paz.
+| License: `Attribution-ShareAlike 4.0 International (CC BY-SA
+  4.0) `__
+
+.. |Title Logo| image:: logo.png