CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
- Parallel processing of files in a directory (CLI only)
- Sentence segmentation with UAX#29 rules
- NFKC and whitespace normalization
- Removal of modifiers and marks
- Lower-case folding
- Trimming of punctuation around words
- Replace words with the `<unk>` placeholder if they meet any of the following criteria:
  - Word has an at sign `@`
  - Word lacks alphabetic characters
  - Word has two punctuation chars in a row, such as `http://`
- HTML code is parsed and CSS selectors can be used to:
  - Remove undesired elements
  - Insert newlines after paragraphs and line breaks
  - Extract the main content of an HTML document
- Text is automatically converted to UTF-8 if the original encoding is in the Encoding Standard.
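
The character and word normalization steps listed above can be approximated in a few lines of Python. This is a rough sketch built on the standard `unicodedata` module, not the tool's actual implementation: the helper names are hypothetical, the exact set of "modifiers and marks" (here the Unicode categories `M*`, `Lm` and `Sk`) is an assumption, the choice to trim punctuation before applying the `<unk>` rules is inferred from the `hello world` example in this README, and UAX#29 sentence segmentation is left out.

```python
import unicodedata

UNK = "<unk>"


def is_punct(ch):
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def normalize_chars(text, keep_marks=False, lowercase=True):
    # NFKC normalization (e.g. ligatures and full-width forms are unified).
    text = unicodedata.normalize("NFKC", text)
    if not keep_marks:
        # Decompose so combining marks become separate code points, drop
        # marks and modifiers (assumed categories), then recompose.
        decomposed = unicodedata.normalize("NFKD", text)
        kept = [
            c for c in decomposed
            if not unicodedata.category(c).startswith("M")
            and unicodedata.category(c) not in ("Lm", "Sk")
        ]
        text = unicodedata.normalize("NFKC", "".join(kept))
    if lowercase:
        text = text.lower()
    # Whitespace normalization: collapse runs of whitespace to one space.
    return " ".join(text.split())


def trim_punctuation(word):
    # Strip punctuation from both ends, leaving inner punctuation intact.
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]


def normalize_word(word):
    word = trim_punctuation(word)
    if "@" in word:
        return UNK  # word has an at sign
    if not any(c.isalpha() for c in word):
        return UNK  # word lacks alphabetic characters
    if any(is_punct(a) and is_punct(b) for a, b in zip(word, word[1:])):
        return UNK  # two punctuation chars in a row, such as "://"
    return word


def preprocess(text):
    words = normalize_chars(text).split()
    return " ".join(normalize_word(w) for w in words)
```

With these definitions, `preprocess("HELLo, WORLD!!!")` yields `hello world`, while tokens such as `user@example.com`, `1234` or `http://example.com` are each replaced by `<unk>`.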
# Install
$ cargo install corpus-preproc
# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>
    <OUTPUT>

OPTIONS:
    -c
            Clean HTML tags
    --content-selector <CONTENT_SELECTOR>
            CSS selector for main content
    --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]
    -h, --help
            Print help information
    -l
            Perform case-folding
    -m
            Keep modifiers and marks on normalization
    -n
            Perform NFKC and whitespace normalization
    --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]
    -p
            Trim punctuation surrounding words
    -t <THREADS>
            Number of threads to use [default: 4]
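
When the CLI is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch that only uses flags shown in the help output. The function names, keyword arguments and example paths are hypothetical, and it assumes `corpus-preproc` is available on `PATH`.

```python
import subprocess


def build_clean_command(input_path, output_path, *, clean_html=False,
                        normalize=False, lowercase=False, trim_punct=False,
                        keep_marks=False, threads=None,
                        content_selector=None, delete_selector=None,
                        nl_append_selector=None):
    # Build the argument list for `corpus-preproc clean`.
    cmd = ["corpus-preproc", "clean"]
    if clean_html:
        cmd.append("-c")  # clean HTML tags
    if normalize:
        cmd.append("-n")  # NFKC and whitespace normalization
    if lowercase:
        cmd.append("-l")  # case-folding
    if trim_punct:
        cmd.append("-p")  # trim punctuation surrounding words
    if keep_marks:
        cmd.append("-m")  # keep modifiers and marks
    if threads is not None:
        cmd += ["-t", str(threads)]
    if content_selector:
        cmd += ["--content-selector", content_selector]
    if delete_selector:
        cmd += ["--delete-selector", delete_selector]
    if nl_append_selector:
        cmd += ["--nl-append-selector", nl_append_selector]
    # Positional arguments go last: <INPUT> <OUTPUT>.
    cmd += [input_path, output_path]
    return cmd


def run_clean(input_path, output_path, **options):
    # Run the command and raise CalledProcessError on a non-zero exit code.
    cmd = build_clean_command(input_path, output_path, **options)
    return subprocess.run(cmd, check=True)
```

For example, `run_clean("corpus/", "corpus.txt", clean_html=True, normalize=True, lowercase=True, trim_punct=True, threads=8)` would invoke `corpus-preproc clean -c -n -l -p -t 8 corpus/ corpus.txt`. Passing the arguments as a list avoids shell quoting problems with CSS selectors that contain spaces or brackets.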
$ corpus-preproc serve 127.0.0.1:8000
The Python example below requires the `requests` library to be installed.
import requests
import json
DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}
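def clean_text(text):
    # Both parts are sent as multipart/form-data fields.
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    # The host and port must match the address passed to `corpus-preproc serve`.
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text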
clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
# The assertion message is only shown when the check fails.
assert clean == "hello world", f"unexpected output: {clean!r}"
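
To change a single setting without editing the full configuration, overrides can be merged onto a copy of the defaults before building the request. This is a sketch under a few assumptions: `BASE_CONFIG` mirrors the configuration shown above, the helper names are hypothetical, and since it is not known here whether the server accepts a partial `config`, the complete merged configuration is always sent. The `url` argument must be the `/preproc` endpoint at the address the server was started on.

```python
import copy
import json

# Mirrors the default configuration from the example above.
BASE_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, "
                          "table, tr, td, ol, ul, li, time, [aria-hidden], "
                          "img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {"enabled": True, "replacePii": True},
}


def merge_config(base, overrides):
    # Recursively merge `overrides` into a deep copy of `base`.
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def build_files(text, config):
    # Multipart fields in the same shape as the example above.
    return {
        "config": (None, json.dumps(config), "application/json"),
        "data": (None, text, "text/plain"),
    }


def clean_text_with(text, overrides, url):
    # Imported lazily so the helpers above work without `requests`.
    import requests

    config = merge_config(BASE_CONFIG, overrides)
    response = requests.post(url, files=build_files(text, config), timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of returning them
    return response.text
```

For instance, `clean_text_with(text, {"charNormalization": {"lowercase": False}}, url)` would keep the original casing while leaving every other setting at its default.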