Required Python version >= 3.6
pipenv is a utility that manages virtual environments and pip dependencies at the same time. To install it, navigate to the project's root directory and run:
pip3 install pipenv
This makes sure that pipenv uses your latest version of Python 3, which should be 3.6 or higher. Please refer to the official website for more information on pipenv.
A Makefile has been created for convenience, so that you can install the project dependencies, download the required models, and test and build the tool easily. This is the preferred environment setup approach: the Pipfile and Pipfile.lock files ensure that you automatically have access to the installed packages in requirements.txt after you run make install (see below).
Alternatively, if you are using the Anaconda distribution of Python, you can also use conda to create an environment using the following command:
conda create -n nerds python=3.6 anaconda
After you run the various make ... commands inside this environment, the packages listed in requirements.txt and the downloaded models will only be visible inside the nerds environment. Keeping them isolated this way is usually preferred since it can help prevent version collisions between different environments, at the cost of more disk space. You can enter the newly created conda environment using the following command:
conda activate nerds
You can exit the environment using the following command:
conda deactivate
To install all of the required packages for development and testing run:
make install
The tool will not run without an English language model and a tagger. To download spaCy's English language model and NLTK's default tagger, run:
make download_models
To execute the unit tests run:
make test
Code quality checks can be run with:
make lint
A wheel distribution of this tool can be created with:
make dist
NERDS is a framework that provides some NER capabilities, including the option of creating ensembles of NER models, but it is primarily made to be extended. In the following sections we take a look at the basic data exchange format, and how you can use it to create your own models.
The NERDS master project at elsevierlabs-os/nerds uses a set of custom data exchange classes: Document, Annotation, and AnnotatedDocument. That project provides a set of conversion utilities that convert the provided datasets to this format, and convert instances of these classes back to whatever format the underlying wrapped NER model needs. However, this NERDS fork at sujitpal/nerds eliminates this requirement: the internal format is just a list of lists of tokens (the words in each sentence) and a matching list of lists of BIO tags. The utility function nerds.utils.load_data_and_labels can read a file in CoNLL BIO format and convert it to this internal format. This decision was made because 3 of the 5 provided models consume the list-of-lists format natively, and the result is fewer lines of extra code and less potential for error.
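To make the internal format concrete, here is a minimal sketch of how a CoNLL BIO file could be parsed into the two parallel list-of-lists structures. It is not the implementation of nerds.utils.load_data_and_labels; the function name read_bio_file is hypothetical, and the file layout (one token and its tag per line, separated by a tab, with a blank line between sentences) is an assumption about the input.

```python
def read_bio_file(filepath, encoding="utf-8"):
    """Parse a BIO file into (data, labels), two parallel lists of lists.

    Assumed layout (not confirmed by the NERDS docs): one "token<TAB>tag"
    pair per line, with a blank line separating sentences.
    """
    data, labels = [], []
    tokens, tags = [], []
    with open(filepath, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:
                # A non-empty line carries one token and its BIO tag.
                token, tag = line.split("\t")
                tokens.append(token)
                tags.append(tag)
            elif tokens:
                # A blank line closes the current sentence.
                data.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
    if tokens:
        # Flush the last sentence if the file does not end with a blank line.
        data.append(tokens)
        labels.append(tags)
    return data, labels
```

Each inner list of data lines up position by position with the corresponding inner list of labels, which is the property the models rely on.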
In general, when given input that is not in CoNLL BIO format, the main effort in using NERDS is converting it to CoNLL BIO format. Once that is done, it is relatively easy to ingest it into a data and label structure, as shown below.
from nerds.utils import load_data_and_labels
data, labels = load_data_and_labels("nerds/test/data/example.iob")
print("data:", data)
print("labels:", labels)
yields the following output.
data: [
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov', '.', '29', '.'],
['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.', 'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
]
labels: [
['B-PER', 'I-PER', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'O'],
['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O']
]
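For input that is not yet in CoNLL BIO format, the conversion step mentioned above might look like the following sketch. It assumes the source annotations are available as token-index spans (start, end, label) with an exclusive end, and that the target file uses one tab-separated token and tag per line with a blank line between sentences. Both helper names, spans_to_bio and write_bio_file, are hypothetical and not part of NERDS.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


def write_bio_file(filepath, data, labels, encoding="utf-8"):
    """Write sentences as "token<TAB>tag" lines (assumed layout),
    with a blank line after each sentence."""
    with open(filepath, "w", encoding=encoding) as f:
        for tokens, tags in zip(data, labels):
            for token, tag in zip(tokens, tags):
                f.write(f"{token}\t{tag}\n")
            f.write("\n")


# The second sentence from the example output, annotated as spans.
tokens = ['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.',
          'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
spans = [(0, 3, "PER"), (6, 11, "ORG"), (13, 14, "NORP")]
tags = spans_to_bio(tokens, spans)
print(tags)
```

For this sentence the resulting tags match the second list of labels in the output shown earlier.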
The basic class that every model needs to extend is the NERModel class in the nerds.models package. The model class implements a fit-predict API, similar to sklearn. To implement a new model, one must implement the following methods at minimum:
- fit(X, y): Trains a model given a list of lists of tokens X and BIO tags y.
- predict(X): Returns a list of lists of BIO tags, given a list of lists of tokens X.
- save(dirpath): Saves the model to the directory given by dirpath.
- load(dirpath): Retrieves the model from the directory given by dirpath.
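The four methods above can be sketched with a toy model. The class below is hypothetical and deliberately standalone so that it runs without NERDS installed; a real model would extend nerds.models.NERModel instead, whose constructor and any further required methods are not shown here. The file name model.json and the choice to return self from fit and load are assumptions made for this sketch.

```python
import json
import os
from collections import Counter, defaultdict


class MajorityTagNER:
    """Hypothetical toy model: tags each token with the BIO tag it carried
    most often in training, and "O" for unseen tokens. In NERDS this class
    would extend nerds.models.NERModel."""

    def __init__(self):
        self.tag_by_token = {}

    def fit(self, X, y):
        # Count how often each token was seen with each tag.
        counts = defaultdict(Counter)
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                counts[token][tag] += 1
        self.tag_by_token = {
            token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()
        }
        return self  # convenience choice for this sketch

    def predict(self, X):
        # One list of tags per input sentence, aligned with its tokens.
        return [[self.tag_by_token.get(token, "O") for token in tokens]
                for tokens in X]

    def save(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        path = os.path.join(dirpath, "model.json")  # assumed file name
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.tag_by_token, f)

    def load(self, dirpath):
        path = os.path.join(dirpath, "model.json")
        with open(path, encoding="utf-8") as f:
            self.tag_by_token = json.load(f)
        return self
```

Because it tags tokens independently, this toy model can emit sequences that are not valid BIO (for example an I- tag with no preceding B- tag); it is only meant to show the shape of the interface.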
As a best practice, I like to implement a single NER model (or group of related NER models) as a single file in the models folder, but have it be accessible from client code directly as nerds.models.CustomNER. You can set this redirection up in nerds/models/__init__.py.
There are two examples of running experiments using NERDS. We will continue to update these examples as more functionality becomes available.
New models and input adapters are always welcome. Please make sure your code is well-documented and readable. Before creating a pull request, make sure that:
- make test shows that all the unit tests pass.
- make lint shows no Python code violations.
The CONTRIBUTING.md file lists the contributors to the NERDS (elsevierlabs-os/nerds) project.
The CHANGES.md file lists the changes and improvements that were made in this fork.
- [slides] Slides for talk at PyData LA 2019.
- [video] Video of talk at PyData LA 2019.
- [blog] Incorporating third party NER (Flair) into NERDS.