This repository builds a Python package that installs a pii-extract-base plugin to perform PII detection for text data using the Microsoft Presidio Python library.
The name of the plugin entry point is piisa-detectors-presidio.
The package needs:
- at least Python 3.8
- the pii-data and the pii-extract-base base packages
- the presidio-analyzer package
- an NLP engine model for the desired language
To install the plugin:
- Install the package:
  pip install pii-extract-plg-presidio
  (it will automatically install its dependencies, including presidio-analyzer)
- Download the recognition model for the desired language(s), as instructed by
the presidio-analyzer installation instructions. The default plugin
configuration file defines three spaCy models:
- English model:
python -m spacy download en_core_web_lg
- Spanish model:
python -m spacy download es_core_news_md
- Italian model:
python -m spacy download it_core_news_md
- For additional information on model specification, see customizing NLP
  models in the Presidio documentation. If custom models are used, the
  nlp_config element in the plugin configuration file must be adjusted
  accordingly.
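Before running the plugin it can help to confirm that the models are actually installed. The following is a minimal sketch, not part of this package: it relies on the fact that spaCy models installed through pip are importable Python packages, and the helper name `missing_models` and the language codes used as dictionary keys are illustrative assumptions.

```python
from importlib.util import find_spec

# spaCy models listed above as the defaults in the plugin configuration.
# The language codes used as keys are labels chosen for this sketch.
DEFAULT_MODELS = {
    "en": "en_core_web_lg",
    "es": "es_core_news_md",
    "it": "it_core_news_md",
}


def missing_models(models=DEFAULT_MODELS):
    """Return {language: model} for every model that cannot be imported."""
    return {
        lang: name for lang, name in models.items() if find_spec(name) is None
    }


if __name__ == "__main__":
    # Print the download command for each model that is not installed
    for lang, name in missing_models().items():
        print(f"{lang}: run 'python -m spacy download {name}'")
```

If custom models are configured through nlp_config, the same check applies once the dictionary is replaced with the custom model names.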
The package does not have any user-facing entry points (except for one console information script, see below). Instead, upon installation it defines a plugin entry point. This plugin is automatically picked up by the scripts and classes in pii-extract-base, and thus its functionality is exposed to them.
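To check that the plugin entry point has been registered after installation, the standard library can list installed entry points. This is a rough sketch using only `importlib.metadata`; it searches every entry point group for the name given above rather than assuming which group pii-extract-base reads, and the helper name `find_entry_points` is hypothetical.

```python
from importlib.metadata import entry_points


def find_entry_points(name):
    """Return (group, name, value) for every installed entry point called `name`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable interface
        candidates = eps.select(name=name)
    else:
        # Python 3.8 / 3.9: plain dict of group -> list of entry points
        candidates = [
            ep for group in eps.values() for ep in group if ep.name == name
        ]
    return [(ep.group, ep.name, ep.value) for ep in candidates]


if __name__ == "__main__":
    found = find_entry_points("piisa-detectors-presidio")
    if not found:
        print("plugin entry point not found; is the package installed?")
    for group, name, value in found:
        print(f"{group}: {name} -> {value}")
```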
Runtime behaviour is governed by a configuration file, which determines which Presidio recognizers are instantiated and used. The configuration also defines which languages are available for detection, although the plugin can be initialized with only a subset of those languages.
The task created from the plugin is a standard PII task object, using the
pii_extract.build.task.MultiPiiTask class definition. Like all PII task
objects, it is called with a DocumentChunk object containing the data to
analyze. The chunk must contain a language specification in its metadata, so
that Presidio knows which language to use. The one exception is a plugin task
built with only one language: in that case, a chunk without a language
specification is processed using that single language.
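The language rule described above can be sketched as a small standalone function. This is an illustration of the rule only, not the plugin's actual code: the name `resolve_language` is hypothetical, and raising an error for a language the task was not built with is an assumption, since that case is not described here.

```python
from typing import Optional, Sequence


def resolve_language(chunk_lang: Optional[str], task_langs: Sequence[str]) -> str:
    """Pick the language for a chunk, following the rule described above."""
    if chunk_lang:
        # Assumption: a language the task was not built with is rejected
        if chunk_lang not in task_langs:
            raise ValueError(f"language not available in this task: {chunk_lang}")
        return chunk_lang
    if len(task_langs) == 1:
        # Single-language task: fall back to that language
        return task_langs[0]
    raise ValueError("chunk has no language and the task handles several")


if __name__ == "__main__":
    print(resolve_language("es", ["en", "es"]))  # language taken from the chunk
    print(resolve_language(None, ["en"]))        # single-language fallback
```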
pii-extract-presidio-info is a command-line script which provides
information about the plugin capabilities:
- version: installed package versions
- presidio-recognizers: list of recognizers in Presidio
- presidio-entities: the total list of entities Presidio can generate
- pii-entities: the PIISA tasks that this plugin will create, by translating from the entities detected by Presidio (this depends on the PIISA config used)
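The information script can also be called from Python, for instance in a diagnostics step. The sketch below assumes that the sub-command is passed as the first positional argument and that no further options are required; both are assumptions, and the helper names are hypothetical.

```python
import shutil
import subprocess

SCRIPT = "pii-extract-presidio-info"
# Sub-commands listed above
SUBCOMMANDS = ("version", "presidio-recognizers", "presidio-entities", "pii-entities")


def build_info_command(subcommand):
    """Build the argument list for one sub-command of the information script."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown sub-command: {subcommand}")
    # Assumption: the sub-command is the first positional argument
    return [SCRIPT, subcommand]


def run_info(subcommand):
    """Run the script and return its output, or None if it is not on PATH."""
    cmd = build_info_command(subcommand)
    if shutil.which(SCRIPT) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    output = run_info("version")
    print(output if output is not None else f"{SCRIPT} is not installed")
```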
The provided Makefile can be used to process the package:
- make pkg will build the Python package, creating a file that can be installed with pip
- make unit will launch all unit tests (using pytest, so pytest must be available)
- make install will install the package in a Python virtualenv. The virtualenv will be chosen as, in this order:
  - the one defined in the VENV environment variable, if it is defined
  - if there is a virtualenv activated in the shell, it will be used
  - otherwise, a default is chosen as /opt/venv/pii (it will be created if it does not exist)
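The order of precedence used to choose the virtualenv can be expressed as a short function. This is a sketch of the rule only, not the Makefile logic itself: it assumes an activated virtualenv is detected through the VIRTUAL_ENV variable (which the standard activation scripts set), treats an empty variable as undefined, and uses the hypothetical name `choose_venv`.

```python
import os

DEFAULT_VENV = "/opt/venv/pii"


def choose_venv(environ=None):
    """Return the virtualenv path following the order of precedence above."""
    env = os.environ if environ is None else environ
    # 1. VENV, 2. currently activated virtualenv, 3. default location.
    # Empty values are treated as undefined (an assumption of this sketch).
    return env.get("VENV") or env.get("VIRTUAL_ENV") or DEFAULT_VENV


if __name__ == "__main__":
    print(choose_venv())
```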