pii-extract plugin for PII Detection wrapping the Presidio library

piisa/pii-extract-plg-presidio

Pii Extractor plugin: Presidio

This repository builds a Python package that installs a pii-extract-base plugin to perform PII detection for text data using the Microsoft Presidio Python library.

The name of the plugin entry point is piisa-detectors-presidio.

Requirements

The package needs

Installation

  • Install the package: pip install pii-extract-plg-presidio (it will automatically install its dependencies, including presidio-analyzer)
  • Download the recognition model for each desired language, as described in the presidio-analyzer installation instructions. The default plugin configuration file defines three spaCy models:
    • English model: python -m spacy download en_core_web_lg
    • Spanish model: python -m spacy download es_core_news_md
    • Italian model: python -m spacy download it_core_news_md
  • For additional information on model specification, see customizing NLP models in the Presidio documentation. If custom models are used, the nlp_config element in the plugin configuration file must be adjusted accordingly.
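
Before running the plugin it can help to check that the models named above are actually installed. The following is a minimal sketch, not part of the package: it relies on spaCy models being installed as importable Python packages, and it also shows the NLP configuration dictionary format documented by Presidio. How that dictionary is nested inside the plugin's nlp_config element is not shown here and should be taken from the plugin configuration file itself; the helper names are hypothetical.

```python
import importlib.util

# Language code -> spaCy model, as listed for the default plugin configuration.
DEFAULT_MODELS = {
    "en": "en_core_web_lg",
    "es": "es_core_news_md",
    "it": "it_core_news_md",
}


def missing_models(models=DEFAULT_MODELS):
    """Return the {lang: model} entries whose model package is not importable."""
    return {
        lang: name
        for lang, name in models.items()
        if importlib.util.find_spec(name) is None
    }


def presidio_nlp_configuration(models=DEFAULT_MODELS):
    """Build an NLP configuration dict in the format documented by Presidio."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


# Report which downloads are still pending in the current environment.
for lang, name in missing_models().items():
    print(f"[{lang}] missing model, install with: python -m spacy download {name}")
```

If custom models are used, the same mapping can be passed to both helpers so that the check and the configuration stay in sync.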

Usage

The package does not have any user-facing entry points (except for one console information script, see below). Instead, upon installation it defines a plugin entry point. This plugin is automatically picked up by the scripts and classes in pii-extract-base, and thus its functionality is exposed to them.
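
To illustrate how a plugin entry point is discovered, here is a small sketch using the standard library's importlib.metadata. The entry point name comes from this README; the group name "pii_extract.plugins" is an assumption made for illustration and should be checked against pii-extract-base. This is not how pii-extract-base itself is necessarily implemented.

```python
from importlib.metadata import entry_points

PLUGIN_NAME = "piisa-detectors-presidio"  # entry point name stated in this README
PLUGIN_GROUP = "pii_extract.plugins"      # ASSUMPTION: group name, verify in pii-extract-base


def find_entry_points(group, name=None):
    """List installed entry points in a group, optionally filtered by name."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable entry points
        candidates = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: plain dict keyed by group name
        candidates = eps.get(group, [])
    return [ep for ep in candidates if name is None or ep.name == name]


found = find_entry_points(PLUGIN_GROUP, PLUGIN_NAME)
print("plugin registered" if found else "plugin not found in this environment")
```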

Runtime behaviour is governed by a configuration file, which determines which Presidio recognizers are instantiated and used. The configuration also defines which languages are available for detection, although the plugin can be initialized with only a subset of those languages.

The task created by the plugin is a standard PII task object, using the pii_extract.build.task.MultiPiiTask class definition. Like all PII task objects, it is called with a DocumentChunk object containing the data to analyze. The chunk must specify its language in its metadata, so that Presidio knows which language to use. The one exception is a plugin task built with a single language: if the chunk carries no language specification, the task uses that single language.
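
The language rule described above can be sketched as a small standalone function. This is only an illustration of the decision logic, not the plugin's code: the function name and arguments are hypothetical, and raising an error for a language the task was not built with is an assumption, since this README does not say how that case is handled.

```python
from typing import Iterable, Optional


def resolve_language(chunk_lang: Optional[str], task_langs: Iterable[str]) -> str:
    """Pick the language to analyze a chunk with.

    chunk_lang: language found in the chunk metadata, or None if absent.
    task_langs: languages the task was built with.
    """
    langs = list(task_langs)
    if chunk_lang is not None:
        # ASSUMPTION: an unsupported language is reported as an error
        if chunk_lang not in langs:
            raise ValueError(f"language {chunk_lang!r} not available in {langs}")
        return chunk_lang
    if len(langs) == 1:
        # Single-language task: fall back to that language
        return langs[0]
    raise ValueError("chunk has no language and the task handles several")


print(resolve_language("es", ["en", "es", "it"]))
print(resolve_language(None, ["en"]))
```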

info script

pii-extract-presidio-info is a command-line script which provides information about the plugin capabilities:

  • version: installed package versions
  • presidio-recognizers: list of recognizers in Presidio
  • presidio-entities: the total list of entities Presidio can generate
  • pii-entities: the PIISA tasks that this plugin will create, by translating from the entities detected by Presidio (this depends on the PIISA config used)
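
The script can also be driven from Python, for instance in an installation check. The sketch below assumes that each item listed above is passed as the first positional argument of the script; this README does not show the exact command-line syntax, so verify it against the script's own help output. The helper names are hypothetical.

```python
import shutil
import subprocess

SCRIPT = "pii-extract-presidio-info"
INFO_ITEMS = ("version", "presidio-recognizers", "presidio-entities", "pii-entities")


def info_command(item):
    """Build the command line for one information item.

    ASSUMPTION: the item is given as a positional subcommand.
    """
    if item not in INFO_ITEMS:
        raise ValueError(f"unknown item {item!r}, expected one of {INFO_ITEMS}")
    return [SCRIPT, item]


def run_info(item):
    """Run the script and return its standard output, or None if not installed."""
    if shutil.which(SCRIPT) is None:
        return None
    result = subprocess.run(
        info_command(item), capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling run_info("version") returns None when the package is not installed in the current environment, which makes the helper safe to use in a pre-flight check.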

Building

The provided Makefile can be used to process the package:

  • make pkg will build the Python package, creating a file that can be installed with pip
  • make unit will launch all unit tests (using pytest, so pytest must be available)
  • make install will install the package in a Python virtualenv. The virtualenv is chosen in this order:
    • the one defined in the VENV environment variable, if it is defined
    • the virtualenv currently activated in the shell, if there is one
    • otherwise, the default /opt/venv/pii (it will be created if it does not exist)
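
The selection order above can be expressed as a short function, which may be useful for predicting where make install will put the package. This is a sketch of the documented order, not the Makefile's actual logic: in particular, detecting an activated virtualenv through the VIRTUAL_ENV variable (which virtualenv activation scripts set) is an assumption, and the function name is hypothetical.

```python
import os

DEFAULT_VENV = "/opt/venv/pii"  # default location stated in this README


def choose_venv(environ=os.environ):
    """Return the virtualenv path following the documented order of precedence."""
    # 1. Explicit VENV variable (an empty value is treated here as undefined)
    if environ.get("VENV"):
        return environ["VENV"]
    # 2. ASSUMPTION: an activated virtualenv is detected through VIRTUAL_ENV
    if environ.get("VIRTUAL_ENV"):
        return environ["VIRTUAL_ENV"]
    # 3. Fallback default
    return DEFAULT_VENV


print(choose_venv())
```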
