Most of Presidio's services are written in Go. The presidio-analyzer module, in charge of detecting entities in text, is written in Python. This document details the setup required for developing Presidio.
- Install Go 1.11 and Python 3.7

- Install the Go packages via `dep`:

  ```sh
  dep ensure
  ```

- Install the tesseract OCR framework (optional, needed only for image anonymization)

- Build and install re2 (optional; Presidio will use `regex` instead of `pyre2` if `re2` is not installed):

  ```sh
  re2_version="2018-12-01"
  wget -O re2.tar.gz https://github.com/google/re2/archive/${re2_version}.tar.gz
  mkdir re2
  tar --extract --file "re2.tar.gz" --directory "re2" --strip-components 1
  cd re2 && make install
  ```
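The fallback just described can be sketched in Python. The `compile_pattern` helper is mine, and I add a final fallback to the stdlib `re` module so the sketch runs even where neither engine is installed:

```python
# Prefer the pyre2 bindings when the native re2 library is installed,
# otherwise fall back to the pure-Python `regex` module, as described above.
# The final stdlib `re` fallback is only here so this sketch runs anywhere.
try:
    import re2 as re_engine        # pyre2 bindings; importable only when re2 is built
except ImportError:
    try:
        import regex as re_engine  # pure-Python fallback
    except ImportError:
        import re as re_engine     # stdlib last resort (sketch only)

def compile_pattern(pattern: str):
    """Compile a pattern with whichever engine is available; all expose compile()."""
    return re_engine.compile(pattern)
```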
- Install pipenv

  Pipenv is a Python workflow manager that handles dependencies and virtual environments for Python packages; it is used as the dependency manager in Presidio's analyzer project.

  ```sh
  pip3 install --user pipenv
  ```

  or, on macOS:

  ```sh
  brew install pipenv
  ```

  Additional installation instructions: https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv
- Create a virtualenv for the project and install all requirements in the Pipfile, including dev requirements. In the `presidio-analyzer` folder, run:

  ```sh
  pipenv install --dev --sequential --skip-lock
  ```
- Download the spaCy model:

  ```sh
  pipenv run python -m spacy download en_core_web_lg
  ```
- Run all tests:

  ```sh
  pipenv run pytest
  ```
- To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:

  ```sh
  pipenv run flake8 analyzer --exclude "*pb2*.py"
  pipenv run pylint analyzer
  pipenv run pip freeze
  ```
- Alternatively, start a shell inside the virtual env:

  ```sh
  pipenv shell
  ```

  and run commands directly in that shell:

  ```sh
  pytest
  pylint analyzer
  pip freeze
  ```
- To use presidio-analyzer as a Python library, see Installing presidio-analyzer as a standalone Python package.
- To add new recognizers in order to support new entities, see Adding new custom recognizers.
- Installing and building the entire Presidio solution is currently not supported on Windows. However, building the individual Docker images, or the Python package for detecting entities (presidio-analyzer), is possible on Windows. See here.
- Build the bins with:

  ```sh
  make build
  ```

- Build the base containers with:

  ```sh
  make docker-build-deps DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_DEPS_LABEL=${PRESIDIO_DEPS_LABEL}
  ```

  (If you do not specify a valid, logged-in registry, a warning will be echoed to standard output.)

- Build the Docker images with:

  ```sh
  make docker-build DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_DEPS_LABEL=${PRESIDIO_DEPS_LABEL} PRESIDIO_LABEL=${PRESIDIO_LABEL}
  ```

- Push the Docker images with:

  ```sh
  make docker-push DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL}
  ```

- Run the tests with:

  ```sh
  make test
  ```

- Adding a Go file requires running `make go-format` before running and building the service.

- Run functional tests with:

  ```sh
  make test-functional
  ```

- Updating Python dependencies instructions

- Note: these steps are verified on every pull request validation to a Presidio branch. Do not alter this document without referring to the implemented steps in the pipeline.
| Variable | Default | Description |
| --- | --- | --- |
| `GRPC_PORT` | `3001` | GRPC listen port (analyzer service) |
| `GRPC_PORT` | `3002` | GRPC listen port (anonymizer service) |
| `WEB_PORT` | `8080` | HTTP listen port |
| `REDIS_URL` | `localhost:6379` | Optional: Redis address |
| `ANALYZER_SVC_ADDRESS` | `localhost:3001` | Analyzer address |
| `ANONYMIZER_SVC_ADDRESS` | `localhost:3002` | Anonymizer address |
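A minimal sketch of how a service might read these settings from the environment. The `ServiceConfig` class and `load_config` helper are illustrative, not part of Presidio; the defaults are the ones listed above:

```python
import os
from dataclasses import dataclass

# Illustrative only: read the settings from the table above, using the
# documented values as defaults when a variable is unset.
@dataclass
class ServiceConfig:
    grpc_port: int
    web_port: int
    redis_url: str
    analyzer_svc_address: str
    anonymizer_svc_address: str

def load_config() -> ServiceConfig:
    env = os.environ
    return ServiceConfig(
        grpc_port=int(env.get("GRPC_PORT", "3001")),
        web_port=int(env.get("WEB_PORT", "8080")),
        redis_url=env.get("REDIS_URL", "localhost:6379"),
        analyzer_svc_address=env.get("ANALYZER_SVC_ADDRESS", "localhost:3001"),
        anonymizer_svc_address=env.get("ANONYMIZER_SVC_ADDRESS", "localhost:3002"),
    )
```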
Developing presidio as a whole on Windows is currently not supported. However, it is possible to run and test the presidio-analyzer module, in charge of detecting entities in text, on Windows using Docker:
- Run the core services Presidio needs to operate locally (create the Docker network first with `docker network create testnetwork` if it does not exist yet):

  ```sh
  docker run --rm --name test-redis --network testnetwork -d -p 6379:6379 redis
  docker run --rm --name test-presidio-anonymizer --network testnetwork -d -p 3001:3001 -e GRPC_PORT=3001 mcr.microsoft.com/presidio-anonymizer:latest
  docker run --rm --name test-presidio-recognizers-store --network testnetwork -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=test-redis:6379 mcr.microsoft.com/presidio-recognizers-store:latest
  ```
- Navigate to `<Presidio folder>/presidio-analyzer`

- Install the Python packages if you haven't done so yet:

  ```sh
  pipenv install --dev --sequential
  ```
- If you want to experiment with `analyze` requests, navigate into the `analyzer` folder and start serving the analyzer service:

  ```sh
  pipenv run python app.py serve --grpc-port 3000
  ```
- In a new `pipenv shell` window you can run `analyze` requests, for example:

  ```sh
  pipenv run python app.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
  ```
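If you want to script many such requests, the CLI invocation above can be wrapped from Python. A sketch: the helper names are mine, and it assumes you run it from the `presidio-analyzer` folder with the service serving on port 3000:

```python
import subprocess

def build_analyze_cmd(text, fields, grpc_port=3000):
    """Build the CLI invocation shown above for one analyze request."""
    return [
        "pipenv", "run", "python", "app.py", "analyze",
        "--text", text,
        "--fields", *fields,
        "--grpc-port", str(grpc_port),
    ]

def analyze(text, fields, grpc_port=3000):
    """Run one analyze request and return the CLI's stdout."""
    result = subprocess.run(build_analyze_cmd(text, fields, grpc_port),
                            capture_output=True, text=True)
    return result.stdout

# Example (requires the analyzer service to be serving):
# print(analyze("John Smith drivers license is AC432223",
#               ["PERSON", "US_DRIVER_LICENSE"]))
```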
- Edit `post.lua` and change the template name.

- Run wrk:

  ```sh
  wrk -t2 -c2 -d30s -s post.lua http://<api-service-address>/api/v1/projects/<my-project>/analyze
  ```
- If deploying from a private registry, verify that Kubernetes has access to the Docker registry.

- If using a Kubernetes secret to manage the registry authentication, make sure it is registered under the `presidio` namespace.

- Edit `charts/presidio/values.yaml` to:
  - Set the secret name (for private registries)
  - Change the Presidio services version
  - Change the default scale
- The NLP engines deployed are set on startup based on the yaml configuration files in `presidio-analyzer/conf/`. The default NLP engine is the large English spaCy model (`en_core_web_lg`), set in `default.yaml`.
- The format of the yaml file is as follows:

  ```yaml
  nlp_engine_name: spacy # {spacy, stanza}
  models:
    -
      lang_code: en # code corresponds to `supported_language` in any custom recognizers
      model_name: en_core_web_lg # the name of the SpaCy or Stanza model
    -
      lang_code: de # more than one model is optional, just add more items
      model_name: de
  ```
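Such a file, once parsed (e.g. with `yaml.safe_load`), yields a plain dictionary. A sketch of validating that structure; the `validate_nlp_config` function is mine, not Presidio's loader:

```python
# Validate a parsed nlp engine configuration with the shape shown above.
# Illustrative only; Presidio's own loading code may differ.
SUPPORTED_ENGINES = {"spacy", "stanza"}

def validate_nlp_config(conf: dict) -> dict:
    engine = conf.get("nlp_engine_name")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported nlp engine: {engine!r}")
    models = conf.get("models") or []
    if not models:
        raise ValueError("at least one model entry is required")
    for model in models:
        missing = {"lang_code", "model_name"} - set(model)
        if missing:
            raise ValueError(f"model entry missing keys: {missing}")
    return conf
```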
- By default, the `load_predefined_recognizers` method of the `RecognizerRegistry` class is called to load language-specific and language-agnostic recognizers.
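To illustrate how `lang_code` in the configuration lines up with a recognizer's `supported_language`, here is a toy registry sketch. This class is illustrative only and is not Presidio's `RecognizerRegistry`:

```python
# Toy illustration of the lang_code <-> supported_language relationship
# described above. NOT Presidio's RecognizerRegistry, just a sketch.
class ToyRegistry:
    def __init__(self):
        self._recognizers = []  # list of (supported_language, name) pairs

    def add(self, name: str, supported_language: str):
        self._recognizers.append((supported_language, name))

    def get_for_language(self, lang_code: str):
        """Return recognizer names whose supported_language matches lang_code."""
        return [name for lang, name in self._recognizers if lang == lang_code]
```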
- Downloading additional engines:

  - spaCy NLP models: models download page
  - Stanza NLP models: models download page

  ```sh
  # download models - tl;dr
  # spacy
  python -m spacy download en_core_web_lg
  # stanza
  python -c 'import stanza; stanza.download("en");'
  ```