GDPR Pseudonymization Examples

This example supports the blog post Pseudonymization and the Elastic Stack - Part 1, focusing on the pseudonymization of data using the Logstash Fingerprint filter and Ruby filter.

To provide a convenient means of running the example we utilise a dockerised approach. This example includes two approaches to pseudonymizing field data:

  1. The first approach demonstrates pseudonymization using pure Logstash configuration and the existing Logstash Fingerprint filter.
  2. The second approach, as discussed in the blog, aims to simplify the configuration using a generic, file-based Ruby script.

These two approaches are deployed as separate Logstash pipelines, both accepting data through a tcp input on different ports. For both examples we pseudonymize the username and ip fields, although this can be reconfigured by the user.
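For orientation, minimal sketches of the two filter configurations follow. The Fingerprint filter options shown are real options of that filter; the script path and the script_params names for the Ruby approach are illustrative - see logstash_fingerprint.conf and logstash_script_fingerprint.conf for the actual configurations.

filter {
  # Approach 1: keyed SHA256 (HMAC) of the field value, written back in
  # place - one fingerprint block per field to be pseudonymized
  fingerprint {
    method => "SHA256"
    source => "username"
    target => "username"
    key => "${FINGERPRINT_KEY}"
  }
}

filter {
  # Approach 2: the same hashing delegated to a generic file-based script;
  # the script_params names here are hypothetical stand-ins for whatever
  # pseudonymise.rb actually expects
  ruby {
    path => "/usr/share/logstash/pipeline/pseudonymise.rb"
    script_params => { "fields" => ["username", "ip"], "key" => "${FINGERPRINT_KEY}" }
  }
}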

Versions

This example has been tested with the following versions:

  • Docker 17.12.0
  • Docker Compose 1.18.0
  • netcat
  • curl

Important Notes

  • The second example utilises a file-based script. This script is affected by a bug identified in the Ruby filter, due for resolution in 6.3. We therefore distribute a custom Dockerfile for Logstash which extends the 6.2.2 distribution with a fix for the filter. This will be removed on release of 6.3.
  • The second, scripted approach in no way represents a final solution - the script has no tests, doesn't handle complex data structures and only supports SHA256! It does, however, highlight one possible approach and provides a starting point for those needing to pseudonymize their data. Using a script allows the user to explore other hashing algorithms not exposed in the Fingerprint filter, or even to perform multiple hash rotations to add complexity and resilience to attack.
  • The compose file is in no way a production-ready Elasticsearch configuration. Memory and cluster configuration settings are minimal and applicable to an example only. See here for further details.
  • The approaches represent examples only. All data is indexed using the elastic user - this does not represent realistic access controls in a production environment.
  • The hash key is currently passed as an environment variable. In a production environment we would recommend storing this value in the Logstash secret store (see the sketch after this list). The key could then be rotated without restarting Logstash by adding a new key and modifying configurations - assuming automatic reload is enabled.
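A minimal sketch of the secret store recommendation, assuming a standard (non-dockerised) Logstash 6.2+ installation:

bin/logstash-keystore create
bin/logstash-keystore add FINGERPRINT_KEY

A value stored this way is still referenced as ${FINGERPRINT_KEY} in the pipeline configuration, so switching from an environment variable to the secret store requires no configuration change.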

Download Example Files

Download the following files in this repo to a local directory:

  • docker-compose.yml - compose file to run the installation
  • logstash_fingerprint.conf - logstash configuration for pseudonymization of fields using Fingerprint filter
  • logstash_script_fingerprint.conf - logstash configuration for pseudonymization of fields using a file based Ruby script
  • pipelines.yml - logstash configuration to define a separate pipeline for each of the examples
  • pseudonymise.rb - ruby script used by the above Logstash pipeline
  • Dockerfile - docker file for Logstash installation - see "Important Notes". Required for 6.2.2 only.
  • sample_docs - 100 sample docs with which to test the pipelines. The values of the username and ip fields are unique across documents.

Unfortunately, GitHub does not provide a convenient one-click option to download the entire contents of a subfolder in a repo.

You can either (a) download or clone the entire examples repo and navigate to the Miscellaneous/gdpr/pseudonymization subfolder, or (b) individually download the above files. The code below makes option (b) a little easier:

curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/docker-compose.yml
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/logstash_fingerprint.conf
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/logstash_script_fingerprint.conf
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/pipelines.yml
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/pseudonymise.rb
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/Dockerfile
curl -O https://raw.githubusercontent.com/elastic/examples/master/Miscellaneous/gdpr/pseudonymization/sample_docs
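Each line of sample_docs is a single JSON document. Judging from the indexed examples further below, a line presumably resembles the following, with username and ip still in clear text (the username value is invented here, and fields such as user_agent are omitted for brevity):

{"username": "jsmith", "ip": "174.145.248.21", "job_title": "Electrical Engineer", "city": "Komsomol’skiy", "country_code": "UZ", "country": "Uzbekistan", "gender": "Female", "title": "Mr"}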

Installation and Setup

The included compose file starts both Logstash and Elasticsearch. The former is started with two Logstash pipelines, one for each of the approaches. Each pipeline uses a tcp input on a distinct port to accept data - 5000 for the Fingerprint filter approach, 6000 for the Ruby script approach.
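The pipeline ids appear in the startup log shown in step 3 below; pipelines.yml wires each id to its configuration file, roughly as follows (the paths are illustrative - see the downloaded pipelines.yml for the actual values):

- pipeline.id: fingerprint_filter
  path.config: "/usr/share/logstash/pipeline/logstash_fingerprint.conf"
- pipeline.id: ruby_filter
  path.config: "/usr/share/logstash/pipeline/logstash_script_fingerprint.conf"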

  1. Ensure the directory containing the downloaded files is shared with Docker - for OSX and Windows see here and here respectively.

  2. Navigate to the directory containing the downloaded files and execute the following command:

    ELASTIC_PASSWORD=changeme TAG=6.2.2 docker-compose up

    Feel free to change the value of ELASTIC_PASSWORD.

  3. The following log line indicates that Logstash has started and is ready to accept data:

    logstash_1 | [2018-03-20T12:40:33,638][INFO ][logstash.agent ] Pipelines running {:count=>2, :pipelines=>["fingerprint_filter", "ruby_filter"]}

Fingerprint Filter Approach

  1. To utilise the first approach, based on the Fingerprint filter, execute the following command:

    cat sample_docs | nc localhost 5000

    This should take a few seconds to execute and index the 100 documents from the sample file.

Ruby Script File Approach

  1. To utilise the second approach, based on the file-based Ruby script, execute the following command:

    cat sample_docs | nc localhost 6000

    This should also take a few seconds to execute and index the 100 documents from the sample file.
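    To confirm indexing, a count query can be issued - expect 100 documents after running one approach, or 200 after running both:

    curl "http://localhost:9200/events/_count?pretty" -u elastic:changeme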

Inspecting and using the data

  1. Pseudonymized documents will be indexed into an events index. These can be accessed through the following query:

    curl "http://localhost:9200/events/_search?pretty" -u elastic:changeme

    An example pseudonymized document is shown below:

        {
            "_index": "events",
            "_type": "doc",
            "_id": "tQOjQ2IBED8Jv9YVVDxs",
            "_score": 1,
            "_source": {
              "host": "gateway",
              "user_agent": "Mozilla/5.0 (Macintosh; PPC Mac OS X 10_6_7) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.790.0 Safari/535.1",
              "job_title": "Electrical Engineer",
              "username": "95b88d8d477e18a8acca833e7bcbd2c5d5f646b29e2d1c9604a1d930e2f63313",
              "@timestamp": "2018-03-20T13:39:59.799Z",
              "ip": "e85022a9801b356dd8c3ed6b2e02f0061a3aeea5bbad15a9ff4aed35b5bb3a42",
              "source": "ruby_pipeline",
              "city": "Komsomol’skiy",
              "title": "Mr",
              "country_code": "UZ",
              "@version": "1",
              "gender": "Female",
              "country": "Uzbekistan",
              "port": 41126
            }
         }

    "Identity documents" (effectively key-value pair lookups), are indexed into a identities index. These can be accessed through the following query:

    curl "http://localhost:9200/identities/_search?pretty" -u elastic:changeme

      {
        "_index": "identities",
        "_type": "doc",
        "_id": "1924d02bd98a46c795cb2a925b98a22ae59c563e0de49f4ba4aa49e6cab072ad",
        "_score": 1,
        "_source": {
          "key": "1924d02bd98a46c795cb2a925b98a22ae59c563e0de49f4ba4aa49e6cab072ad",
          "value": "174.145.248.21",
          "tags": [
            "identities"
          ],
          "@timestamp": "2018-03-20T13:39:59.957Z",
          "@version": "1",
          "source": "ruby_pipeline"
        }
      } 

Data Notes

  • The data produced by both examples is identical, with the exception of the source field, which indicates the pipeline used. If you run both examples once, you will end up with a duplicate of each document in the events index - 200 documents in total, plus 100 more for each subsequent execution.
  • The identities index should always contain 200 documents no matter how many times you index the data - one document for each unique value of the username and ip fields.
  • To look up a pseudonymized value, simply perform a lookup by id on the identities index. For example, to retrieve the original value for 6efda88d5338599ef1cc29df5dad8da681984580dc1f7f495dcf17ebcf7191f8, execute:
    curl "http://localhost:9200/identities/doc/6efda88d5338599ef1cc29df5dad8da681984580dc1f7f495dcf17ebcf7191f8?pretty" -u elastic:changeme

Shutdown

Use ctrl+c to exit the compose terminal. The following command will remove all containers and associated data:

ELASTIC_PASSWORD=changeme TAG=6.2.2 docker-compose down -v

Extending and Modifying

The user may wish to modify the environment. Common requirements and approaches:

  1. Modify the hash key - this is passed to the Logstash instance through the environment variable FINGERPRINT_KEY and can be modified as required. In a production environment we would recommend storing this value in the Logstash secret store.
  2. Modify the Logstash pipelines, e.g. to pseudonymize other fields. Modify the respective .conf file as described in the blog post; a sketch follows.
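As an illustration, pseudonymizing a further field - say city - with the Fingerprint filter approach would mean adding another fingerprint block to logstash_fingerprint.conf along these lines (a sketch only; the surrounding input, output and existing filter configuration is unchanged, and the actual pipeline also emits an identity document per pseudonymized value, which this sketch omits):

fingerprint {
  method => "SHA256"
  source => "city"
  target => "city"
  key => "${FINGERPRINT_KEY}"
}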