Version 1.1
Kyle Miller - [email protected]
Anthony Wertz - [email protected]
Carnegie Mellon University - Auton Lab
NOTE: This readme is valid for the version of the software specified at the top of this readme, not necessarily for the current repository revision!
This is a service developed for the DARPA MEMEX program that performs temporal
anomaly detection on advertisements. The algorithm looks at ad volume over a specified
time range for a given target query compared with a baseline query, under the
assumption that the ratio of target to baseline ad volume is consistent across time.
Fisher's exact test is used to test this assumption, providing a time series of
p-values indicating when the assumption breaks down. This metric (or, more commonly,
`-log(p-value)`) is used to detect anomalies. The algorithm is useful for identifying
interesting changes that might be indicative of increases or decreases in supply or
demand around an event of interest.
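As a minimal sketch of the core computation, with illustrative counts, the target
and baseline volumes in the two periods form a 2x2 contingency table that can be
fed to a Fisher's exact test. Here `scipy.stats.fisher_exact` stands in for the
`fisher` package that the service actually depends on:

```python
from math import log10
from scipy.stats import fisher_exact  # stand-in for the 'fisher' package

# Illustrative counts of matching ads in each period.
target_reference, target_current = 120, 45
baseline_reference, baseline_current = 1000, 150

table = [[target_reference, target_current],
         [baseline_reference, baseline_current]]

# Two-tailed test; the service also reports left- and right-tailed p-values.
_, p_two_tail = fisher_exact(table, alternative="two-sided")
print(f"-log10(p) = {-log10(p_two_tail):.2f}")  # larger means more anomalous
```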
The service has two main components that can be built. First, there is the service itself, which can run stand-alone. Second, a docker image can be built (there is also a ready-made image on Docker Hub) that contains the service and all required supporting services.
There is nothing to build for the python service. From a bash shell on Linux or
macOS, the `start_worker` and `start_service` scripts can be used to start the
worker and REST client, or on any OS the commands can be run directly from the
terminal (see the next section). These scripts assume Celery is installed and a
RabbitMQ server is running with default guest:guest credentials.
To build the docker container, move to the `samples/docker` directory and run
the `./build-docker` script. Alternatively, you can pull the already built image
from Docker Hub with `docker pull autonlab/tad`.
Note that whether the service is run via the python scripts directly or through the docker image, an Elasticsearch database is required. Instructions on configuring TAD to find your Elasticsearch database come later in this document, but if you do not have an Elasticsearch database, you must install and configure one first. Instructions can be found on the Elasticsearch website.
The python service will attempt to connect to Celery and RabbitMQ on localhost. In
addition, the script needs a configuration file in order to access Elasticsearch:
see `config/tad.cfg.example`. The file should be named `tad.cfg` and should be
available in the `config` directory relative to the working directory. For other
settings, there are no script inputs, so if you want to change them you'll need to
modify the `tad.py` script.
The script requires some python packages: `numpy`, `fisher`, `flask`,
`flask_restful`, `celery`, and `elasticsearch` are required, all of which can be
installed via `pip`, either individually or with the `requirements.txt` file:

```
pip install -r requirements.txt
```
This will install Celery (see the Celery documentation to learn more), but to use
the service you'll also need RabbitMQ. Instructions on installing and configuring
that service can be found on its website.
To run the service, a celery worker can be spawned from within the `service`
directory:

```
celery -A tad.celery worker -l info
```

and the TAD REST client can be started:

```
python tad.py
```
These commands are in the `service/start_worker` and `service/start_service` bash
scripts, respectively. You can run those if you're using a bash terminal, e.g. on
macOS or Linux.
Once started, either service can be killed using `Ctrl+C` in the shell used to
start it. The process can also be found and killed through a process monitor like
`top` or `htop`.
By either building the docker image or pulling it from Docker Hub, you can
load everything pre-configured by simply running the container. The easiest
way to do this is using the script `samples/docker/run-docker`. In general
it should be provided with a log output directory and a configuration directory,
like this:

```
./run-docker my/log/dir my/config/dir
```
After running this, the RESTful communication port will not be 5000 on the host
machine (that is the local port inside the docker image, not the port on the docker
host). To find the port on the host, run something like

```
docker inspect autonlab-tad
```

or, more specifically,

```
docker inspect autonlab-tad | grep -n HostPort | awk '{print $3}' | sed 's/\"//g'
```

to grab the port number to use.
In `samples/client` there is a simple curl client implemented which, with
little or no error checking, will send a request and continuously monitor
the results until killed.
The general approach to a query is (see the Python sketch after this list):
- Send a request. A status message will be returned indicating that the request has been queued, along with the task ID. Once a worker process is available it will handle the request. This is a POST request to `server:port/event-report`.
- Periodically ping for a status update using the task ID provided in the response from step one. Either a status update will be returned or, if finished, the results. This is a GET request to `server:port/event-report/task-id`.
- If the results of that task are needed again at some point, just query for the progress of the same task ID.
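A minimal polling client might look like the following sketch. It assumes the
`requests` package and a server on `localhost:5000`; the endpoint paths and body
fields come from the tables below, but the response key holding the task ID is an
assumption, so inspect an actual response before relying on it:

```python
import time
import requests

BASE = "http://localhost:5000"  # assumed host and port

# Request an event report (fields are described in the tables below).
body = {
    "target-filters": {"city": "Buffalo", "state": "New York"},
    "baseline-filters": {"state": "New York"},
    "keylist": [],
    "analysis-start-date": "2015-05-01",
    "analysis-end-date": "2015-05-14",
    "current-window": 7,
    "reference-window": 91,
    "lag": 0,
}
task = requests.post(f"{BASE}/event-report", json=body).json()
task_id = task["task-id"]  # hypothetical response key

# Monitor continuously until killed, like the sample curl client.
while True:
    status = requests.get(f"{BASE}/event-report/{task_id}").json()
    print(status)
    time.sleep(5)
```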
This message must be sent in order to create an event report. An event report query compares ad volume over a specific period of time with past ad volume, for a target definition relative to a baseline definition. In other words, the report will indicate whether the ad volume of the target definition relative to the baseline definition in the period of interest is significantly different from the same ratio in the past. This is done with the Fisher exact test for independence, applied to the ad volumes in the reference and current time periods for the target and baseline definitions. The contingency table looks like this:
|          | reference | current |
|----------|:---------:|:-------:|
| target   |     #     |    #    |
| baseline |     #     |    #    |
The body contains the following fields:
Field | Values (default) | Explanation |
---|---|---|
target-filters | assoc array | Target filter(s) (key-value pairs like {"city": "Buffalo", "state": "New York"}) |
baseline-filters | assoc array | Baseline filter(s) |
keylist | string array | Search keywords |
analysis-start-date | date string | Start date of analysis period in format %Y-%m-%d |
analysis-end-date | date string | End date of analysis period |
current-window | int > 0 (7) | Size (in days) of current window |
reference-window | int > 0 (91) | Size (in days) of reference window |
lag | int >= 0 (0) | Lag (in days) of reference window behind current window |
constant-baseline | bool | Whether the baseline should be constant valued or not. |
index | string | The elastic search index to target. |
time-field | string | The field representing the date and time. |
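Putting these fields together, a request body might look like the following
(all values, including the index and time field names, are illustrative):

```json
{
  "target-filters": {"city": "Buffalo", "state": "New York"},
  "baseline-filters": {"state": "New York"},
  "keylist": ["keyword1", "keyword2"],
  "analysis-start-date": "2015-05-01",
  "analysis-end-date": "2015-05-14",
  "current-window": 7,
  "reference-window": 91,
  "lag": 0,
  "constant-baseline": false,
  "index": "ads",
  "time-field": "posttime"
}
```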
The target and baseline filters indicate the constraints to be used for defining
the target and baseline comparisons. The filters can include any fields from
the Elasticsearch database in proper JSON format. The matches are exact.
Keylist (which can be empty) is a set of keywords you wish to search for.
A result will be returned so long as it includes at least one of the keywords.
The analysis start and end dates represent the period of time you're interested
in analyzing and are represented using the `strptime` format of `%Y-%m-%d`,
so for example "2015-05-14".
The window sizes represent the windows in which counts are aggregated for the
current time period and the reference period. For example, you may compare
the seven days leading up to a date of interest (current-window = 7) to the
previous three months (reference-window = 90). Use the lag parameter if you
want to create more separation between the windows. For example, maybe you
want to compare against the same time last year: you could set the current
and reference windows to the same size and the lag to 365 - current-window
days (e.g. 358 for a seven-day window). The constant baseline, if selected,
will force the service to ignore the baseline filters and use a constant count
value (one each day) to compare against the target.
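To make the window arithmetic concrete, here is a small sketch of how the windows
might tile the timeline for one analysis day. The helper is hypothetical (the
service's internal logic may differ), but it follows the parameter semantics
described above:

```python
from datetime import date, timedelta

def window_ranges(day, current_window=7, reference_window=91, lag=0):
    """Hypothetical illustration of how the request parameters relate.

    The current window is the current_window days ending on day; the
    reference window is the reference_window days ending lag days before
    the current window starts.
    """
    cur_start = day - timedelta(days=current_window - 1)
    ref_end = cur_start - timedelta(days=lag + 1)
    ref_start = ref_end - timedelta(days=reference_window - 1)
    return (ref_start, ref_end), (cur_start, day)

# Compare a week to the same week one year earlier: lag = 365 - 7 = 358.
print(window_ranges(date(2015, 5, 14), current_window=7,
                    reference_window=7, lag=358))
```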
The index can be sent if something other than the default index specified in the configuration should be used. Similarly, time-field specifies the field that represents the date and (optionally) time to aggregate on. Time is less important as TAD currently only aggregates daily.
This message will allow querying for progress on the task. The body contains just one important field:
Field | Values | Explanation |
---|---|---|
task-id | string | ID number of the task |
The task ID is returned in the response when an event report is requested. The response to this message will include the status of the current task or, if it has finished processing, the results of the query.
When a query is run, a record is returned for each day within the analysis time period. Each record is a multi-type array containing the following fields (in order, accessed by index).
First is the date. Next are the reference period counts for the baseline, then for the target, followed by the current period counts for the baseline, then for the target. After that come three p-values from the Fisher test: the left tail, two tail, and right tail p-values. A lower tail test will yield significant p-values when there appears to be a decrease in volume in the target relative to the baseline during the current time period of interest. An upper tail test will yield significant p-values when the ad volume of the target in the current time period appears to increase. A two-tailed test will detect both of these changes but does not yield p-values as significant.
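A record could be unpacked like the following sketch, assuming the field order
described above (the index positions are an assumption; verify them against an
actual response):

```python
from math import log10

# One daily record with illustrative values, ordered as described above:
# date, reference counts (baseline, target), current counts (baseline,
# target), then the left/two/right tail p-values.
record = ["2015-05-14", 1000, 120, 150, 45, 0.9999, 0.0002, 0.0001]

(day, ref_baseline, ref_target,
 cur_baseline, cur_target, p_left, p_two, p_right) = record
print(f"{day}: -log10(two-tail p) = {-log10(p_two):.2f}")
```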
For configuring Elasticsearch on your system, refer to its online documentation. The
Elasticsearch documents should contain the time field specified in the configuration
along with any other fields used to filter the documents for baseline and target
definitions. The configuration information needs to be in a file `config/tad.cfg`
relative to the working directory from which you start the service. There is an
example of a configuration file in the repository which just needs some basic
host information and credentials.
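As a purely hypothetical sketch of what such a file might contain (the field
names here are invented for illustration; copy `config/tad.cfg.example` and
treat it as authoritative):

```
# Hypothetical contents; see config/tad.cfg.example for the real field names.
[elasticsearch]
host = localhost
port = 9200
username = guest
password = guest
index = ads
```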
For configuring Elasticsearch on the docker image, you just need to supply the
configuration directory as the second parameter of the `run-docker` script,
which is mounted to `config`.
The MIT License (MIT)
Copyright (c) 2015 Carnegie Mellon University Auton Lab
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.