
Commit

Uses cases working: window, mr, alerting, and join/groupby both
streaming and batching
nathanielc committed Oct 2, 2015
1 parent 842edf8 commit d5b80ef
Showing 92 changed files with 14,351 additions and 290 deletions.
61 changes: 35 additions & 26 deletions README.md
@@ -25,39 +25,43 @@ There are two different ways to consume Kapacitor.
* Select data from an existing InfluxDB host and save it:

```sh
-$ kapacitor record stream --host address_of_influxdb --query 'select value from cpu_idle where time > start and time < stop'
-RecordingID=2869246
+$ kapacitor record query -addr http://address_of_influxdb -query 'select value from cpu_idle where time > start and time < stop'
+b6d1de3f-b27f-4420-96ee-b0365d859d1c
```
* Or record the live stream for a bit:

```sh
-$ kapacitor start-recording
-$ sleep 60
-$ kapacitor stop-recording
-RecordingID=2869246
+$ kapacitor record stream -duration 60s
+b6d1de3f-b27f-4420-96ee-b0365d859d1c
```

4. Define a Kapacitor `streamer`. A `streamer` is an entity that defines what data should be processed and how.

```sh
-$ kapacitor define streamer \
-    --name alert_cpu_idle_any_host \
-    --script path/to/dsl/script
+$ kapacitor define \
+    -type streamer \
+    -name alert_cpu_idle_any_host \
+    -tick path/to/tick/script
```

5. Replay the recording to test the `streamer`.

```sh
-$ kapacitor replay 2869246 alert_cpu_idle_any_host
+$ kapacitor replay \
+    b6d1de3f-b27f-4420-96ee-b0365d859d1c \
+    alert_cpu_idle_any_host
```

6. Edit the `streamer` and test until it's working

```sh
-$ kapacitor define streamer \
-    --name alert_cpu_idle_any_host \
-    --script path/to/dsl/script
-$ kapacitor replay 2869246 alert_cpu_idle_any_host
+$ kapacitor define \
+    -type streamer \
+    -name alert_cpu_idle_any_host \
+    -tick path/to/tick/script
+$ kapacitor replay \
+    b6d1de3f-b27f-4420-96ee-b0365d859d1c \
+    alert_cpu_idle_any_host
```

7. Enable or push the `streamer` once you are satisfied that it is working
@@ -66,7 +70,7 @@ There are two different ways to consume Kapacitor.
$ # enable the streamer locally
$ kapacitor enable alert_cpu_idle_any_host
$ # or push the tested streamer to a prod server
-$ kapacitor push --remote address_to_remote_kapacitor alert_cpu_idle_any_host
+$ kapacitor push -remote http://address_to_remote_kapacitor alert_cpu_idle_any_host
```

# Batch workflow
@@ -80,39 +84,45 @@ There are two different ways to consume Kapacitor.
1. Define a `batcher`. Like a `streamer` a `batcher` defines what data to process and how, only it operates on batches of data instead of streams.

```sh
-$ kapacitor define batcher \
-    --name alert_mean_cpu_idle_logs_by_dc \
-    --script path/to/dsl/script
+$ kapacitor define \
+    -type batcher \
+    -name alert_mean_cpu_idle_logs_by_dc \
+    -tick path/to/tick/script
```
2. Save a batch of data for replaying using the definition in the `batcher`.

```sh
$ kapacitor record batch alert_mean_cpu_idle_logs_by_dc
-RecordingID=2869246
+b6d1de3f-b27f-4420-96ee-b0365d859d1c
```

3. Replay the batch of data to the `batcher`.

```sh
-$ kapacitor replay 2869246 alert_mean_cpu_idle_logs_by_dc
+$ kapacitor replay \
+    b6d1de3f-b27f-4420-96ee-b0365d859d1c \
+    alert_mean_cpu_idle_logs_by_dc
```

4. Iterate on the `batcher` definition until it works

```sh
-$ kapacitor define batcher \
-    --name alert_mean_cpu_idle_logs_by_dc \
-    --script path/to/dsl/script
-$ kapacitor replay 2869246 alert_mean_cpu_idle_logs_by_dc
+$ kapacitor define \
+    -type batcher \
+    -name alert_mean_cpu_idle_logs_by_dc \
+    -tick path/to/tick/script
+$ kapacitor replay \
+    b6d1de3f-b27f-4420-96ee-b0365d859d1c \
+    alert_mean_cpu_idle_logs_by_dc
```

5. Once it works, enable locally or push to remote

```sh
$ # enable the batcher locally
$ kapacitor enable alert_mean_cpu_idle_logs_by_dc
$ # or push the tested batcher to a prod server
-$ kapacitor push --remote address_to_remote_kapacitor alert_mean_cpu_idle_logs_by_dc
+$ kapacitor push -remote http://address_to_remote_kapacitor alert_mean_cpu_idle_logs_by_dc
```

# Data processing with pipelines
@@ -275,7 +285,6 @@ stream
.period(1m)
.every(1m)
.mapReduce(influxql.count, "value")
-    .where("count == 0")
.alert();

//Now define normal processing on the stream
195 changes: 195 additions & 0 deletions TUTORIAL.md
@@ -0,0 +1,195 @@
# Getting Started with Kapacitor

This document will walk you through getting started with two simple use cases of Kapacitor.


## Alert on high cpu usage

A classic example: how to get an alert when a server is overloaded.
The following walks you through getting data into Kapacitor and setting up an alert based on that stream of data, showcasing some of Kapacitor's neat features along the way.


First we need to get data into Kapacitor.
This can be done simply via [Telegraf](https://github.com/influxdb/telegraf).
Since we are concerned only with CPU right now, let's use this simple configuration:

```toml
[agent]
interval = "1s"
[outputs]
# Configuration to send data to Kapacitor.
# Since Kapacitor acts like an InfluxDB server the configuration is the same.
[outputs.influxdb]
# Note the port 9092, this is the default port that Kapacitor uses.
urls = ["http://localhost:9092"]
database = "telegraf"
user_agent = "telegraf"
# Read metrics about cpu usage
[cpu]
percpu = false
totalcpu = true
drop = ["cpu_time"]
```

Go ahead and start Telegraf with the above configuration.

```sh
$ telegraf -config telegraf.conf
```

It will complain about not being able to connect to Kapacitor, but that's fine; it will keep trying.


Now let's start Kapacitor:

```sh
$ kapacitord
```

That's it. In a few seconds Telegraf will connect to Kapacitor and start sending it CPU metrics.
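Those metrics travel as InfluxDB line protocol, the same wire format an InfluxDB server accepts. Here is a rough sketch of how a point is formatted; the tag and timestamp values are illustrative, not captured from a real stream, and the real protocol has escaping rules this skips:

```python
def line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as simplified InfluxDB line protocol:
    measurement[,tag=value...] field=value[,...] timestamp_ns
    (no escaping or string-field quoting, unlike the real protocol)."""
    tag_str = ''.join(',%s=%s' % kv for kv in sorted(tags.items()))
    field_str = ','.join('%s=%s' % kv for kv in sorted(fields.items()))
    return '%s%s %s %d' % (measurement, tag_str, field_str, ts_ns)

# An idle-CPU point like the ones this tutorial works with:
print(line_protocol('cpu_usage_idle', {'cpu': 'cpu-total'},
                    {'value': 92.5}, 1443738000000000000))
```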


Now we need to tell Kapacitor what to do.
Kapacitor's behavior is very dynamic, so it is not controlled via configuration but through an HTTP API.
We provide a simple CLI utility that calls the API to tell Kapacitor what to do.

We want to first create a snapshot of data for testing.

```sh
$ rid=$(kapacitor record stream -duration 60s) # save the id for later use
$ echo $rid
RECORDING_ID_HERE
```

OK, so we want to get an alert if the CPU usage gets too high.
We can define that like so:

```
stream
// Select just the cpu_usage_idle measurement
.from("cpu_usage_idle")
.alert()
// We are using idle so we want to check
// if idle drops below 70% (aka cpu used > 30%)
.predicate("value < 70")
// Post the data for the point to a URL
.post("http://localhost:8000");
```


The above script is called a `TICK` script.
It is written in a custom language that makes it easy to define actions on a series of data.
Go ahead and save the script to a file called `cpu_idle_alert.tick`.

Now that we have our `TICK` script we need to hand it to Kapacitor so it can run it.

```sh
$ kapacitor define -name cpu_alert -type streamer -tick cpu_idle_alert.tick
```

Here we have defined a `task` for Kapacitor to run. The `task` has a `name`, `type`, and a `tick` script.
The name needs to be unique and the type is `streamer` in this case since we are streaming data from Telegraf to Kapacitor.


Since the `alert` POSTs to a URL, we need to give Kapacitor something to hit.

In a separate terminal run this:

```sh
$ # Print to STDOUT anything POSTed to http://localhost:8000
$ mkfifo fifo
$ cat fifo | nc -k -l 8000 | tee fifo
```

You can `rm fifo` once you are done.
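If you don't have `nc` handy, a few lines of Python do the same job. This is just a stand-in listener for the tutorial, not part of Kapacitor:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertPrinter(BaseHTTPRequestHandler):
    """Print the body of any POSTed alert, then acknowledge it."""
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        print(self.rfile.read(length).decode('utf-8', errors='replace'))
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # suppress per-request access logs; only alert bodies print

def serve(port=8000):
    HTTPServer(('', port), AlertPrinter).serve_forever()
```

Call `serve()` and the alert's `.post("http://localhost:8000")` has somewhere to land.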


Now we want to see it in action. Replay the recording from a bit ago to the task called `cpu_alert`.

```sh
$ kapacitor replay -id $rid -name cpu_alert -fast
```

Did you catch any alerts? Maybe not, if your system wasn't too busy during the recording.
If not, let's lower the threshold so we see some alerts.

Note: the `-fast` flag tells Kapacitor to replay the data as fast as possible while still emulating the time in the recording.
Without `-fast`, Kapacitor replays the data in real time.
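The pacing idea is easy to picture. A rough sketch, not Kapacitor's actual code: every point keeps its recorded timestamp either way, and only the sleep between points changes with `-fast`:

```python
import time

def replay(points, fast=False):
    """Replay (timestamp, value) pairs in recorded order.

    Real-time mode sleeps out the recorded gap between points; fast
    mode skips the sleeps, but every point still carries its original
    timestamp, so time-based processing downstream is unaffected.
    """
    emitted = []
    prev_ts = None
    for ts, value in points:
        if not fast and prev_ts is not None:
            time.sleep(ts - prev_ts)  # reproduce the recorded gap
        prev_ts = ts
        emitted.append((ts, value))  # virtual clock = recorded timestamp
    return emitted
```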

Edit the `.predicate("value < 70")` line to be `.predicate("value < 99")`.
Now if your system is at least 1% busy you will get an alert.

Redefine the `task` so that Kapacitor knows about your update.

```sh
$ kapacitor define -name cpu_alert -type streamer -tick cpu_idle_alert.tick
$ # Now replay the data again and see if we got any alerts.
$ kapacitor replay -id $rid -name cpu_alert -fast
```


Since we recorded a snapshot of the data we can test again and again with the exact same dataset.
This is powerful both for reproducing bugs in your `TICK` scripts and for knowing that the data isn't changing with each test, which helps keep your sanity.
Run the replay again if you like to see that you get the exact same alerts.


But now we want to see it in action with the live data.
`Enable` your task so it starts working on the live data stream.

```sh
$ kapacitor enable cpu_alert
```

Now just about every second you are probably getting an alert that your system is busy.
That's way too noisy: we could move the threshold back up, but that isn't good enough.
We only want alerts when things are really bad. Try this:

```
stream
.from("cpu_usage_idle")
.alert()
.predicate("sigma(value) > 3")
.post("http://localhost:8000");
```

Just like that we have told Kapacitor to alert us only if the current value is more than `3 sigma` away from the running mean.
Now if the system CPU climbs throughout the day and drops throughout the night, you will still get an alert if it spikes at night or drops during the day!
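To see why this works, here is a sketch of a running sigma score using Welford's online algorithm. It illustrates the idea only; it is not Kapacitor's actual `sigma()` implementation:

```python
import math

class RunningSigma:
    """Track a running mean/stddev and score each new value in sigmas
    (Welford's online algorithm; an illustration, not Kapacitor's code)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        """Fold in x; return how many stddevs it sits from the running mean."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0
```

A steady idle value scores 0; a sudden drop scores high, which is exactly what `sigma(value) > 3` keys on.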


Stop the noise!

```sh
$ kapacitor define -name cpu_alert -type streamer -tick cpu_idle_alert.tick
$ # The old task definition continues to run until you disable/enable the task.
$ kapacitor disable cpu_alert
$ kapacitor enable cpu_alert
```

What about aggregating our alerts?
If the CPU data coming from Telegraf were tagged with a `service` name, we could do something like this:

```
stream
.from("cpu_usage_idle")
.groupBy("service")
.window()
.period(10s)
.every(5s)
.mapReduce(influxql.mean, "value")
.alert()
.predicate("sigma(value) > 3")
.post("http://localhost:8000");
```

This `TICK` script alerts, every `5s`, if the `mean` idle CPU over the last `10s` `window` for each `service` group is `3` sigma away from the running mean.
Just like that we are aggregating across potentially thousands of servers and getting alerts that are actionable, not just a bunch of noise.
Go ahead and try it out.
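To make the groupBy/window semantics concrete, here is a rough Python sketch of what that pipeline computes, with made-up sample points. Kapacitor does this internally; none of these names come from its API:

```python
from collections import defaultdict

def window_means(points, period, every):
    """Group (time, service, value) points by service, then emit the
    mean over a sliding `period`-second window every `every` seconds.
    A simplified model of groupBy/window/mapReduce, for illustration."""
    by_service = defaultdict(list)
    for t, service, value in points:
        by_service[service].append((t, value))
    end = max(t for t, _, _ in points)
    results = []
    for service, pts in sorted(by_service.items()):
        t = period  # first window closes one full period in
        while t <= end + every:
            window = [v for (pt, v) in pts if t - period < pt <= t]
            if window:
                results.append((t, service, sum(window) / len(window)))
            t += every
    return results
```

Each emitted tuple is one candidate the `alert()` node would then score with `sigma()`.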




