- #621: Fix obscure error about single vs double quotes.
- #623: Fix issues with recording metadata missing data url.
The ability to create and use template tasks has been added. You can define a template for a task and reuse that template across multiple tasks.
A simple example:
// Which measurement to consume
var measurement string
// Optional where filter
var where_filter = lambda: TRUE
// Optional list of group by dimensions
var groups = [*]
// Which field to process
var field string
// Warning criteria, has access to 'mean' field
var warn lambda
// Critical criteria, has access to 'mean' field
var crit lambda
// How much data to window
var window = 5m
// The slack channel for alerts
var slack_channel = '#alerts'
stream
    |from()
        .measurement(measurement)
        .where(where_filter)
        .groupBy(groups)
    |window()
        .period(window)
        .every(window)
    |mean(field)
    |alert()
        .warn(warn)
        .crit(crit)
        .slack()
        .channel(slack_channel)
Then you can define the template like so:
kapacitor define-template generic_mean_alert -tick path/to/above/script.tick -type stream
Next define a task that uses the template:
kapacitor define cpu_alert -template generic_mean_alert -vars cpu_vars.json -dbrp telegraf.default
Where cpu_vars.json would look like this:
{
    "measurement": {"type" : "string", "value" : "cpu" },
    "where_filter": {"type": "lambda", "value": "\"cpu\" == 'cpu-total'"},
    "groups": {"type": "list", "value": [{"type":"string", "value":"host"},{"type":"string", "value":"dc"}]},
    "field": {"type" : "string", "value" : "usage_idle" },
    "warn": {"type" : "lambda", "value" : " \"mean\" < 30.0" },
    "crit": {"type" : "lambda", "value" : " \"mean\" < 10.0" },
    "window": {"type" : "duration", "value" : "1m" },
    "slack_channel": {"type" : "string", "value" : "#alerts_testing" }
}
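The variables file can also be generated rather than written by hand. The following is a minimal Python sketch, assuming only the type/value layout shown in cpu_vars.json above. The helper names var and build_cpu_vars are hypothetical and are not part of Kapacitor.

```python
import json


def var(type_name, value):
    """Wrap a value in the {"type": ..., "value": ...} layout used by cpu_vars.json."""
    return {"type": type_name, "value": value}


def build_cpu_vars():
    """Return the same variables as the cpu_vars.json example (hypothetical helper)."""
    return {
        "measurement": var("string", "cpu"),
        "where_filter": var("lambda", "\"cpu\" == 'cpu-total'"),
        # A list variable holds typed entries of its own.
        "groups": var("list", [var("string", "host"), var("string", "dc")]),
        "field": var("string", "usage_idle"),
        "warn": var("lambda", " \"mean\" < 30.0"),
        "crit": var("lambda", " \"mean\" < 10.0"),
        "window": var("duration", "1m"),
        "slack_channel": var("string", "#alerts_testing"),
    }


if __name__ == "__main__":
    # Print the JSON document; redirect it to a file to use it with -vars.
    print(json.dumps(build_cpu_vars(), indent=4))
```

Redirecting the output to cpu_vars.json gives a file that can be passed to the -vars flag of the define command shown above.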
With this release you can now replay data directly against a task from InfluxDB without having to first create a recording.
Replay the queries defined in the batch task cpu_alert for the past 10 hours.
kapacitor replay-live batch -task cpu_alert -past 10h
Or, for a stream task, use a query directly:
kapacitor replay-live query -task cpu_alert -query 'SELECT usage_idle FROM telegraf."default".cpu WHERE time > now() - 10h'
Now InfluxDB and Kapacitor support HTTP/S based subscriptions. This means that Kapacitor need only listen on a single port for the HTTP service, greatly simplifying configuration and setup.
In order to start using HTTP subscriptions, change the subscription-protocol option for your configured InfluxDB clusters.
For example:
[[influxdb]]
enabled = true
urls = ["http://localhost:8086",]
subscription-protocol = "http"
# or to use https
#subscription-protocol = "https"
On startup Kapacitor will detect the change and recreate the subscriptions in InfluxDB to use the HTTP protocol.
NOTE: While HTTP itself is a TCP transport such that packet loss shouldn't be an issue, if Kapacitor starts to slow down for whatever reason, InfluxDB will drop the subscription writes to Kapacitor. In order to know if subscription writes are being dropped you should monitor the measurement _internal.monitor.subscriber for the field writeFailures.
This release contains a new Holt-Winters InfluxQL function.
With this forecasting method one can now define an alert based on forecasted future values.
For example, the following TICKscript will take the last 30 days of disk usage stats and use Holt-Winters to forecast the next 7 days. If the forecasted value crosses a threshold, an alert is triggered.
The result is that Kapacitor will now alert you 7 days in advance of a disk filling up. This assumes slow growth, but by changing the vars in the script you could check for shorter growth intervals.
// The interval on which to aggregate the disk usage
var growth_interval = 1d
// The number of `growth_interval`s to forecast into the future
var forecast_count = 7
// The amount of historical data to use for the fit
var history = 30d
// The critical threshold on used_percent
var threshold = 90.0
batch
    |query('''
        SELECT max(used_percent) as used_percent
        FROM "telegraf"."default"."disk"
    ''')
        .period(history)
        .every(growth_interval)
        .align()
        .groupBy(time(growth_interval), *)
    |holtWinters('used_percent', forecast_count, 0, growth_interval)
        .as('used_percent')
    |max('used_percent')
        .as('used_percent')
    |alert()
        // Trigger alert if the forecasted disk usage is greater than threshold
        .crit(lambda: "used_percent" > threshold)
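To get a feel for what the TICKscript above computes, here is a rough Python sketch of trend-only exponential smoothing (Holt's linear method). It is not the InfluxQL implementation. It assumes that the 0 passed as the third argument to holtWinters means no seasonal pattern, and it uses fixed smoothing factors alpha and beta rather than fitting them, as a real implementation likely would. The function names are hypothetical.

```python
def holt_forecast(values, horizon, alpha=0.5, beta=0.5):
    """Forecast `horizon` future points using Holt's linear (trend-only) smoothing.

    alpha and beta are assumed, fixed smoothing factors for level and trend.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first value and the trend with the first difference.
    level = values[0]
    trend = values[1] - values[0]
    for observed in values[1:]:
        previous_level = level
        level = alpha * observed + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Project the final level forward along the final trend.
    return [level + step * trend for step in range(1, horizon + 1)]


def will_cross(values, horizon, threshold):
    """Mirror the script's max() followed by a crit check on the forecast."""
    return max(holt_forecast(values, horizon)) > threshold


if __name__ == "__main__":
    daily_used_percent = [70.0, 72.0, 74.0, 76.0, 78.0]
    print(holt_forecast(daily_used_percent, 7))
    print(will_cross(daily_used_percent, 7, 90.0))
```

With steadily growing input, the forecast simply extends the line, so the check fires as soon as the projected maximum within the horizon exceeds the threshold.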
- #283: Add live replays.
- #500: Support Float, Integer, String and Boolean types.
- #82: Multiple services for PagerDuty alert. Thanks @savagegus!
- #558: Preserve fields as well as tags on selector InfluxQL functions.
- #259: Template Tasks have been added.
- #562: HTTP based subscriptions.
- #595: Support counting and summing empty batches to 0.
- #596: Support new group by time offset i.e. time(30s, 5s)
- #416: Track ingress counts by database, retention policy, and measurement. Expose stats via cli.
- #586: Add spread stateful function. thanks @upccup!
- #600: Close the HTTP response after the alert handler POST. Thanks @jsvisa!
- #606: Add Holt-Winters forecasting method.
- #605: BREAKING: StatsNode for a batch edge now counts the number of points in a batch instead of counting batches as a whole. This is only breaking if you have a deadman switch configured on a batch edge.
- #611: Adds bash completion to the kapacitor CLI tool.
- #540: Fixes bug with log level API endpoint.
- #521: EvalNode now honors groups.
- #561: Fixes bug when lambda expressions would return error about types with nested binary expressions.
- #555: Fixes bug where "time" functions didn't work in lambda expressions.
- #570: Removes panic in SMTP service on failed close connection.
- #587: Allow number literals without leading zeros.
- #584: Do not block during startup to send usage stats.
- #553: Periodically check if new InfluxDB DBRPs have been created.
- #602: Fix missing To property on email alert handler.
- #581: Record/Replay batch tasks get cluster info from task not API.
- #613: BREAKING: Allow the ID of templates and tasks to be updated via the PATCH method. The breaking change is that PATCH requests now return a 200 with the template or task definition, where before they returned a 204.
Breaking changes may require special upgrade steps from versions <= 0.12. Please read the 0.13.0 release notes.
Along with the API changes of 0.13.0, validation logic was added to task IDs, but this was not well documented. This minor release remedies that.
All IDs (tasks, recordings, replays) must match this regex ^[-\._\p{L}0-9]+$, which is essentially numbers, unicode letters, '-', '.' and '_'.
If you have existing tasks which do not match this pattern they should continue to function normally.
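The ID rule above can be checked before submitting a task. Below is a minimal Python sketch of an equivalent check. Python's built-in re module does not support \p{L}, so the sketch tests the Unicode general category of each character instead. The function name is hypothetical, and edge cases may differ slightly from Kapacitor's own validation if the two runtimes ship different Unicode versions.

```python
import unicodedata

# Punctuation explicitly allowed by the pattern ^[-\._\p{L}0-9]+$
ALLOWED_PUNCTUATION = {"-", ".", "_"}
ASCII_DIGITS = set("0123456789")


def is_valid_id(candidate):
    """Return True if every character is a Unicode letter, an ASCII digit, '-', '.' or '_'."""
    if not candidate:
        # The pattern requires at least one character.
        return False
    for ch in candidate:
        if ch in ALLOWED_PUNCTUATION or ch in ASCII_DIGITS:
            continue
        # General categories Lu, Ll, Lt, Lm and Lo together correspond to \p{L}.
        if unicodedata.category(ch).startswith("L"):
            continue
        return False
    return True


if __name__ == "__main__":
    for task_id in ["cpu_alert", "cpu alert", "température-1.0", "a/b"]:
        print(task_id, is_valid_id(task_id))
```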
- #545: Fixes inconsistency with API docs for creating a task.
- #544: Fixes issues with existing tasks and invalid names.
- #543: Fixes default values not being set correctly in API calls.
Breaking changes may require special upgrade steps. Please read below.
Changes to how and where task data is stored have been made. In order to safely upgrade to version 0.13 you need to follow these steps:
- Upgrade InfluxDB to version 0.13 first.
- Update all TICKscripts to use the new | and @ operators. Once Kapacitor no longer issues any DEPRECATION warnings you are ready to begin the upgrade. The upgrade will work without this step, but tasks using the old syntax cannot be enabled until they are modified to use the new syntax.
- Upgrade the Kapacitor binary/package.
- Configure the new database location. By default the location /var/lib/kapacitor/kapacitor.db is chosen for package installs or ./kapacitor.db for manual installs. Do not remove the configuration for the location of the old task.db database file since it is still needed to do the migration.
    [storage]
    boltdb = "/var/lib/kapacitor/kapacitor.db"
- Restart Kapacitor. At this point Kapacitor will migrate all existing data to the new database file. If any errors occur Kapacitor will log them and fail to start up. This way, if Kapacitor starts up you can be sure the migration was a success and can continue normal operation. The old database is opened in read-only mode so that existing data cannot be corrupted. It's recommended to start Kapacitor in debug logging mode for the migration so you can follow the details of the migration process.
At this point you may remove the configuration for the old task dir and restart Kapacitor to ensure everything is working.
Kapacitor will attempt the migration on every startup while the old configuration and db file exist, but will skip any data that was already migrated.
With this release the API has been updated to what we believe will be the stable version for a 1.0 release. Small changes may still be made but the significant work to create a RESTful HTTP API is complete. Many breaking changes were introduced; see the client/API.md doc for details on how the API works now.
Along with the API changes, breaking changes were also made to the kapacitor CLI command.
Here is a breakdown of the CLI changes:
- Everything has an ID now: tasks, recordings, even replays. The name used before to define a task is now its ID. As such, instead of using -name and -id to refer to tasks and recordings, the flags have been changed to -task and -recording accordingly.
- Replays can be listed and deleted like tasks and recordings.
- Replays default to fast clock mode.
- The record and replay commands now have a -no-wait option to start but not wait for the recording/replay to complete.
- Listing recordings and replays displays the status of the respective action.
- Record and Replay commands now have an optional flag -replay-id/-recording-id to specify the ID of the replay or recording. If not set then a random ID will be chosen like the previous behavior.
UDFs can now be managed externally to Kapacitor via Unix sockets. A process or container can be launched independently of Kapacitor, exposing a socket. On startup Kapacitor will connect to the socket and begin communication.
Example UDF config for a socket based UDF.
[udf]
[udf.functions]
[udf.functions.myCustomUDF]
socket = "/path/to/socket"
timeout = "10s"
Alert data can now be consumed directly from within TICKscripts.
For example, let's say we want to store all data that triggered an alert in InfluxDB with a tag level containing the level string value (e.g. CRITICAL).
...
    |alert()
        .warn(...)
        .crit(...)
        .levelTag('level')
        // and/or use a field
        //.levelField('level')
        // Also tag the data with the alert ID
        .idTag('id')
        // and/or use a field
        //.idField('id')
    |influxDBOut()
        .database('alerts')
...
- #360: Forking tasks by measurement in order to improve performance
- #386: Adds official Go HTTP client package.
- #399: Allow disabling of subscriptions.
- #417: UDFs can be connected over a Unix socket. This enables UDFs from across Docker containers.
- #451: StreamNode supports |groupBy and |where methods.
- #93: AlertNode now outputs data to child nodes. The output data can have either a tag or field indicating the alert level.
- #281: AlertNode now has an .all() property that specifies that all points in a batch must match the criteria in order to trigger an alert.
- #384: Add elapsed function to compute the time difference between subsequent points.
- #230: Alert.StateChangesOnly now accepts optional duration arg. An alert will be triggered for every interval even if the state has not changed.
- #426: Add skip-format query parameter to the GET /task endpoint so that returned TICKscript content is left unmodified from the user input.
- #388: The duration of an alert is now tracked and exposed as part of the alert data as well as can be set as a field via .durationField('duration').
- #486: Default config file location.
- #461: Make Alerta event property configurable.
- #491: BREAKING: Rewriting stateful expression in order to improve performance, the only breaking change is: short circuit evaluation for booleans - for example: lambda: "bool_value" && (count() > 100) if "bool_value" is false, we won't evaluate "count".
- #504: BREAKING: Many changes to the API and underlying storage system. This release requires a special upgrade process.
- #511: Adds DefaultNode for providing default values for missing fields or tags.
- #285: Track created, modified and last enabled dates on tasks.
- #533: Add useful statistics for nodes.
- #499: Fix panic in InfluxQL nodes if field is missing or incorrect type.
- #441: Fix panic in UDF code.
- #429: BREAKING: Change TICKscript parser to be left-associative on equal precedence operators. For example previously this statement (1+2-3*4/5) was evaluated as (1+(2-(3*(4/5)))) which is not the typical/expected behavior. Now using left-associative parsing the statement is evaluated as ((1+2)-((3*4)/5)).
- #456: Fixes Alerta integration to let server set status, fix rawData attribute and set default severity to indeterminate.
- #425: BREAKING: Preserving tags on influxql simple selectors - first, last, max, min, percentile
- #423: Recording stream queries with group by now correctly saves data in time order not group by order.
- #331: Fix panic when missing .as() for JoinNode.
- #523: JoinNode will now emit join sets as soon as they are ready. If multiple joinable sets arrive in the same tolerance window then each will be emitted (previously the first points were dropped).
- #537: Fix panic in alert node when batch is empty.
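The associativity change in #429 above is easiest to see with operators of equal precedence only. The following Python sketch is an illustration of the two grouping rules, not Kapacitor's parser. The function names and the choice of the expression 10 - 4 - 3 are hypothetical.

```python
import operator

# Only equal-precedence operators are modelled in this sketch.
OPS = {"+": operator.add, "-": operator.sub}


def eval_left(first, rest):
    """Left-associative grouping: ((a op b) op c) ..."""
    acc = first
    for op, value in rest:
        acc = OPS[op](acc, value)
    return acc


def eval_right(first, rest):
    """Right-associative grouping: (a op (b op (c ...)))"""
    if not rest:
        return first
    op, value = rest[0]
    return OPS[op](first, eval_right(value, rest[1:]))


if __name__ == "__main__":
    expression = (10, [("-", 4), ("-", 3)])  # 10 - 4 - 3
    print(eval_left(*expression))   # grouping (10 - 4) - 3
    print(eval_right(*expression))  # grouping 10 - (4 - 3)
```

The left-associative grouping gives the result most readers expect from 10 - 4 - 3, which is why scripts that relied on the old grouping may need explicit parentheses.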
New TICKscript syntax that uses different operators for chaining methods vs property methods vs UDF methods.
- A chaining method is a method that creates a new node in the pipeline. Uses the | operator.
- A property method is a method that changes a property on a node. Uses the . operator.
- A UDF method is a method that calls out to a UDF. Uses the @ operator.
For example, below the from, mean, and alert methods create new nodes, the detectAnomalies method calls a UDF, and the other methods modify the nodes as property methods.
stream
    |from()
        .measurement('cpu')
        .where(lambda: "cpu" == 'cpu-total')
    |mean('usage_idle')
        .as('value')
    @detectAnomalies()
        .field('mean')
    |alert()
        .crit(lambda: "anomaly_score" > 10)
        .log('/tmp/cpu.log')
With this change a new binary, tickfmt, is provided with Kapacitor. It formats a TICKscript file according to a common standard.
- #299: Changes TICKscript chaining method operators and adds tickfmt binary.
- #389: Adds benchmarks to Kapacitor for basic use cases.
- #390: BREAKING: Remove old .mapReduce functions.
- #381: Adding enable/disable/delete/reload tasks by glob.
- #401: Add .align() property to BatchNode so you can align query start and stop times.
- #378: Fix issue where derivative would divide by zero.
- #387: Add .quiet() option to EvalNode so errors can be suppressed if expected.
- #400: All query/connection errors are counted and reported in BatchNode stats.
- #412: Fix issues with batch queries dropping points because of nil fields.
- #413: Allow disambiguation between ".groupBy" and "|groupBy".
Kapacitor is now using the functions from the new query engine in InfluxDB core.
Along with this change is a change in the TICKscript API so that using the InfluxQL functions is easier.
Simply call the desired method directly; there is no need to call .mapReduce explicitly.
This change now hides the mapReduce aspect and handles it internally.
Using .mapReduce is officially deprecated in this release and will be removed in the next major release.
We feel that this change improves the readability of TICKscripts and exposes fewer implementation details to the end user.
Updating your existing TICKscripts is simple.
If previously you had code like this:
stream.from()...
    .window()...
    .mapReduce(influxql.count('value'))
then update it to look like this:
stream.from()...
    .window()...
    .count('value')
A simple regex could fix all your existing scripts.
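One possible form of such a regex, as a minimal Python sketch: it rewrites .mapReduce(influxql.fn(args)) to .fn(args). It assumes the arguments contain no nested parentheses, so calls with more complex arguments would need manual attention. The function name migrate is hypothetical.

```python
import re

# Matches .mapReduce(influxql.<fn>(<args>)) where <args> has no nested parentheses.
MAP_REDUCE = re.compile(r"\.mapReduce\(\s*influxql\.(\w+)\(([^()]*)\)\s*\)")


def migrate(script):
    """Replace explicit mapReduce calls with the direct method form."""
    return MAP_REDUCE.sub(r".\1(\2)", script)


if __name__ == "__main__":
    old_script = "stream.from()\n    .window()\n    .mapReduce(influxql.count('value'))"
    print(migrate(old_script))
```

Review the rewritten scripts before redefining tasks, since the pattern leaves any call it cannot match untouched.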
Kapacitor now exposes more internal metrics for determining the performance of a given task.
The internal statistics include a new measurement named node that contains any stats a node provides, tagged by the task, node, task type and kind of node (i.e. window vs union).
All nodes provide an averaged execution time for the node.
These stats are also available in the DOT output of the Kapacitor show command.
Significant performance improvements have also been added. In some cases Kapacitor throughput has improved by 4X.
Kapacitor can now connect to different InfluxDB clusters.
Multiple InfluxDB config sections can be defined and one will be marked as default.
To upgrade, convert an influxdb config.
From this:
[influxdb]
enabled = true
...
to this:
[[influxdb]]
enabled = true
default = true
name = "localhost"
...
Various improvements to joining features have been implemented. With #144 you can now join streams with differing group by dimensions.
If you previously configured Email, Slack or HipChat globally, you must now also set the state-changes-only option to true if you want to preserve the original behavior.
For example:
[slack]
enabled = true
global = true
state-changes-only = true
- #236: Implement batched group by
- #231: Add ShiftNode so values can be shifted in time for joining/comparisons.
- #190: BREAKING: Deadman's switch now triggers off emitted counts and is grouped by the original grouping of the data. The breaking change is that the 'collected' stat is no longer output for .stats and has been replaced by emitted.
- #145: The InfluxDB Out Node now writes data to InfluxDB in buffers.
- #215: Add performance metrics to nodes for average execution times and node throughput values.
- #144: Can now join streams with differing dimensions using the join.On property.
- #249: Can now use InfluxQL functions directly instead of via the MapReduce method. Example: stream.from().count().
- #233: BREAKING: Now you can use multiple InfluxDB clusters. The config changes to make this possible are breaking. See notes above for changes.
- #302: Can now use .Time in alert message.
- #239: Support more detailed TLS config when connecting to an InfluxDB host.
- #323: Stats for task execution are provided via JSON HTTP request instead of just DOT string. Thanks @yosiat!
- #358: Improved logging. Adds LogNode so any data in a pipeline can be logged.
- #366: HttpOutNode now allows chaining methods.
- #199: BREAKING: Various fixes for the Alerta integration. The event property has been removed from the Alerta node and is now set as the value of the alert ID.
- #232: Better error message for alert integrations. Better error message for VictorOps 404 response.
- #231: Fix window logic when there were gaps in the data stream longer than window every value.
- #213: Add SourceStreamNode so that you must always first call .from on the stream object before filtering it, so as not to create TICKscripts that are confusing to understand.
- #255: Add OPTIONS handler for task delete method so it can be preflighted.
- #258: Fix UDP internal metrics, change subscriptions to use clusterID.
- #240: BREAKING: Fix issues with Sensu integration. The breaking change is that the config no longer takes a url but rather a host option since the communication is raw TCP rather than HTTP.
- #270: The HTTP server will now gracefully stop.
- #300: Add OPTIONS method to /recording endpoint for deletes.
- #304: Fix panic if recording query but do not have an InfluxDB instance configured
- #289: Add better error handling to batch node.
- #142: Fixes bug when defining multiple influxdb hosts.
- #266: Fixes error log for HipChat that is not an error.
- #333: Fixes hang when replaying with .stats node. Fixes issues with batch and stats.
- #340: BREAKING: Decouples global setting for alert handlers from the state changes only setting.
- #348: config.go: refactor to simplify structure and fix support for array elements
- #362: Fix bug with join tolerance and batches.
This is a bug fix release that fixes many issues related to the recent 0.10.0 release. The few additional features are focused on usability improvements from recent feedback.
Improved UDFs, lots of bug fixes and improvements on the API. There was a breaking change for UDFs protobuf messages, see #176.
There was a breaking change to the define command, see #173 below.
- #176: BREAKING: Improved UDFs and groups. Now it is easy to deal with groups from the UDF process. There is a breaking change in the BeginBatch protobuf message for this change.
- #196: Adds a 'details' property to the alert node so that the email body can be defined. See also #75.
- #132: Make it so multiple calls to where simply AND expressions together instead of replacing or creating extra nodes in the pipeline.
- #173: BREAKING: Added a -no-reload flag to the define command in the CLI. Now if the task is enabled define will automatically reload it unless -no-reload is passed.
- #194: Adds Talk integration for alerts. Thanks @wutaizeng!
- #320: Upgrade to go 1.6
- #177: Fix panic for show command on batch tasks.
- #185: Fix panic in define command with invalid dbrp value.
- #195: Fix panic in where node.
- #208: Add default stats dbrp to default subscription excludes.
- #203: Fix hang when deleting invalid batch task.
- #182: Fix missing/incorrect Content-Type headers for various HTTP endpoints.
- #187: Retry connecting to InfluxDB on startup for up to 5 minutes by default.
This release marks the next major release of Kapacitor. With this release you can now run your own custom code for processing data within Kapacitor. See udf/agent/README.md for more details.
With the addition of UDFs it is now possible to run custom anomaly detection algorithms suited to your needs. There are simple examples of how to use UDFs in udf/agent/examples.
The version has jumped significantly so that it is in line with other projects in the TICK stack. This way you can easily tell which versions of Telegraf, InfluxDB, Chronograf and Kapacitor work together.
See note on a breaking change in the HTTP API below. #163
- #137: Add deadman's switch. Can be setup via TICKscript and globally via configuration.
- #72: Add support for User Defined Functions (UDFs).
- #139: Alerta.io support. Thanks @md14454!
- #85: Sensu support using JIT clients. Thanks @sstarcher!
- #141: Time of day expressions for silencing alerts.
- #153: Fix panic if referencing nonexistent field in MapReduce function.
- #138: Change over to influxdata github org.
- #164: Update imports etc from InfluxDB as per the new meta store/client changes.
- #163: BREAKING CHANGE: Removed the 'api/v1' pathing from the HTTP API so that Kapacitor is path compatible with InfluxDB. While this is a breaking change, the kapacitor cli has been updated accordingly and you will not experience any disruptions unless you were calling the HTTP API directly.
- #147: Compress .tar archives from builds.
- #118: Can now define multiple handlers of the same type on an AlertNode.
- #119: HipChat support. Thanks @ericiles! *2
- #113: OpsGenie support. Thanks @ericiles!
- #107: Enable TICKscript variables to be defined and then referenced from lambda expressions. Also fixes various bugs around using regexes.
- #124: Fix panic where there is an error starting a task.
- #122: Fixes panic when using WhereNode.
- #128: Fix not sending emails when using recipient list from config.
Bugfix #106 made a breaking change to the internal HTTP API. This was to facilitate integration testing and overall better design. Now POSTing a recording request will start the recording and immediately return. If you want to wait until it is complete, do a GET for the recording info and it will block until it's complete. The kapacitor cli has been updated accordingly.
- #96: Use KAPACITOR_URL env var for setting the kapacitord url in the client.
- #109: Add throughput counts to DOT format in kapacitor show command, if task is executing.
- #102: Fix race when starting/stopping timeTicker in batch.go
- #106: Fix hang when replaying stream recording.
Some bug fixes, including one that caused Kapacitor to deadlock.
- #83: Use enterprise usage client, remove deprecated enterprise register and reporting features.
- #86: Fix deadlock from errors in tasks. Also fixes issue where task failures did not get logged.
- #95: Fix race in bolt usage when starting enabled tasks at startup.
Major public release.