- #2448: Changes the alert-handler match function duration() to be alertDuration() to avoid name collision with the type conversion function of the same name.
- #1839: Add Subscription path configuration option to allow Kapacitor to run behind a reverse proxy, thanks @aspring
- #1894: Add HTTP sources for sideload configuration, thanks @jregovic!
- #2055: Add support for correlate in the Alerta AlertNode, thanks @nermolaev!
- #2409: Optionally use kapacitor alert details as opsgenie description text, thanks @JamesClonk!
- #2441: Preallocate GroupIDs for increased performance by reducing allocations.
- #2456: Gzip data by default that is sent to influxdb.
- #2454: Add PrimaryProperty and SecondaryProperty methods to BigPanda AlertNode.
- #2462: BREAKING: we are forced to remove support for 386/darwin builds as go doesn't support them anymore.
- #2475: Add default color theme to teams alerts
- #2301: Allow for overriding OpsGenieV2's alert recovery action in tickSCRIPT, thanks @zabullet!
- #2388: Added templating for the url in the
httpPost
node and thealert().post()
node. - #2351: Upgraded github.com/gorhill/cronexpr, thanks @wuguanyu!
- #2416: Added a ServiceNow event handler.
- #2201: Added missing err check of a buf scanner, thanks @johncming!
- #2395: Added missing .Details to AlertTemplate.
- #1965: Alert handler for Microsoft Teams, thanks @mmindenhall!
- #2287: Added Discord Webhook Alert Handler, thanks @mattnotmitt!
- #2311: UDF Agent Python3 fixes, thanks @elohmeier!
- #2312: feat(build): switch from md5 to sha256
- #2322: Add support for TLS 1.3.
- #1980: Fix discovery service lost config, thanks @flisky!
- #2156: Use Systemd for Amazon Linux 2
- #2282: fix small typo.
- #2286: Corrected issue with
go vet
invocation in .hooks/pre-commit which would cause the hook to fail, thanks @mattnotmitt! - #2289: Update build.py to support arm64, thanks @povlhp
- #2335: Fix panic when setting a zero interval for ticker, this affected deadman and stats nodes.
- #2340: Fix a panic on int div-by-zero, instead return an error.
- #2358: Fix Kapacitor ignoring the pushover().userKey('') TICKScript operation.
- #2319: Update kafka lib; make kafka errors unsilent.
- #2312: Switched from md5 to sha256 for release checksums.
- #2202: Add templating for MQTT topics.
- #2276: Upgrade to support python 3 for UDFs, Thanks @N-Coder !
- #2253: Upgrade the kafka library to set the timestamp correctly.
- #2274: Upgrade to Go 1.13, fixes various go vet issues.
- #2154: Add ability to skip ssl verification with an alert post node. Thanks @itsHabib!
- #2193: Add TLS configuration options.
- #2167: Use default transport consistently.
- #2144: Fix deadlock in barrier node when delete is used.
- #2186: Make RPM create files with correct ownership on install.
- #2189: Delete group stats when a group is deleted
- #2207: Avoid extra allocation when building GroupID
- #2095: Add barrier node support to join node.
- #1157: Add ability to expire groups using the barrier node.
- #2099: Add
alert/persist-topics
to config - #2101: Add multiple field support to the change detect node.
- #1961: Add links to pagerduty2 alerts
- #1974: Add additional metadata to Sensu alerts.
- #2048: Fix join not catching up fast enough after a pause in the data stream.
- #1938: pagerduty2 should use routingKey rather than serviceKey
- #1982: Fix KafkaTopic not working from TICKscript
- #1989: Improve Kafka alert throughput.
- #1842: Add alert inhibitors that allow an alert to suppress events from other matching alerts.
- #1833: Config format updated to allow for more than one slack configuration.
- #1844: Added a new kapacitor node changeDetect that emits a value for each time a series field changes.
- #1828: Add recoverable field to JSON alert response to indicate whether the alert will auto-recover.
- #1823: Update OpsGenie integration to use the v2 API.
To upgrade to using the new API simply update your config and TICKscripts to use opsgenie2 instead of opsgenie.
If your
opsgenie
config uses therecovery_url
option, foropsgenie2
you will need to change it to therecovery_action
option. This is because the new v2 API is not structured with static URLs, and so only the action can be defined and not the entire URL. - #1690: Add https-private-key option to httpd config.
- #1561: Add .quiet to all nodes to silence any errors reported by the node.
- #1826: Add Kafka alert handler.
- #1794: Kapacitor ticks generating a hash instead of their actual given name.
- #1827: Fix deadlock in load service when task has an error.
- #1795: Support PagerDuty API v2
- #1776: Fix bug where you could not delete a topic handler with the same name as its topic.
- #1905: Adjust PagerDuty v2 service-test names and capture detailed error messages.
- #1913: Fix Kafka configuration.
- #1834: Fix bug where task type was invalid when using var for stream/batch
The v1.4.0 release has many new features, here is a list of some of the highlights:
- Load TICKscripts and alert handlers from a directory.
- Structed Logging with a logging API endpoints to be able to tail logs for given tasks.
- Autoscale support for Docker Swarm and EC2 Autoscaling.
- Sideload data into your TICKscript streams from external sources.
- Fully customizable POST body for the alert POST handler and the httpPost node.
See the complete list of bug fixes and features below.
- #1710: Idle Barrier is dropping all messages when source has clock offset
- #1719: Fix oddly generated TOML for mqtt & httppost
- #1703: Fix issues where log API checked the wrong header for the desired content type.
- #1622: Add support for AWS EC2 autoscaling services.
- #1566: Add BarrierNode to emit BarrierMessage periodically
- #1250: Fix VictorOps "data" field being a string instead of actual JSON.
- #1697: Fix panic with MQTT toml configuration generation.
- #1408: Add Previous state
- #1575: Add support to persist replay status after it finishes.
- #1461: alert.post and https_post timeouts needed.
- #1413: Add subscriptions modes to InfluxDB subscriptions.
- #1436: Add linear fill support for QueryNode.
- #1345: Add MQTT Alert Handler
- #1390: Add built in functions to convert timestamps to integers
- #1425: BREAKING: Change over internal API to use message passing semantics. The breaking change is that the Combine and Flatten nodes previously, but erroneously, operated across batch boundaries; this has been fixed.
- #1497: Add support for Docker Swarm autoscaling services.
- #1485: Add bools field types to UDFs.
- #1549: Add stateless now() function to get the current local time.
- #1545: Add support for timeout, tags and service template in the Alerta AlertNode
- #1568: Add support for custom HTTP Post bodies via a template system.
- #1569: Add support for add the HTTP status code as a field when using httpPost
- #1535: Add logfmt support and refactor logging.
- #1481: Add ability to load tasks/handlers from dir.
TICKscript was extended to be able to describe a task exclusively through a tickscript.
* tasks no longer need to specify their TaskType (Batch, Stream).
*
dbrp
expressions were added to tickscript. Topic-Handler file format was modified to include the TopicID and HandlerID in the file. Load service was added; the service can load tasks/handlers from a directory. - #1606: Update Go version to 1.9.1
- #1578: Add support for exposing logs via the API. API is released as a technical preview.
- #1605: Add support for {{ .Duration }} on Alert Message property.
- #1644: Add support for JSON lines for steaming HTTP logs.
- #1637: Add new node Sideload, that allows loading data from files into the stream of data. Data can be loaded using a hierarchy.
- #1667: Promote Alert API to stable v1 path.
- #1668: Change WARN level logs to INFO level.
- #916: Crash of Kapacitor on Windows x64 when starting a recording
- #1400: Allow for
.yml
file extensions indefine-topic-handler
- #1402: Fix http server error logging.
- #1500: Fix bugs with stopping running UDF agent.
- #1470: Fix error messages for missing fields which are arguments to functions are not clear
- #1516: Fix bad PagerDuty test the required server info.
- #1581: Add SNMP sysUpTime to SNMP Trap service
- #1547: Fix panic on recording replay with HTTPPostHandler.
- #1623: Fix k8s incluster master api dns resolution
- #1630: Remove the pidfile after the server has exited.
- #1641: Logs API writes multiple http headers.
- #1657: Fix missing dependency in rpm package.
- #1660: Force tar owner/group to be root.
- #1663: Fixed install/remove of kapacitor on non-systemd Debian/Ubuntu systems. Fixes packaging to not enable services on RHEL systems. Fixes issues with recusive symlinks on systemd systems.
- #1662: Fix invalid default MQTT config.
- #1520: Expose pprof without authentication if enabled
- #1512: Use details field from alert node in PagerDuty.
- #1415: Proxy from environment for HTTP request to slack
- #1414: Fix derivative node preserving fields from previous point in stream tasks.
The v1.3.0 release has two major features.
- Addition of scraping and discovering for Prometheus style data collection.
- Updates to the Alert Topic system
Here is a quick example of how to configure Kapacitor to scrape discovered targets. First configure a discoverer, here we use the file-discovery discoverer. Next configure a scraper to use that discoverer.
NOTE: The scraping and discovering features are released under technical preview, meaning that the configuration or API around the feature may change in a future release.
# Configure file discoverer
[[file-discovery]]
enabled = true
id = "discover_files"
refresh-interval = "10s"
##### This will look for prometheus json files
##### File format is here https://prometheus.io/docs/operating/configuration/#%3Cfile_sd_config%3E
files = ["/tmp/prom/*.json"]
# Configure scraper
[[scraper]]
enabled = true
name = "node_exporter"
discoverer-id = "discover_files"
discoverer-service = "file-discovery"
db = "prometheus"
rp = "autogen"
type = "prometheus"
scheme = "http"
metrics-path = "/metrics"
scrape-interval = "2s"
scrape-timeout = "10s"
Add the above snippet to your kapacitor.conf file.
Create the below snippet as the file /tmp/prom/localhost.json
:
[{
"targets": ["localhost:9100"]
}]
Start the Prometheus node_exporter locally.
Now startup Kapacitor and it will discover the localhost:9100
node_exporter target and begin scrapping it for metrics.
For more details on the scraping and discovery systems see the full documentation here.
The second major feature with this release, are changes to the alert topic system. The previous release introduce this new system as a technical preview, with this release the alerting service has been simplified. Alert handlers now only ever have a single action and belong to a single topic.
The handler definition has been simplified as a result. Here are some example alert handlers using the new structure:
id: my_handler
kind: pagerDuty
options:
serviceKey: XXX
id: aggregate_by_1m
kind: aggregate
options:
interval: 1m
topic: aggregated
id: publish_to_system
kind: publish
options:
topics: [ system ]
To define a handler now you must specify which topic the handler belongs to. For example to define the above aggregate handler on the system topic use this command:
kapacitor define-handler system aggregate_by_1m.yaml
For more details on the alerting system see the full documentation here.
- #1396: Fix broken ENV var config overrides for the kubernetes section.
- #1397: Update default configuration file to include sections for each discoverer service.
- #1379: Copy batch points slice before modification, fixes potential panics and data corruption.
- #1394: Use the Prometheus metric name as the measurement name by default for scrape data.
- #1392: Fix possible deadlock for scraper configuration updating.
- #1369: Fix panic with concurrent writes to same points in state tracking nodes.
- #1387: static-discovery configuration simplified
- #1378: Fix panic in InfluxQL node with missing field.
- #1370: Fix missing working_cardinality stats on stateDuration and stateCount nodes.
- #1299: Allowing sensu handler to be specified
- #1284: Add type signatures to Kapacitor functions.
- #1203: Add
isPresent
operator for verifying whether a value is present (part of #1284). - #1354: Add Kubernetes scraping support.
- #1359: Add groupBy exclude and Add dropOriginalFieldName to flatten.
- #1360: Add KapacitorLoopback node to be able to send data from a task back into Kapacitor.
- #1329: BREAKING: A bug was fixed around missing fields in the derivative node. The behavior of the node changes slightly in order to provide a consistent fix to the bug. The breaking change is that now, the time of the points returned are from the right hand or current point time, instead of the left hand or previous point time.
- #1353: Fix panic in scraping TargetManager.
- #1238: Use ProxyFromEnvironment for all outgoing HTTP traffic.
- #117: Add headers to alert POST requests.
- #1294: Fix bug where batch queries would be missing all fields after the first nil field.
- #1343: BREAKING: The UDF agent Go API has changed, the changes now make it so that the agent package is self contained.
- #1322: TLS configuration in Slack service for Mattermost compatibility
- #1330: Generic HTTP Post node
- #1159: Go version 1.7.4 -> 1.7.5
- #1175: BREAKING: Add generic error counters to every node type.
Renamed
query_errors
toerrors
in batch node. Renamedeval_errors
toerrors
in eval node. - #922: Expose server specific information in alert templates.
- #1162: Add Pushover integration.
- #1221: Add
working_cardinality
stat to each node type that tracks the number of groups per node. - #1211: Add StateDuration node.
- #1209: BREAKING: Refactor the Alerting service. The change is completely breaking for the technical preview alerting service, a.k.a. the new alert topic handler features. The change boils down to simplifying how you define and interact with topics. Alert handlers now only ever have a single action and belong to a single topic. An automatic migration from old to new handler definitions will be performed during startup. See the updated API docs.
- #1286: Default HipChat URL should be blank
- #507: Add API endpoint for performing Kapacitor database backups.
- #1132: Adding source for sensu alert as parameter
- #1346: Add discovery and scraping services.
- #1133: Fix case-sensitivity for Telegram
parseMode
value. - #1147: Fix pprof debug endpoint
- #1164: Fix hang in config API to update a config section. Now if the service update process takes too long the request will timeout and return an error. Previously the request would block forever.
- #1165: Make the alerta auth token prefix configurable and default it to Bearer.
- #1184: Fix logrotate file to correctly rotate error log.
- #1200: Fix bug with alert duration being incorrect after restoring alert state.
- #1199: BREAKING: Fix inconsistency with JSON data from alerts.
The alert handlers Alerta, Log, OpsGenie, PagerDuty, Post and VictorOps allow extra opaque data to be attached to alert notifications.
That opaque data was inconsistent and this change fixes that.
Depending on how that data was consumed this could result in a breaking change, since the original behavior was inconsistent
we decided it would be best to fix the issue now and make it consistent for all future builds.
Specifically in the JSON result data the old key
Series
is alwaysseries
, and the old keyErr
is now alwayserror
instead of for only some of the outputs. - #1181: Fix bug parsing dbrp values with quotes.
- #1228: Fix panic on loading replay files without a file extension.
- #1192: Fix bug in Default Node not updating batch tags and groupID. Also empty string on a tag value is now a sufficient condition for the default conditions to be applied. See #1233 for more information.
- #1068: Fix dot view syntax to use xlabels and not create invalid quotes.
- #1295: Fix curruption of recordings list after deleting all recordings.
- #1237: Fix missing "vars" key when listing tasks.
- #1271: Fix bug where aggregates would not be able to change type.
- #1261: Fix panic when the process cannot stat the data dir.
- #1323: Fix issue where credentials to InfluxDB could not be updated dynamically.
A new system for working with alerts has been introduced. This alerting system allows you to configure topics for alert events and then configure handlers for various topics. This way alert generation is decoupled from alert handling.
Existing TICKscripts will continue to work without modification.
To use this new alerting system remove any explicit alert handlers from your TICKscript and specify a topic. Then configure the handlers for the topic.
stream
|from()
.measurement('cpu')
.groupBy('host')
|alert()
// Specify the topic for the alert
.topic('cpu')
.info(lambda: "value" > 60)
.warn(lambda: "value" > 70)
.crit(lambda: "value" > 80)
// No handlers are configured in the script, they are instead defined on the topic via the API.
The API exposes endpoints to query the state of each alert and endpoints for configuring alert handlers. See the API docs for more details. The kapacitor CLI has been updated with commands for defining alert handlers.
This release introduces a new feature where you can window based off the number of points instead of their time. For example:
stream
|from()
.measurement('my-measurement')
// Emit window for every 10 points with 100 points per window.
|window()
.periodCount(100)
.everyCount(10)
|mean('value')
|alert()
.crit(lambda: "mean" > 100)
.slack()
.channel('#alerts')
With this change alert nodes will have an anonymous topic created for them. This topic is managed like all other topics preserving state etc. across restarts. As a result existing alert nodes will now remember the state of alerts after restarts and disiabling/enabling a task.
NOTE: The new alerting features are being released under technical preview. This means breaking changes may be made in later releases until the feature is considered complete. See the API docs on technical preview for specifics of how this effects the API.
- #1110: Add new query property for aligning group by intervals to start times.
- #1095: Add new alert API, with support for configuring handlers and topics.
- #1052: Move alerta api token to header and add option to skip TLS verification.
- #929: Add SNMP trap service for alerting.
- #913: Add fillPeriod option to Window node, so that the first emit waits till the period has elapsed before emitting.
- #898: Now when the Window node every value is zero, the window will be emitted immediately for each new point.
- #744: Preserve alert state across restarts and disable/enable actions.
- #327: You can now window based on count in addition to time.
- #251: Enable markdown in slack attachments.
- #1100: Fix issue with the Union node buffering more points than necessary.
- #1087: Fix panic during close of failed startup when connecting to InfluxDB.
- #1045: Fix panic during replays.
- #1043: logrotate.d ignores kapacitor configuration due to bad file mode.
- #872: Fix panic during failed aggregate results.
No changes to Kapacitor, only upgrading to go 1.7.4 for security patches.
New K8sAutoscale node that allows you to auotmatically scale Kubernetes deployments driven by any metrics Kapacitor consumes.
For example, to scale a deployment myapp
based off requests per second:
// The target requests per second per host
var target = 100.0
stream
|from()
.measurement('requests')
.where(lambda: "deployment" == 'myapp')
// Compute the moving average of the last 5 minutes
|movingAverage('requests', 5*60)
.as('mean_requests_per_second')
|k8sAutoscale()
.resourceName('app')
.kind('deployments')
.min(4)
.max(100)
// Compute the desired number of replicas based on target.
.replicas(lambda: int(ceil("mean_requests_per_second" / target)))
New API endpoints have been added to be able to configure InfluxDB clusters and alert handlers dynamically without needing to restart the Kapacitor daemon. Along with the ability to dynamically configure a service, API endpoints have been added to test the configurable services. See the API docs for more details.
NOTE: The
connect_errors
stat from the query node was removed since the client changed, all errors are now counted in thequery_errors
stat.
- #931: Add a Kubernetes autoscaler node. You can now autoscale your Kubernetes deployments via Kapacitor.
- #928: Add new API endpoint for dynamically overriding sections of the configuration.
- #980: Upgrade to using go 1.7
- #957: Add API endpoints for testing service integrations.
- #958: Add support for Slack icon emojis and custom usernames.
- #991: Bring Kapacitor up to parity with available InfluxQL functions in 1.1
- #984: Fix bug where keeping a list of fields that where not referenced in the eval expressions would cause an error.
- #955: Fix the number of subscriptions statistic.
- #999: Fix inconsistency with InfluxDB by adding config option to set a default retention policy.
- #1018: Sort and dynamically adjust column width in CLI output. Fixes #785
- #1019: Adds missing strLength function.
- #951: Fix bug where errors to save cluster/server ID files were ignored.
- #954: Create data_dir on startup if it does not exist.
- #873: Add TCP alert handler
- #869: Add ability to set alert message as a field
- #854: Add
.create
property to InfluxDBOut node, which when set will create the database and retention policy on task start. - #909: Allow duration / duration in TICKscript.
- #777: Add support for string manipulation functions.
- #886: Add ability to set specific HTTP port and hostname per configured InfluxDB cluster.
- #889: Some typo in the default config file
- #914: Change |log() output to be in JSON format so its self documenting structure.
- #915: Fix issue with TMax and the Holt-Winters method.
- #927: Fix bug with TMax and group by time.
Final release of v1.0.0.
- #842: Fix side-effecting modification in batch WhereNode.
- #827: Bring Kapacitor up to parity with available InfluxQL functions in 1.0
- #763: Fix NaNs begin returned from the
sigma
stateful function. - #468: Fix tickfmt munging escaped slashes in regexes.
Kapacitor now supports alert reset expressions. This way when an alert enters a state, it can only be lowered in severity if its reset expression evaluates to true.
Example:
stream
|from()
.measurement('cpu')
.where(lambda: "host" == 'serverA')
.groupBy('host')
|alert()
.info(lambda: "value" > 60)
.infoReset(lambda: "value" < 50)
.warn(lambda: "value" > 70)
.warnReset(lambda: "value" < 60)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 70)
For example given the following values:
61 73 64 85 62 56 47
The corresponding alert states are:
INFO WARNING WARNING CRITICAL INFO INFO OK
- #740: Support reset expressions to prevent an alert from being lowered in severity. Thanks @minhdanh!
- #670: Add ability to supress OK recovery alert events.
- #804: Add API endpoint for refreshing subscriptions. Also fixes issue where subs were not relinked if the sub was deleted. UDP listen ports are closed when a database is dropped.
- #783: Fix panic when revoking tokens not already defined.
- #784: Fix several issues with comment formatting in TICKscript.
- #786: Deleting tags now updates the group by dimensions if needed.
- #772: Delete task snapshot data when a task is deleted.
- #797: Fix panic from race condition in task master.
- #811: Fix bug where subscriptions + tokens would not work with more than one InfluxDB cluster.
- #812: Upgrade to use protobuf version 3.0.0
Kapacitor now supports grouping by fields. First convert a field into a tag using the EvalNode. Then group by the new tag.
Example:
stream
|from()
.measurement('alerts')
// Convert field 'level' to tag.
|eval(lambda: string("level"))
.as('level')
.tags('level')
// Group by new tag 'level'.
|groupBy('alert', 'level')
|...
Note the field level
is now removed from the point since .keep
was not used.
See the docs for more details on how .tags
works.
In companion with being able to create new tags, you can now delete tags or fields.
Example:
stream
|from()
.measurement('alerts')
|delete()
// Remove the field `extra` and tag `uuid` from all points.
.field('extra')
.tag('uuid')
|...
- #702: Add plumbing for authentication backends.
- #624: BREAKING: Add ability to GroupBy fields. First use EvalNode to create a tag from a field and then group by the new tag. Also allows for grouping by measurement. The breaking change is that the group ID format has changed to allow for the measurement name.
- #759: Add mechanism for token based subscription auth.
- #745: Add if function for tick script, for example:
if("value" > 6, 1, 2)
.
- #710: Fix infinite loop when parsing unterminated regex in TICKscript.
- #711: Fix where database name with quotes breaks subscription startup logic.
- #719: Fix panic on replay.
- #723: BREAKING: Search for valid configuration on startup in ~/.kapacitor and /etc/kapacitor/. This is so that the -config CLI flag is not required if the configuration is found in a standard location. The configuration file being used is always logged to STDERR.
- #298: BREAKING: Change alert level evaluation so each level is independent and not required to be a subset of the previous level. The breaking change is that expression evaluation order changed. As a result stateful expressions that relied on that order are broken.
- #749: Fix issue with tasks with empty DAG.
- #718: Fix broken extra expressions for deadman's switch.
- #752: Fix various bugs relating to the
fill
operation on a JoinNode. Fill with batches and fill when using theon
property were broken. Also changes the DefaultNode set defaults for nil fields.
- #662: Add
-skipVerify
flag tokapacitor
CLI tool to skip SSL verification. - #680: Add Telegram Alerting option, thanks @burdandrei!
- #46: Can now create combinations of points within the same stream. This is kind of like join but instead joining a stream with itself.
- #669: Add size function for humanize byte size. thanks @jsvisa!
- #697: Can now flatten a set of points into a single points creating dynamcially named fields.
- #698: Join delimiter can be specified.
- #695: Bash completion filters by enabled disabled status. Thanks @bbczeuz!
- #706: Package UDF agents
- #707: Add size field to BeginBatch struct of UDF protocol. Provides hint as to size of incoming batch.
- #656: Fix issues where an expression could not be passed as a function parameter in TICKscript.
- #627: Fix where InfluxQL functions that returned a batch could drop tags.
- #674: Fix panic with Join On and batches.
- #665: BREAKING: Fix file mode not being correct for Alert.Log files. Breaking change is that integers numbers prefixed with a 0 in TICKscript are interpreted as octal numbers.
- #667: Align deadman timestamps to interval.
- #636: Change HTTP logs to be in Common Log format.
- #652: Add optional replay ID to the task API so that you can get information about a task inside a running replay.
- #621: Fix obscure error about single vs double quotes.
- #623: Fix issues with recording metadata missing data url.
- #631: Fix issues with using iterative lambda expressions in an EvalNode.
- #628: BREAKING: Change
kapacitord config
to not search default location for configuration files but rather require the-config
option. Since thekapacitord run
command behaves this way they should be consistent. Fix issue withkapacitord config > kapacitor.conf
when the output file was a default location for the config. - #626: Fix issues when changing the ID of an enabled task.
- #624: Fix issues where you could get a read error on a closed UDF socket.
- #651: Fix issues where an error during a batch replay would hang because the task wouldn't stop.
- #650: BREAKING: The default retention policy name was changed to
autogen
in InfluxDB. This changes Kapacitor to useautogen
for the default retention policy for the stats. You may need to update your task DBRPs to useautogen
instead ofdefault
.
The ability to create and use template tasks has been added. you can define a template for a task and reuse that template across multiple tasks.
A simple example:
// Which measurement to consume
var measurement string
// Optional where filter
var where_filter = lambda: TRUE
// Optional list of group by dimensions
var groups = [*]
// Which field to process
var field string
// Warning criteria, has access to 'mean' field
var warn lambda
// Critical criteria, has access to 'mean' field
var crit lambda
// How much data to window
var window = 5m
// The slack channel for alerts
var slack_channel = '#alerts'
stream
|from()
.measurement(measurement)
.where(where_filter)
.groupBy(groups)
|window()
.period(window)
.every(window)
|mean(field)
|alert()
.warn(warn)
.crit(crit)
.slack()
.channel(slack_channel)
Then you can define the template like so:
kapacitor define-template generic_mean_alert -tick path/to/above/script.tick -type stream
Next define a task that uses the template:
kapacitor define cpu_alert -template generic_mean_alert -vars cpu_vars.json -dbrp telegraf.default
Where cpu_vars.json
would like like this:
{
"measurement": {"type" : "string", "value" : "cpu" },
"where_filter": {"type": "lambda", "value": "\"cpu\" == 'cpu-total'"},
"groups": {"type": "list", "value": [{"type":"string", "value":"host"},{"type":"string", "value":"dc"}]},
"field": {"type" : "string", "value" : "usage_idle" },
"warn": {"type" : "lambda", "value" : " \"mean\" < 30.0" },
"crit": {"type" : "lambda", "value" : " \"mean\" < 10.0" },
"window": {"type" : "duration", "value" : "1m" },
"slack_channel": {"type" : "string", "value" : "#alerts_testing" }
}
With this release you can now replay data directly against a task from InfluxDB without having to first create a recording.
Replay the queries defined in the batch task cpu_alert
for the past 10 hours.
kapacitor replay-live batch -task cpu_alert -past 10h
Or for a stream task with use a query directly:
kapacitor replay-live query -task cpu_alert -query 'SELECT usage_idle FROM telegraf."default".cpu WHERE time > now() - 10h'
Now InfluxDB and Kapacitor support HTTP/S based subscriptions. This means that Kapacitor need only listen on a single port for the HTTP service, greatly simplifying configuration and setup.
In order to start using HTTP subscriptions change the subscription-protocol
option for your configured InfluxDB clusters.
For example:
[[influxdb]]
enabled = true
urls = ["http://localhost:8086",]
subscription-protocol = "http"
# or to use https
#subscription-protocol = "https"
On startup Kapacitor will detect the change and recreate the subscriptions in InfluxDB to use the HTTP protocol.
NOTE: While HTTP itself is a TCP transport such that packet loss shouldn't be an issue, if Kapacitor starts to slow down for whatever reason, InfluxDB will drop the subscription writes to Kapacitor. In order to know if subscription writes are being dropped you should monitor the measurement
_internal.monitor.subscriber
for the fieldwriteFailures
.
This release contains an new Holt Winters InfluxQL function.
With this forecasting method one can now define an alert based off forecasted future values.
For example, the following TICKscript will take the last 30 days of disk usage stats and using holt-winters forecast the next 7 days. If the forecasted value crosses a threshold an alert is triggered.
The result is now Kapacitor will alert you 7 days in advance of a disk filling up. This assumes a slow growth but by changing the vars in the script you could check for shorter growth intervals.
// The interval on which to aggregate the disk usage
var growth_interval = 1d
// The number of `growth_interval`s to forecast into the future
var forecast_count = 7
// The amount of historical data to use for the fit
var history = 30d
// The critical threshold on used_percent
var threshold = 90.0
batch
|query('''
SELECT max(used_percent) as used_percent
FROM "telegraf"."default"."disk"
''')
.period(history)
.every(growth_interval)
.align()
.groupBy(time(growth_interval), *)
|holtWinters('used_percent', forecast_count, 0, growth_interval)
.as('used_percent')
|max('used_percent')
.as('used_percent')
|alert()
// Trigger alert if the forecasted disk usage is greater than threshold
.crit(lambda: "used_percent" > threshold)
- #283: Add live replays.
- #500: Support Float,Integer,String and Boolean types.
- #82: Multiple services for PagerDuty alert. thanks @savagegus!
- #558: Preserve fields as well as tags on selector InfluxQL functions.
- #259: Template Tasks have been added.
- #562: HTTP based subscriptions.
- #595: Support counting and summing empty batches to 0.
- #596: Support new group by time offset i.e. time(30s, 5s)
- #416: Track ingress counts by database, retention policy, and measurement. Expose stats via cli.
- #586: Add spread stateful function. thanks @upccup!
- #600: Add close http response after handler laert post, thanks @jsvisa!
- #606: Add Holt-Winters forecasting method.
- #605: BREAKING: StatsNode for batch edge now count the number of points in a batch instead of count batches as a whole. This is only breaking if you have a deadman switch configured on a batch edge.
- #611: Adds bash completion to the kapacitor CLI tool.
- #540: Fixes bug with log level API endpoint.
- #521: EvalNode now honors groups.
- #561: Fixes bug when lambda expressions would return error about types with nested binary expressions.
- #555: Fixes bug where "time" functions didn't work in lambda expressions.
- #570: Removes panic in SMTP service on failed close connection.
- #587: Allow number literals without leading zeros.
- #584: Do not block during startup to send usage stats.
- #553: Periodically check if new InfluxDB DBRPs have been created.
- #602: Fix missing To property on email alert handler.
- #581: Record/Replay batch tasks get cluster info from task not API.
- #613: BREAKING: Allow the ID of templates and tasks to be updated via the PATCH method. The breaking change is that now PATCH request return a 200 with the template or task definition, where before they returned 204.
Breaking changes may require special upgrade steps from versions <= 0.12, please read the 0.13.0 release notes
Along with the API changes of 0.13.0, validation logic was added to task IDs, but this was not well documented. This minor release remedies that.
All IDs (tasks, recordings, replays) must match this regex ^[-\._\p{L}0-9]+$
, which is essentially numbers, unicode letters, '-', '.' and '_'.
If you have existing tasks which do not match this pattern they should continue to function normally.
- #545: Fixes inconsistency with API docs for creating a task.
- #544: Fixes issues with existings tasks and invalid names.
- #543: Fixes default values not being set correctly in API calls.
Breaking changes may require special upgrade steps please read below.
Changes to how and where task data is store have been made. In order to safely upgrade to version 0.13 you need to follow these steps:
-
Upgrade InfluxDB to version 0.13 first.
-
Update all TICKscripts to use the new
|
and@
operators. Once Kapacitor no longer issues anyDEPRECATION
warnings you are ready to begin the upgrade. The upgrade will work without this step but tasks using the old syntax cannot be enabled, until modified to use the new syntax. -
Upgrade the Kapacitor binary/package.
-
Configure new database location. By default the location
/var/lib/kapacitor/kapacitor.db
is chosen for package installs or./kapacitor.db
for manual installs. Do not remove the configuration for the location of the old task.db database file since it is still needed to do the migration.[storage] boltdb = "/var/lib/kapacitor/kapacitor.db"
-
Restart Kapacitor. At this point Kapacitor will migrate all existing data to the new database file. If any errors occur Kapacitor will log them and fail to startup. This way if Kapacitor starts up you can be sure the migration was a success and can continue normal operation. The old database is opened in read only mode so that existing data cannot be corrupted. Its recommended to start Kapacitor in debug logging mode for the migration so you can follow the details of the migration process.
At this point you may remove the configuration for the old task
dir
and restart Kapacitor to ensure everything is working.
Kapacitor will attempt the migration on every startup while the old configuration and db file exist, but will skip any data that was already migrated.
With this release the API has been updated to what we believe will be the stable version for a 1.0 release. Small changes may still be made but the significant work to create a RESTful HTTP API is complete. Many breaking changes introduced, see the client/API.md doc for details on how the API works now.
Along with the API changes, breaking changes where also made to the kapacitor
CLI command.
Here is a break down of the CLI changes:
- Every thing has an ID now: tasks, recordings, even replays.
The
name
used before to define a task is now itsID
. As such instead of using-name
and-id
to refer to tasks and recordings, the flags have been changed to-task
and-recording
accordingly. - Replays can be listed and deleted like tasks and recordings.
- Replays default to
fast
clock mode. - The record and replay commands now have a
-no-wait
option to start but not wait for the recording/replay to complete. - Listing recordings and replays displays the status of the respective action.
- Record and Replay command now have an optional flag
-replay-id
/-recording-id
to specify the ID of the replay or recording. If not set then a random ID will be chosen like the previous behavior.
UDF can now be managed externally to Kapacitor via Unix sockets. A process or container can be launched independent of Kapacitor exposing a socket. On startup Kapacitor will connect to the socket and begin communication.
Example UDF config for a socket based UDF.
[udf]
[udf.functions]
[udf.functions.myCustomUDF]
socket = "/path/to/socket"
timeout = "10s"
Alert data can now be consumed directly from within TICKscripts.
For example, let's say we want to store all data that triggered an alert in InfluxDB with a tag level
containing the level string value (i.e CRITICAL).
...
|alert()
.warn(...)
.crit(...)
.levelTag('level')
// and/or use a field
//.levelField('level')
// Also tag the data with the alert ID
.idTag('id')
// and/or use a field
//.idField('id')
|influxDBOut()
.database('alerts')
...
- #360: Forking tasks by measurement in order to improve performance
- #386: Adds official Go HTTP client package.
- #399: Allow disabling of subscriptions.
- #417: UDFs can be connected over a Unix socket. This enables UDFs from across Docker containers.
- #451: StreamNode supports
|groupBy
and|where
methods. - #93: AlertNode now outputs data to child nodes. The output data can have either a tag or field indicating the alert level.
- #281: AlertNode now has an
.all()
property that specifies that all points in a batch must match the criteria in order to trigger an alert. - #384: Add
elapsed
function to compute the time difference between subsequent points. - #230: Alert.StateChangesOnly now accepts optional duration arg. An alert will be triggered for every interval even if the state has not changed.
- #426: Add
skip-format
query parameter to theGET /task
endpoint so that returned TICKscript content is left unmodified from the user input. - #388: The duration of an alert is now tracked and exposed as part of the alert data as well as can be set as a field via
.durationField('duration')
. - #486: Default config file location.
- #461: Make Alerta
event
property configurable. - #491: BREAKING: Rewriting stateful expression in order to improve performance, the only breaking change is: short circuit evaluation for booleans - for example:
lambda: "bool_value" && (count() > 100)
if "bool_value" is false, we won't evaluate "count". - #504: BREAKING: Many changes to the API and underlying storage system. This release requires a special upgrade process.
- #511: Adds DefaultNode for providing default values for missing fields or tags.
- #285: Track created,modified and last enabled dates on tasks.
- #533: Add useful statistics for nodes.
- #499: Fix panic in InfluxQL nodes if field is missing or incorrect type.
- #441: Fix panic in UDF code.
- #429: BREAKING: Change TICKscript parser to be left-associative on equal precedence operators. For example previously this statement
(1+2-3*4/5)
was evaluated as(1+(2-(3*(4/5))))
which is not the typical/expected behavior. Now using left-associative parsing the statement is evaluated as((1+2)-((3*4)/5))
. - #456: Fixes Alerta integration to let server set status, fix
rawData
attribute and set default severity toindeterminate
. - #425: BREAKING: Preserving tags on influxql simple selectors - first, last, max, min, percentile
- #423: Recording stream queries with group by now correctly saves data in time order not group by order.
- #331: Fix panic when missing
.as()
for JoinNode. - #523: JoinNode will now emit join sets as soon as they are ready. If multiple joinable sets arrive in the same tolerance window than each will be emitted (previously the first points were dropped).
- #537: Fix panic in alert node when batch is empty.
New TICKscript syntax that uses a different operators for chaining methods vs property methods vs UDF methods.
- A chaining method is a method that creates a new node in the pipeline. Uses the
|
operator. - A property method is a method that changes a property on a node. Uses the
.
operator. - A UDF method is a method that calls out to a UDF. Uses the
@
operator.
For example below the from
, mean
, and alert
methods create new nodes,
the detectAnomalies
method calls a UDF,
and the other methods modify the nodes as property methods.
stream
|from()
.measurement('cpu')
.where(lambda: "cpu" == 'cpu-total')
|mean('usage_idle')
.as('value')
@detectAnomalies()
.field('mean')
|alert()
.crit(lambda: "anomaly_score" > 10)
.log('/tmp/cpu.log')
With this change a new binary is provided with Kapacitor tickfmt
which will
format a TICKscript file according to a common standard.
- #299: Changes TICKscript chaining method operators and adds
tickfmt
binary. - #389: Adds benchmarks to Kapacitor for basic use cases.
- #390: BREAKING: Remove old
.mapReduce
functions. - #381: Adding enable/disable/delete/reload tasks by glob.
- #401: Add
.align()
property to BatchNode so you can align query start and stop times.
- #378: Fix issue where derivative would divide by zero.
- #387: Add
.quiet()
option to EvalNode so errors can be suppressed if expected. - #400: All query/connection errors are counted and reported in BatchNode stats.
- #412: Fix issues with batch queries dropping points because of nil fields.
- #413: Allow disambiguation between ".groupBy" and "|groupBy".
Kapacitor is now using the functions from the new query engine in InfluxDB core.
Along with this change is a change in the TICKscript API so that using the InfluxQL functions is easier.
Simply call the desired method directly no need to call .mapReduce
explicitly.
This change now hides the mapReduce aspect and handles it internally.
Using .mapReduce
is officially deprecated in this release and will be remove in the next major release.
We feel that this change improves the readability of TICKscripts and exposes less implementation details
to the end user.
Updating your exising TICKscripts is simple.
If previously you had code like this:
stream.from()...
.window()...
.mapReduce(influxql.count('value'))
then update it to look like this:
stream.from()...
.window()...
.count('value')
a simple regex could fix all your existing scripts.
Kapacitor now exposes more internal metrics for determining the performance of a given task.
The internal statistics includes a new measurement named node
that contains any stats a node provides, tagged by the task, node, task type and kind of node (i.e. window vs union).
All nodes provide an averaged execution time for the node.
These stats are also available in the DOT output of the Kapacitor show command.
Significant performance improvements have also been added. In some cases Kapacitor throughput has improved by 4X.
Kapacitor can now connect to different InfluxDB clusters.
Multiple InfluxDB config sections can be defined and one will be marked as default.
To upgrade convert an influxdb
config.
From this:
[influxdb]
enabled = true
...
to this:
[[influxdb]]
enabled = true
default = true
name = "localhost"
...
Various improvements to joining features have been implemented. With #144 you can now join streams with differing group by dimensions.
If you previously configured Email, Slack or HipChat globally now you must also set the state-changes-only
option to true as well if you want to preserve the original behavior.
For example:
[slack]
enable = true
global = true
state-changes-only = true
- #236: Implement batched group by
- #231: Add ShiftNode so values can be shifted in time for joining/comparisons.
- #190: BREAKING: Deadman's switch now triggers off emitted counts and is grouped by to original grouping of the data.
The breaking change is that the 'collected' stat is no longer output for
.stats
and has been replaced byemitted
. - #145: The InfluxDB Out Node now writes data to InfluxDB in buffers.
- #215: Add performance metrics to nodes for average execution times and node throughput values.
- #144: Can now join streams with differing dimensions using the join.On property.
- #249: Can now use InfluxQL functions directly instead of via the MapReduce method. Example
stream.from().count()
. - #233: BREAKING: Now you can use multiple InfluxDB clusters. The config changes to make this possible are breaking. See notes above for changes.
- #302: Can now use .Time in alert message.
- #239: Support more detailed TLS config when connecting to an InfluxDB host.
- #323: Stats for task execution are provided via JSON HTTP request instead of just DOT string. thanks @yosiat
- #358: Improved logging. Adds LogNode so any data in a pipeline can be logged.
- #366: HttpOutNode now allows chaining methods.
- #199: BREAKING: Various fixes for the Alerta integration.
The
event
property has been removed from the Alerta node and is now set as the value of the alert ID. - #232: Better error message for alert integrations. Better error message for VictorOps 404 response.
- #231: Fix window logic when there were gaps in the data stream longer than window every value.
- #213: Add SourceStreamNode so that yuou must always first call
.from
on thestream
object before filtering it, so as to not create confusing to understand TICKscripts. - #255: Add OPTIONS handler for task delete method so it can be preflighted.
- #258: Fix UDP internal metrics, change subscriptions to use clusterID.
- #240: BREAKING: Fix issues with Sensu integration. The breaking change is that the config no longer takes a
url
but rather ahost
option since the communication is raw TCP rather HTTP. - #270: The HTTP server will now gracefully stop.
- #300: Add OPTIONS method to /recording endpoint for deletes.
- #304: Fix panic if recording query but do not have an InfluxDB instance configured
- #289: Add better error handling to batch node.
- #142: Fixes bug when defining multiple influxdb hosts.
- #266: Fixes error log for HipChat that is not an error.
- #333: Fixes hang when replaying with .stats node. Fixes issues with batch and stats.
- #340: BREAKING: Decouples global setting for alert handlers from the state changes only setting.
- #348: config.go: refactor to simplify structure and fix support for array elements
- #362: Fix bug with join tolerance and batches.
This is a bug fix release that fixes many issues releated to the recent 0.10.0 release. The few additional features are focused on usability improvements from recent feedback.
Improved UDFs, lots of bug fixes and improvements on the API. There was a breaking change for UDFs protobuf messages, see #176.
There was a breaking change to the define
command, see #173 below.
- #176: BREAKING: Improved UDFs and groups. Now it is easy to deal with groups from the UDF process. There is a breaking change in the BeginBatch protobuf message for this change.
- #196: Adds a 'details' property to the alert node so that the email body can be defined. See also #75.
- #132: Make is so multiple calls to
where
simplyAND
expressions together instead of replacing or creating extra nodes in the pipeline. - #173: BREAKING: Added a
-no-reload
flag to the define command in the CLI. Now if the task is enabled define will automatically reload it unless-no-reload
is passed. - #194: Adds Talk integration for alerts. Thanks @wutaizeng!
- #320: Upgrade to go 1.6
- #177: Fix panic for show command on batch tasks.
- #185: Fix panic in define command with invalid dbrp value.
- #195: Fix panic in where node.
- #208: Add default stats dbrp to default subscription excludes.
- #203: Fix hang when deleteing invalid batch task.
- #182: Fix missing/incorrect Content-Type headers for various HTTP endpoints.
- #187: Retry connecting to InfluxDB on startup for up to 5 minutes by default.
This release marks the next major release of Kapacitor. With this release you can now run your own custom code for processing data within Kapacitor. See udf/agent/README.md for more details.
With the addition of UDFs it is now possible to run custom anomaly detection alogrithms suited to your needs. There are simple examples of how to use UDFs in udf/agent/examples.
The version has jumped significantly so that it is inline with other projects in the TICK stack. This way you can easily tell which versions of Telegraf, InfluxDB, Chronograf and Kapacitor work together.
See note on a breaking change in the HTTP API below. #163
- #137: Add deadman's switch. Can be setup via TICKscript and globally via configuration.
- #72: Add support for User Defined Functions (UDFs).
- #139: Alerta.io support thanks! @md14454
- #85: Sensu support using JIT clients. Thanks @sstarcher!
- #141: Time of day expressions for silencing alerts.
- #153: Fix panic if referencing non existant field in MapReduce function.
- #138: Change over to influxdata github org.
- #164: Update imports etc from InfluxDB as per the new meta store/client changes.
- #163: BREAKING CHANGE: Removed the 'api/v1' pathing from the HTTP API so that Kapacitor is path compatible with InfluxDB. While this is a breaking change the kapacitor cli has been updated accordingly and you will not experience any distruptions unless you were calling the HTTP API directly.
- #147: Compress .tar archives from builds.
- #118: Can now define multiple handlers of the same type on an AlertNode.
- #119: HipChat support thanks! @ericiles *2
- #113: OpsGenie support thanks! @ericiles
- #107: Enable TICKscript variables to be defined and then referenced from lambda expressions. Also fixes various bugs around using regexes.
- #124: Fix panic where there is an error starting a task.
- #122: Fixes panic when using WhereNode.
- #128: Fix not sending emails when using recipient list from config.
Bugfix #106 made a breaking change to the internal HTTP API. This was to facilitate integration testing and overall better design. Now POSTing a recording request will start the recording and immediately return. If you want to wait till it is complete do a GET for the recording info and it will block until its complete. The kapacitor cli has been updated accordingly.
- #96: Use KAPACITOR_URL env var for setting the kapacitord url in the client.
- #109: Add throughput counts to DOT format in
kapacitor show
command, if task is executing.
- #102: Fix race when start/stoping timeTicker in batch.go
- #106: Fix hang when replaying stream recording.
Some bug fixes including one that cause Kapacitor to deadlock.
- #83: Use enterprise usage client, remove deprecated enterprise register and reporting features.
- #86: Fix dealock form errors in tasks. Also fixes issue where task failures did not get logged.
- #95: Fix race in bolt usage when starting enabled tasks at startup.
Major public release.