Datasets
In snorkel, data is subdivided into datasets, and each dataset is made up of samples. Every field in a sample has one of three types: integer, string or set (of strings). Snorkel uses these types to decide how each field is treated in the UI and to populate the view controls with the relevant fields.
In general, string fields are meant for GROUP BY queries, integer fields are used for aggregations, and set fields are used for filtering.
NOTE: every sample must have a time field holding seconds since the epoch (or an equivalent field that can be used as a timestamp).
For example, a sample describing a page load might look like:
{
  integer: {
    dom_load: 300,
    dns_lookup: 20,
    dom_complete: 900,
    resources_loaded: 30,
    time: <TIMESTAMP> // put your timestamp of seconds since epoch here
  },
  string: {
    page: "/home",
    user_id: "12912",
    network: "DSL",
    country: "USA",
    browser_family: "firefox",
    browser_major: "23",
    os_family: "Windows"
  },
  set: {
    perf_experiments: [ "socket_delivery", "XHR_chunks", "pipelined_delivery" ]
  }
}
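The three field types and the required time field can be checked before a sample is sent. The following is a minimal sketch, not part of snorkel: `validateSample` is a hypothetical helper, and it assumes the timestamp lives in `integer.time` as in the example above.

```javascript
// Hypothetical helper (not a snorkel API): returns a list of problems
// found in a sample, or an empty array if the sample looks well formed.
function validateSample(sample) {
  const errors = [];
  const integers = sample.integer || {};
  const strings = sample.string || {};
  const sets = sample.set || {};

  // Integer fields must hold whole numbers.
  for (const [name, value] of Object.entries(integers)) {
    if (!Number.isInteger(value)) {
      errors.push(`integer.${name} is not an integer`);
    }
  }
  // String fields must hold strings, even for numeric-looking IDs.
  for (const [name, value] of Object.entries(strings)) {
    if (typeof value !== "string") {
      errors.push(`string.${name} is not a string`);
    }
  }
  // Set fields must hold arrays of strings.
  for (const [name, value] of Object.entries(sets)) {
    const ok = Array.isArray(value) && value.every((v) => typeof v === "string");
    if (!ok) {
      errors.push(`set.${name} is not an array of strings`);
    }
  }
  // Assumption: the required timestamp is stored as integer.time.
  if (!("time" in integers)) {
    errors.push("integer.time is missing");
  }
  return errors;
}

const sample = {
  integer: {
    dom_load: 300,
    // Date.now() returns milliseconds, so divide to get seconds since the epoch.
    time: Math.floor(Date.now() / 1000)
  },
  string: { page: "/home", browser_family: "firefox" },
  set: { perf_experiments: ["socket_delivery", "XHR_chunks"] }
};

console.log(validateSample(sample)); // prints an empty array for a well-formed sample
```

Note that values such as `user_id` and `browser_major` are kept as strings in the example above, which suits fields that are grouped on rather than aggregated.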
Suppose you are monitoring the performance or load of your machines. An example data schema might look like:
{
  integer: {
    free_ram: 288888,
    load_avg: 10, // out of 100
    time: <TIMESTAMP>,
    requests_per_second: 50,
    avg_request_delay: 100 // 100ms delay
  },
  string: {
    cluster: "data-center-03",
    region: "NW",
    machine_id: "dc03-027"
  },
  set: {
    services: ["nagios", "cacti", "ganglion"]
  }
}
This schema lets you GROUP BY cluster, region or machine_id, and calculate the AVG, SUM and COUNT of the various integer fields.
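To make the roles of the three field types concrete, here is a rough in-memory sketch of a GROUP BY with a set filter. It only illustrates the idea and is not how snorkel runs queries; `groupBy`, `filterBySet`, the extra machine IDs and the timestamps are all made up for the example.

```javascript
// Illustrative only: groups samples by a string field and computes
// COUNT, SUM and AVG of one integer field for each group.
function groupBy(samples, stringField, integerField) {
  const groups = new Map();
  for (const sample of samples) {
    const key = sample.string[stringField];
    const group = groups.get(key) || { count: 0, sum: 0 };
    group.count += 1;
    group.sum += sample.integer[integerField];
    groups.set(key, group);
  }
  const rows = [];
  for (const [key, { count, sum }] of groups) {
    rows.push({ [stringField]: key, count, sum, avg: sum / count });
  }
  return rows;
}

// Keeps only the samples whose set field contains the given value.
function filterBySet(samples, setField, value) {
  return samples.filter((s) => (s.set[setField] || []).includes(value));
}

// Made-up samples following the machine-monitoring schema above.
const samples = [
  {
    integer: { load_avg: 10, time: 1700000000 },
    string: { cluster: "data-center-03", machine_id: "dc03-027" },
    set: { services: ["nagios", "cacti"] }
  },
  {
    integer: { load_avg: 30, time: 1700000060 },
    string: { cluster: "data-center-03", machine_id: "dc03-028" },
    set: { services: ["nagios"] }
  },
  {
    integer: { load_avg: 50, time: 1700000120 },
    string: { cluster: "data-center-04", machine_id: "dc04-001" },
    set: { services: ["ganglion"] }
  }
];

// Filter on the set field, group on a string field, aggregate an integer field.
const rows = groupBy(filterBySet(samples, "services", "nagios"), "cluster", "load_avg");
console.log(rows);
// one row: cluster "data-center-03", count 2, sum 40, avg 20
```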