Rolling timeseries (blue-yonder#170)
* Added a rolling parameter to the extract function and the normalize function, to roll out time series in time in both directions

* Added documentation in the code on the new rolling feature

* Fixed the normalize test and added a test case for rolling

* Increase coverage

* Added text documentation for the new feature

* Included a warning if the time is not uniformly sampled. For this I had to move the id check before the sort check

* Added some formulas to the docu

* Do only enable the test when rolling is enabled...

* Factored out the rolling into a new function

* Fixed documentation for the new function

* Forgot to upload some changes
nils-braun authored and MaxBenChrist committed Mar 25, 2017
1 parent 33236bd commit 8cd6a4c
Showing 6 changed files with 476 additions and 16 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -27,6 +27,7 @@ The following chapters will explain the tsfresh package in detail:
Feature Filtering <text/feature_filtering>
How to write custom Feature Calculators <text/how_to_add_custom_feature>
Parallelization <text/parallelization>
How to handle rolling time series <text/rolling>
FAQ <text/faq>
Authors <authors>
License <license>
8 changes: 7 additions & 1 deletion docs/text/faq.rst
@@ -1,8 +1,14 @@
FAQ
=================
===


1. *Does tsfresh support different time series lengths?*
Yes, it supports different time series lengths. However, some feature calculators can demand a minimal length
of the time series. If a shorter time series is passed to the calculator, normally a NaN is returned.


2. *Is it possible to extract features from rolling/shifted time series?*
Yes, there is the option `rolling` for the :func:`tsfresh.feature_extraction.extract_features` function.
Set it to a non-zero value to enable rolling. At the moment, this simply rolls the input data into
as many time series as there are time steps, so there is no internal optimization for rolling calculations.
Please see :ref:`rolling-label` for more information.
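
To get a rough feeling for the cost of this unoptimized approach, here is a back-of-the-envelope sketch.
It assumes, as described in :ref:`rolling-label`, that a series with n time steps is rolled into n sub-series
of lengths 1 to n. The helper name ``rolled_row_count`` is made up for this illustration and is not part of tsfresh.

```python
def rolled_row_count(series_lengths):
    """Total number of rows after rolling series of the given lengths.

    A series of length n contributes 1 + 2 + ... + n = n * (n + 1) / 2 rows.
    """
    return sum(n * (n + 1) // 2 for n in series_lengths)


# Two series with 4 and 2 time steps, as in the tests of this commit.
example_total = rolled_row_count([4, 2])

# A single, longer series shows the quadratic growth.
long_total = rolled_row_count([1000])
```

For series with 4 and 2 time steps this gives 13 rows, and the count grows quadratically with the series length,
which is worth keeping in mind before rolling long time series.
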
167 changes: 167 additions & 0 deletions docs/text/rolling.rst
@@ -0,0 +1,167 @@
.. _rolling-label:

How to handle rolling time series
=================================

In many applications of time series to real-world problems, the "time" column
(we will call it time in the following, although it can be anything)
gives a certain sequential order to the data. We can exploit this sequence to generate
more input data out of a single time series by *rolling* over the data.

Imagine the following situation: you have EEG measurements that
you want to use to classify patients as healthy or not healthy (we oversimplify the problem here).
You have, e.g., 100 time steps of data, so you can extract features that may predict the health
of the patients. But what would happen if you had only the recorded measurements for 50 time steps?
The patients would be just as healthy as with 100 time steps. So you can easily increase the amount of
training data by reusing the time series cut into smaller pieces.

Another example is streaming data, e.g. in Industry 4.0 applications. Here you typically get one
new data row at a time and use it, for example, to predict machine failures. To train your model,
you can act as if you were streaming the data, by feeding your classifier the data after the first time step,
the data after the first two time steps, etc.

Both examples imply that you extract the features not only on the full data set, but also
on all temporally coherent subsets of the data, which is the process of *rolling*. You can do this easily
by calling the function :func:`tsfresh.utilities.dataframe_functions.roll_time_series`.

The rolling mechanism takes a time series :math:`x` with its data rows :math:`[x_1, x_2, x_3, ..., x_n]`
and creates :math:`n` new time series :math:`\hat x^k`, each of them with a different consecutive part
of :math:`x`:

.. math::

    \hat x^k = [x_k, x_{k-1}, x_{k-2}, ..., x_1]

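
As a rough illustration of this formula, the following plain-Python sketch builds the sub-series for a short list.
It is not tsfresh code: ``rolled_sub_series`` is a hypothetical helper, and the reversed order inside each
sub-series only mirrors the notation of the formula.

```python
def rolled_sub_series(x):
    """Return [x_hat^1, ..., x_hat^n] with x_hat^k = [x_k, x_{k-1}, ..., x_1]."""
    # x[:k] holds the first k rows; [::-1] writes them newest-first as in the formula.
    return [x[:k][::-1] for k in range(1, len(x) + 1)]


series = ["x1", "x2", "x3"]
sub_series = rolled_sub_series(series)
```

Each sub-series contains one more row than the previous one, so a series with n rows yields n sub-series.
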

To see what this does in real-world applications, we look at the following example data frame (we show only one possible data format,
but rolling works on all 3 data formats, see :ref:`data-formats-label`):

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+
| 2  | t8   | 10 | 12 |
+----+------+----+----+
| 2  | t9   | 11 | 13 |
+----+------+----+----+

where you have measured two values (x and y) for two different entities (ids 1 and 2) over 4 and 2 time steps, respectively.

If you set `rolling` to 0, the feature extraction works on

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+

and

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 2  | t8   | 10 | 12 |
+----+------+----+----+
| 2  | t9   | 11 | 13 |
+----+------+----+----+

So it extracts 2 sets of features.

If you set rolling to 1, the feature extraction works with all of the following time series:

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 2  | t8   | 10 | 12 |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+
| 2  | t8   | 10 | 12 |
+----+------+----+----+
| 2  | t9   | 11 | 13 |
+----+------+----+----+
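
The forward rolling shown in these tables can be sketched in plain Python. This is only an illustration under
stated assumptions, not the tsfresh implementation: ``roll_forward`` is a hypothetical helper, the rows are
plain ``(id, time, x, y)`` tuples with the times t1, t2, ... written as integers, and the labels follow the
``"id=..., shift=..."`` pattern that the tests of this commit expect in the id column.

```python
from collections import OrderedDict


def roll_forward(rows):
    """Roll (id, time, x, y) rows forward in time.

    Returns a list of (label, sub_rows) pairs. Each sub-series keeps the
    first rows of one id and drops `shift` rows from its end.
    """
    by_id = OrderedDict()
    for row in rows:
        by_id.setdefault(row[0], []).append(row)

    # Sort each group by its time column so "first rows" is well defined.
    for group in by_id.values():
        group.sort(key=lambda row: row[1])

    longest = max(len(group) for group in by_id.values())
    rolled = []
    # Largest shift first, so the shortest sub-series come first.
    for shift in range(longest - 1, -1, -1):
        for entity, group in by_id.items():
            if len(group) > shift:
                label = "id={}, shift={}".format(entity, shift)
                rolled.append((label, group[:len(group) - shift]))
    return rolled


rows = [
    (1, 1, 1, 5), (1, 2, 2, 6), (1, 3, 3, 7), (1, 4, 4, 8),
    (2, 8, 10, 12), (2, 9, 11, 13),
]
rolled = roll_forward(rows)
labels = [label for label, _ in rolled]
x_values = [row[2] for _, sub_rows in rolled for row in sub_rows]
```

Every table above corresponds to one shift value: the shorter entity (id 2) only appears once the shift is
smaller than its number of time steps.
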

If you set rolling to -1, you end up with features for the time series rolled in the other direction:

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t4   | 4  | 8  |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+
| 2  | t9   | 11 | 13 |
+----+------+----+----+

+----+------+----+----+
| id | time | x  | y  |
+====+======+====+====+
| 1  | t1   | 1  | 5  |
+----+------+----+----+
| 1  | t2   | 2  | 6  |
+----+------+----+----+
| 1  | t3   | 3  | 7  |
+----+------+----+----+
| 1  | t4   | 4  | 8  |
+----+------+----+----+
| 2  | t8   | 10 | 12 |
+----+------+----+----+
| 2  | t9   | 11 | 13 |
+----+------+----+----+
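
The backward direction can be sketched in the same spirit. Again, this is an illustration under stated
assumptions rather than the tsfresh implementation: ``roll_backward`` is a hypothetical helper working on plain
``(id, time, x, y)`` tuples, with times written as integers and labels in the ``"id=..., shift=..."`` form used
by the tests of this commit.

```python
from collections import OrderedDict


def roll_backward(rows):
    """Roll (id, time, x, y) rows backward in time.

    Returns a list of (label, sub_rows) pairs. Each sub-series keeps the
    last rows of one id and drops `shift` rows from its beginning.
    """
    by_id = OrderedDict()
    for row in rows:
        by_id.setdefault(row[0], []).append(row)

    # Sort each group by its time column so "last rows" is well defined.
    for group in by_id.values():
        group.sort(key=lambda row: row[1])

    longest = max(len(group) for group in by_id.values())
    rolled = []
    for shift in range(longest - 1, -1, -1):
        for entity, group in by_id.items():
            if len(group) > shift:
                # Negative shifts mark the backward direction; -0 prints as 0.
                label = "id={}, shift={}".format(entity, -shift)
                rolled.append((label, group[shift:]))
    return rolled


rows = [
    (1, 1, 1, 5), (1, 2, 2, 6), (1, 3, 3, 7), (1, 4, 4, 8),
    (2, 8, 10, 12), (2, 9, 11, 13),
]
rolled = roll_backward(rows)
labels = [label for label, _ in rolled]
x_values = [row[2] for _, sub_rows in rolled for row in sub_rows]
```

Compared with forward rolling, only the slice changes: the sub-series now grow from the end of each time
series towards its beginning.
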
179 changes: 178 additions & 1 deletion tests/utilities/test_dataframe_functions.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
# This file as well as the whole tsfresh package are licenced under the MIT licence (see the LICENCE.txt)
# Maximilian Christ (maximilianchrist.com), Blue Yonder Gmbh, 2016

import warnings
from unittest import TestCase

import pandas as pd
@@ -157,6 +157,178 @@ def test_with_wrong_input(self):
        self.assertRaises(ValueError, dataframe_functions.normalize_input_to_internal_representation, test_df,
                          "id", None, None, "value")

        test_df = pd.DataFrame([{"id": 0, "value": np.NaN}])
        self.assertRaises(ValueError, dataframe_functions.normalize_input_to_internal_representation, test_df,
                          None, None, None, "value")


class RollingTestCase(TestCase):
    def test_with_wrong_input(self):
        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": np.NaN}])
        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id="id",
                          column_sort="sort", column_kind="kind",
                          rolling_direction=1)

        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": 1}])
        self.assertRaises(AttributeError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id="strange_id",
                          column_sort="sort", column_kind="kind",
                          rolling_direction=1)

        test_df = {"a": pd.DataFrame([{"id": 0}])}
        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id="id",
                          column_sort=None, column_kind="kind",
                          rolling_direction=1)

        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id=None,
                          column_sort=None, column_kind="kind",
                          rolling_direction=1)

        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id="id",
                          column_sort=None, column_kind=None,
                          rolling_direction=0)

        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                          df_or_dict=test_df, column_id=None,
                          column_sort=None, column_kind=None,
                          rolling_direction=0)

    def test_single_row(self):
        test_df = pd.DataFrame([{"id": np.NaN, "kind": "a", "value": 3, "sort": 1}])
        dataframe_functions.roll_time_series(
            df_or_dict=test_df, column_id="id",
            column_sort="sort", column_kind="kind",
            rolling_direction=1)

    def test_positive_rolling(self):
        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})

        first_class["id"] = 1
        second_class["id"] = 2

        df_full = pd.concat([first_class, second_class], ignore_index=True)

        df = dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
                                                  column_kind=None, rolling_direction=1)

        correct_indices = (["id=1, shift=3"] * 1 +
                           ["id=1, shift=2"] * 2 +
                           ["id=1, shift=1"] * 3 +
                           ["id=2, shift=1"] * 1 +
                           ["id=1, shift=0"] * 4 +
                           ["id=2, shift=0"] * 2)

        self.assertListEqual(list(df["id"]), correct_indices)

        self.assertListEqual(list(df["a"].values),
                             [1, 1, 2, 1, 2, 3, 10, 1, 2, 3, 4, 10, 11])
        self.assertListEqual(list(df["b"].values),
                             [5, 5, 6, 5, 6, 7, 12, 5, 6, 7, 8, 12, 13])

    def test_negative_rolling(self):
        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})

        first_class["id"] = 1
        second_class["id"] = 2

        df_full = pd.concat([first_class, second_class], ignore_index=True)

        df = dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
                                                  column_kind=None, rolling_direction=-1)

        correct_indices = (["id=1, shift=-3"] * 1 +
                           ["id=1, shift=-2"] * 2 +
                           ["id=1, shift=-1"] * 3 +
                           ["id=2, shift=-1"] * 1 +
                           ["id=1, shift=0"] * 4 +
                           ["id=2, shift=0"] * 2)

        self.assertListEqual(list(df["id"].values), correct_indices)

        self.assertListEqual(list(df["a"].values),
                             [4, 3, 4, 2, 3, 4, 11, 1, 2, 3, 4, 10, 11])
        self.assertListEqual(list(df["b"].values),
                             [8, 7, 8, 6, 7, 8, 13, 5, 6, 7, 8, 12, 13])

    def test_stacked_rolling(self):
        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})

        first_class["id"] = 1
        second_class["id"] = 2

        df_full = pd.concat([first_class, second_class], ignore_index=True)

        df_stacked = pd.concat([df_full[["time", "id", "a"]].rename(columns={"a": "_value"}),
                                df_full[["time", "id", "b"]].rename(columns={"b": "_value"})], ignore_index=True)
        df_stacked["kind"] = ["a"] * 6 + ["b"] * 6

        df = dataframe_functions.roll_time_series(df_stacked, column_id="id", column_sort="time",
                                                  column_kind="kind", rolling_direction=-1)

        correct_indices = (["id=1, shift=-3"] * 2 +
                           ["id=1, shift=-2"] * 4 +
                           ["id=1, shift=-1"] * 6 +
                           ["id=2, shift=-1"] * 2 +
                           ["id=1, shift=0"] * 8 +
                           ["id=2, shift=0"] * 4)

        self.assertListEqual(list(df["id"].values), correct_indices)

        self.assertListEqual(list(df["kind"].values), ["a", "b"] * 13)
        self.assertListEqual(list(df["_value"].values),
                             [4, 8, 3, 7, 4, 8, 2, 6, 3, 7, 4, 8, 11, 13, 1, 5, 2, 6, 3, 7, 4, 8, 10, 12, 11, 13])

    def test_dict_rolling(self):
        df_dict = {
            "a": pd.DataFrame({"_value": [1, 2, 3, 4, 10, 11], "id": [1, 1, 1, 1, 2, 2]}),
            "b": pd.DataFrame({"_value": [5, 6, 7, 8, 12, 13], "id": [1, 1, 1, 1, 2, 2]})
        }

        df = dataframe_functions.roll_time_series(df_dict, column_id="id", column_sort=None,
                                                  column_kind=None, rolling_direction=-1)

        correct_indices = (["id=1, shift=-3"] * 1 +
                           ["id=1, shift=-2"] * 2 +
                           ["id=1, shift=-1"] * 3 +
                           ["id=2, shift=-1"] * 1 +
                           ["id=1, shift=0"] * 4 +
                           ["id=2, shift=0"] * 2)

        self.assertListEqual(list(df["a"]["id"].values), correct_indices)
        self.assertListEqual(list(df["b"]["id"].values), correct_indices)

        self.assertListEqual(list(df["a"]["_value"].values),
                             [4, 3, 4, 2, 3, 4, 11, 1, 2, 3, 4, 10, 11])
        self.assertListEqual(list(df["b"]["_value"].values),
                             [8, 7, 8, 6, 7, 8, 13, 5, 6, 7, 8, 12, 13])



    def test_warning_on_non_uniform_time_steps(self):
        with warnings.catch_warnings(record=True) as w:
            first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": [1, 2, 4, 5]})
            second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})

            first_class["id"] = 1
            second_class["id"] = 2

            df_full = pd.concat([first_class, second_class], ignore_index=True)

            dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
                                                 column_kind=None, rolling_direction=1)

            self.assertEqual(len(w), 1)
            self.assertEqual(str(w[0].message),
                             "Your time stamps are not uniformly sampled, which makes rolling "
                             "nonsensical in some domains.")


class CheckForNanTestCase(TestCase):
    def test_all_columns(self):
@@ -284,6 +456,11 @@ def test_restrict_dict(self):
        self.assertTrue(kind_to_df_restricted2['a'].equals(kind_to_df['a']))
        self.assertTrue(kind_to_df_restricted2['b'].equals(kind_to_df['b']))

    def test_restrict_wrong(self):
        other_type = np.array([1, 2, 3])

        self.assertRaises(TypeError, dataframe_functions.restrict_input_to_index, other_type, "id", [1, 2, 3])


class GetRangeValuesPerColumnTestCase(TestCase):
    def test_ignores_non_finite_values(self):