Rolling timeseries (blue-yonder#170)

* Added a rolling parameter to the extract function and the normalize function, to roll out time series in time in both directions
* Added documentation in the code on the new rolling feature
* Fixed the normalize test and added a test case for rolling
* Increased coverage
* Added text documentation for the new feature
* Included a warning if the time is not uniformly sampled. For this I had to move the id check before the sort check
* Added some formulas to the documentation
* Only enable the test when rolling is enabled...
* Factored out the rolling into a new function
* Fixed documentation for the new function
* Forgot to upload some changes
xiehaizheng · Mar 25, 2017 · 8cd6a4c
1 parent 33236bd
Showing 6 changed files with 476 additions and 16 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -27,6 +27,7 @@ The following chapters will explain the tsfresh package in detail:
    Feature Filtering <text/feature_filtering>
    How to write custom Feature Calculators <text/how_to_add_custom_feature>
    Parallelization <text/parallelization>
+   How to handle rolling time series <text/rolling>
    FAQ <text/faq>
    Authors <authors>
    License <license>

diff --git a/docs/text/faq.rst b/docs/text/faq.rst
@@ -1,8 +1,14 @@
 FAQ
-=================
+===
 
 
     1. *Does tsfresh support different time series lengths?*
        Yes, it supports different time series lengths. However, some feature calculators can demand a minimal length
        of the time series. If a shorter time series is passed to the calculator, normally a NaN is returned.
 
+
+    2. *Is it possible to extract features from rolling/shifted time series?*
+       Yes, there is the option `rolling` for the :func:`tsfresh.feature_extraction.extract_features` function.
+       Set it to a non-zero value to enable rolling. At the moment, this just rolls the input data into
+       as many time series as there are time steps - so there is no internal optimization for rolling calculations.
+       Please see :ref:`rolling-label` for more information.
diff --git a/docs/text/rolling.rst b/docs/text/rolling.rst
@@ -0,0 +1,167 @@
+.. _rolling-label:
+
+How to handle rolling time series
+=================================
+
+In many real-world applications of time series, the "time" column
+(we will call it time in the following, although it can be anything)
+gives a certain sequential order to the data. We can exploit this sequence to generate
+more input data out of a single time series by *rolling* over the data.
+
+Imagine the following situation: you have data from EEG measurements that
+you want to use to classify patients as healthy or not healthy (we oversimplify the problem here).
+You have e.g. 100 time steps of data, so you can extract features that may forecast the healthiness
+of the patients. But what would happen if you had only the recorded measurement for 50 time steps?
+The patients would be as healthy as with 100 time steps. So you can easily increase the amount of
+training data by reusing time series cut into smaller pieces.
+
+Another example is streaming data, e.g. in Industry 4.0 applications. Here you typically get one
+new data row at a time and use this to predict machine failures, for example. To train your model,
+you could act as if you were streaming the data, by feeding your classifier the data after one time step,
+the data after the first two time steps, etc.
+
+Both examples imply that you extract the features not only on the full data set, but also
+on all temporally coherent subsets of the data, which is the process of *rolling*. You can do this easily
+by calling the function :func:`tsfresh.utilities.dataframe_functions.roll_time_series`.
+
+The rolling mechanism takes a time series :math:`x` with its data rows :math:`[x_1, x_2, x_3, ..., x_n]`
+and creates :math:`n` new time series :math:`\hat x^k`, each of them with a different consecutive part
+of :math:`x`:
+
+.. math::
+    \hat x^k = [x_k, x_{k-1}, x_{k-2}, ..., x_1]
+
+To see what this does in real-world applications, we look into the following example data frame (we show only one possible data format,
+but rolling works on all 3 data formats :ref:`data-formats-label`):
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+| 2  | t8   | 10 | 12 |
++----+------+----+----+
+| 2  | t9   | 11 | 13 |
++----+------+----+----+
+
+where you have measured two values (x and y) for two different entities (1 and 2) over 4 and 2 time steps, respectively.
+
+If you set `rolling` to 0, the feature extraction works on
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+
+and
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 2  | t8   | 10 | 12 |
++----+------+----+----+
+| 2  | t9   | 11 | 13 |
++----+------+----+----+
+
+So it extracts 2 sets of features.
+
+If you set `rolling` to 1, the feature extraction works with all of the following time series:
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 2  | t8   | 10 | 12 |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+| 2  | t8   | 10 | 12 |
++----+------+----+----+
+| 2  | t9   | 11 | 13 |
++----+------+----+----+
+
+If you set `rolling` to -1, you end up with features for the time series rolled in the other direction:
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+| 2  | t9   | 11 | 13 |
++----+------+----+----+
+
++----+------+----+----+
+| id | time | x  | y  |
++====+======+====+====+
+| 1  | t1   | 1  | 5  |
++----+------+----+----+
+| 1  | t2   | 2  | 6  |
++----+------+----+----+
+| 1  | t3   | 3  | 7  |
++----+------+----+----+
+| 1  | t4   | 4  | 8  |
++----+------+----+----+
+| 2  | t8   | 10 | 12 |
++----+------+----+----+
+| 2  | t9   | 11 | 13 |
++----+------+----+----+
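
The tables in `docs/text/rolling.rst` above can be reproduced with a few lines of pandas. The following is only a minimal sketch of the rolling semantics, not the implementation added by this commit: the function name `roll_sketch` is hypothetical, its parameters merely mirror the names used in the tests, and the `"id=..., shift=..."` labels follow the format the tests in this commit expect.

```python
import pandas as pd


def roll_sketch(df, column_id, column_sort, rolling_direction):
    """Illustrative rolling over a flat data frame (hypothetical helper,
    not tsfresh's roll_time_series)."""
    if rolling_direction == 0:
        raise ValueError("rolling_direction must be non-zero")
    sign = 1 if rolling_direction > 0 else -1

    df = df.sort_values([column_id, column_sort])
    max_length = df.groupby(column_id).size().max()

    pieces = []
    # Largest shift first, so the shortest sub-series come first.
    for shift in range(max_length - 1, -1, -1):
        for entity_id, group in df.groupby(column_id):
            keep = len(group) - shift
            if keep <= 0:
                continue  # this entity is too short for the current shift
            # Forward rolling keeps the earliest rows, backward the latest.
            part = group.head(keep) if sign > 0 else group.tail(keep)
            part = part.copy()
            part[column_id] = "id={}, shift={}".format(entity_id, sign * shift)
            pieces.append(part)
    return pd.concat(pieces, ignore_index=True)


# Same example data as in the documentation tables (x, y) and the tests (a, b).
first = pd.DataFrame({"id": 1, "time": range(4), "a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
second = pd.DataFrame({"id": 2, "time": range(20, 22), "a": [10, 11], "b": [12, 13]})
df_full = pd.concat([first, second], ignore_index=True)

rolled_forward = roll_sketch(df_full, "id", "time", rolling_direction=1)
rolled_backward = roll_sketch(df_full, "id", "time", rolling_direction=-1)
```

Each distinct label in the `id` column of the result corresponds to one sub-series, so a subsequent feature extraction grouped by that column would yield one feature row per sub-series.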
diff --git a/tests/utilities/test_dataframe_functions.py b/tests/utilities/test_dataframe_functions.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 # This file as well as the whole tsfresh package are licenced under the MIT licence (see the LICENCE.txt)
 # Maximilian Christ (maximilianchrist.com), Blue Yonder Gmbh, 2016
-
+import warnings
 from unittest import TestCase
 
 import pandas as pd
@@ -157,6 +157,178 @@ def test_with_wrong_input(self):
         self.assertRaises(ValueError, dataframe_functions.normalize_input_to_internal_representation, test_df,
                           "id", None, None, "value")
 
+        test_df = pd.DataFrame([{"id": 0, "value": np.nan}])
+        self.assertRaises(ValueError, dataframe_functions.normalize_input_to_internal_representation, test_df,
+                          None, None, None, "value")
+
+
+class RollingTestCase(TestCase):
+    def test_with_wrong_input(self):
+        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": np.nan}])
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id="id",
+                          column_sort="sort", column_kind="kind",
+                          rolling_direction=1)
+
+        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": 1}])
+        self.assertRaises(AttributeError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id="strange_id",
+                          column_sort="sort", column_kind="kind",
+                          rolling_direction=1)
+
+        test_df = {"a": pd.DataFrame([{"id": 0}])}
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id="id",
+                          column_sort=None, column_kind="kind",
+                          rolling_direction=1)
+
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id=None,
+                          column_sort=None, column_kind="kind",
+                          rolling_direction=1)
+
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id="id",
+                          column_sort=None, column_kind=None,
+                          rolling_direction=0)
+
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id=None,
+                          column_sort=None, column_kind=None,
+                          rolling_direction=0)
+
+    def test_single_row(self):
+        test_df = pd.DataFrame([{"id": np.nan, "kind": "a", "value": 3, "sort": 1}])
+        dataframe_functions.roll_time_series(
+            df_or_dict=test_df, column_id="id",
+            column_sort="sort", column_kind="kind",
+            rolling_direction=1)
+
+    def test_positive_rolling(self):
+        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
+        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})
+
+        first_class["id"] = 1
+        second_class["id"] = 2
+
+        df_full = pd.concat([first_class, second_class], ignore_index=True)
+
+        df = dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
+                                                  column_kind=None, rolling_direction=1)
+
+        correct_indices = (["id=1, shift=3"] * 1 +
+                           ["id=1, shift=2"] * 2 +
+                           ["id=1, shift=1"] * 3 +
+                           ["id=2, shift=1"] * 1 +
+                           ["id=1, shift=0"] * 4 +
+                           ["id=2, shift=0"] * 2)
+
+        self.assertListEqual(list(df["id"]), correct_indices)
+
+        self.assertListEqual(list(df["a"].values),
+                             [1, 1, 2, 1, 2, 3, 10, 1, 2, 3, 4, 10, 11])
+        self.assertListEqual(list(df["b"].values),
+                             [5, 5, 6, 5, 6, 7, 12, 5, 6, 7, 8, 12, 13])
+
+    def test_negative_rolling(self):
+        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
+        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})
+
+        first_class["id"] = 1
+        second_class["id"] = 2
+
+        df_full = pd.concat([first_class, second_class], ignore_index=True)
+
+        df = dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
+                                                  column_kind=None, rolling_direction=-1)
+
+        correct_indices = (["id=1, shift=-3"] * 1 +
+                           ["id=1, shift=-2"] * 2 +
+                           ["id=1, shift=-1"] * 3 +
+                           ["id=2, shift=-1"] * 1 +
+                           ["id=1, shift=0"] * 4 +
+                           ["id=2, shift=0"] * 2)
+
+        self.assertListEqual(list(df["id"].values), correct_indices)
+
+        self.assertListEqual(list(df["a"].values),
+                             [4, 3, 4, 2, 3, 4, 11, 1, 2, 3, 4, 10, 11])
+        self.assertListEqual(list(df["b"].values),
+                             [8, 7, 8, 6, 7, 8, 13, 5, 6, 7, 8, 12, 13])
+
+    def test_stacked_rolling(self):
+        first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})
+        second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})
+
+        first_class["id"] = 1
+        second_class["id"] = 2
+
+        df_full = pd.concat([first_class, second_class], ignore_index=True)
+
+        df_stacked = pd.concat([df_full[["time", "id", "a"]].rename(columns={"a": "_value"}),
+                                df_full[["time", "id", "b"]].rename(columns={"b": "_value"})], ignore_index=True)
+        df_stacked["kind"] = ["a"] * 6 + ["b"] * 6
+
+        df = dataframe_functions.roll_time_series(df_stacked, column_id="id", column_sort="time",
+                                                  column_kind="kind", rolling_direction=-1)
+
+        correct_indices = (["id=1, shift=-3"] * 2 +
+                           ["id=1, shift=-2"] * 4 +
+                           ["id=1, shift=-1"] * 6 +
+                           ["id=2, shift=-1"] * 2 +
+                           ["id=1, shift=0"] * 8 +
+                           ["id=2, shift=0"] * 4)
+
+        self.assertListEqual(list(df["id"].values), correct_indices)
+
+        self.assertListEqual(list(df["kind"].values), ["a", "b"] * 13)
+        self.assertListEqual(list(df["_value"].values),
+                             [4, 8, 3, 7, 4, 8, 2, 6, 3, 7, 4, 8, 11, 13, 1, 5, 2, 6, 3, 7, 4, 8, 10, 12, 11, 13])
+
+    def test_dict_rolling(self):
+        df_dict = {
+            "a": pd.DataFrame({"_value": [1, 2, 3, 4, 10, 11], "id": [1, 1, 1, 1, 2, 2]}),
+            "b": pd.DataFrame({"_value": [5, 6, 7, 8, 12, 13], "id": [1, 1, 1, 1, 2, 2]})
+        }
+
+        df = dataframe_functions.roll_time_series(df_dict, column_id="id", column_sort=None,
+                                                  column_kind=None, rolling_direction=-1)
+
+        correct_indices = (["id=1, shift=-3"] * 1 +
+                           ["id=1, shift=-2"] * 2 +
+                           ["id=1, shift=-1"] * 3 +
+                           ["id=2, shift=-1"] * 1 +
+                           ["id=1, shift=0"] * 4 +
+                           ["id=2, shift=0"] * 2)
+
+        self.assertListEqual(list(df["a"]["id"].values), correct_indices)
+        self.assertListEqual(list(df["b"]["id"].values), correct_indices)
+
+        self.assertListEqual(list(df["a"]["_value"].values),
+                             [4, 3, 4, 2, 3, 4, 11, 1, 2, 3, 4, 10, 11])
+        self.assertListEqual(list(df["b"]["_value"].values),
+                             [8, 7, 8, 6, 7, 8, 13, 5, 6, 7, 8, 12, 13])
+
+
+
+    def test_warning_on_non_uniform_time_steps(self):
+        with warnings.catch_warnings(record=True) as w:
+            first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": [1, 2, 4, 5]})
+            second_class = pd.DataFrame({"a": [10, 11], "b": [12, 13], "time": range(20, 22)})
+
+            first_class["id"] = 1
+            second_class["id"] = 2
+
+            df_full = pd.concat([first_class, second_class], ignore_index=True)
+
+            dataframe_functions.roll_time_series(df_full, column_id="id", column_sort="time",
+                                                 column_kind=None, rolling_direction=1)
+
+            self.assertEqual(len(w), 1)
+            self.assertEqual(str(w[0].message),
+                             "Your time stamps are not uniformly sampled, which makes rolling "
+                             "nonsensical in some domains.")
+
 
 class CheckForNanTestCase(TestCase):
     def test_all_columns(self):
@@ -284,6 +456,11 @@ def test_restrict_dict(self):
         self.assertTrue(kind_to_df_restricted2['a'].equals(kind_to_df['a']))
         self.assertTrue(kind_to_df_restricted2['b'].equals(kind_to_df['b']))
 
+    def test_restrict_wrong(self):
+        other_type = np.array([1, 2, 3])
+
+        self.assertRaises(TypeError, dataframe_functions.restrict_input_to_index, other_type, "id", [1, 2, 3])
+
 
 class GetRangeValuesPerColumnTestCase(TestCase):
     def test_ignores_non_finite_values(self):
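
The test `test_warning_on_non_uniform_time_steps` in the diff above expects exactly one warning when the time column of an entity is not evenly spaced. How the commit performs this check is not visible in this part of the diff; the sketch below shows one possible way to do it, assuming a flat data frame. The helper name `warn_if_not_uniform` is hypothetical; only the warning text is taken from the test.

```python
import warnings

import numpy as np
import pandas as pd


def warn_if_not_uniform(df, column_id, column_sort):
    """Warn once if any entity has unevenly spaced time stamps.

    Hypothetical helper for illustration; returns True if a warning was issued.
    """
    for _, group in df.groupby(column_id):
        # Differences between consecutive (sorted) time stamps of one entity.
        steps = np.diff(np.sort(group[column_sort].to_numpy()))
        if len(steps) > 1 and not np.all(steps == steps[0]):
            warnings.warn("Your time stamps are not uniformly sampled, which makes rolling "
                          "nonsensical in some domains.")
            return True
    return False


# Entity 1 has a gap between time 2 and time 4; entity 2 is uniform.
uneven = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2],
                       "time": [1, 2, 4, 5, 20, 21],
                       "a": [1, 2, 3, 4, 10, 11]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_not_uniform(uneven, "id", "time")
```

Checking the step sizes per entity, rather than over the whole column, avoids a false alarm from the jump between the last time stamp of one entity and the first of the next.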