forked from wangle1218/Kaggle_Competition_Treasure
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
251 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,251 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"toc": true | ||
}, | ||
"source": [ | ||
"<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n", | ||
"<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#时间序列可视化\" data-toc-modified-id=\"时间序列可视化-1\"><span class=\"toc-item-num\">1 </span>时间序列可视化</a></span><ul class=\"toc-item\"><li><span><a href=\"#ts对象\" data-toc-modified-id=\"ts对象-1.1\"><span class=\"toc-item-num\">1.1 </span>ts对象</a></span></li><li><span><a href=\"#时序绘图(数据初探)\" data-toc-modified-id=\"时序绘图(数据初探)-1.2\"><span class=\"toc-item-num\">1.2 </span>时序绘图(数据初探)</a></span></li><li><span><a href=\"#时序模式(数据模式观测)\" data-toc-modified-id=\"时序模式(数据模式观测)-1.3\"><span class=\"toc-item-num\">1.3 </span>时序模式(数据模式观测)</a></span></li><li><span><a href=\"#季节性绘图\" data-toc-modified-id=\"季节性绘图-1.4\"><span class=\"toc-item-num\">1.4 </span>季节性绘图</a></span></li><li><span><a href=\"#季节性子图绘制\" data-toc-modified-id=\"季节性子图绘制-1.5\"><span class=\"toc-item-num\">1.5 </span>季节性子图绘制</a></span></li><li><span><a href=\"#散点图\" data-toc-modified-id=\"散点图-1.6\"><span class=\"toc-item-num\">1.6 </span>散点图</a></span></li><li><span><a href=\"#Lag图\" data-toc-modified-id=\"Lag图-1.7\"><span class=\"toc-item-num\">1.7 </span>Lag图</a></span></li><li><span><a href=\"#自相关性\" data-toc-modified-id=\"自相关性-1.8\"><span class=\"toc-item-num\">1.8 </span>自相关性</a></span></li><li><span><a href=\"#白噪音\" data-toc-modified-id=\"白噪音-1.9\"><span class=\"toc-item-num\">1.9 </span>白噪音</a></span></li><li><span><a href=\"#深入阅读\" data-toc-modified-id=\"深入阅读-1.10\"><span class=\"toc-item-num\">1.10 </span>深入阅读</a></span></li></ul></li></ul></div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 时间序列可视化\n", | ||
"数据分析最重要的一步是做EDA,对数据进行探索与分析,而这一步中数据可视化的作用不言而喻,通过合理的可视化技术,我们可以发现我们数据的周期性,随时间变化的规律,以及变量之间的相关性等,方便我们更好的理解数据同时可以辅助我们构建更好的模型.这一章我们会重点探讨对于时间序列的可视化技术. <br /> <br />\n", | ||
"\n", | ||
"<font color=red> **因为该本书的作者是使用R语言进行建模的,此处我就只介绍重点,需要看哪些量,并且截图来解释,而不会介绍R中相应的函数. **</font>\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## ts对象\n", | ||
"<font color=blue>**时间序列的理解之一**</font>:时间序列可以认为是一列数字或者数组,其中每一个元素都是在特定时间所记录,包含有在特定时间的信息.\n", | ||
"\n", | ||
"所谓的ts Object是R语言中用来处理时间序列的一个工具包." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 时序绘图(数据初探)\n", | ||
"<font color=blue>**绘图1:全局**</font>:先对整个时间序列进行绘制,给自己一个大致的感受(就和数据处理中的pd.head()一样),此处我们需要重点观察如下信息:\n", | ||
"\n", | ||
"> 时序是否出现断层,中间是否有一段时间没有记录,调查原因 <br />\n", | ||
"> 序列是否存在较大的波动,是否存在突变<br />\n", | ||
"> 序列是否存在周期性<br />\n", | ||
"> 序列近期表现出什么趋势 <br />\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 时序模式(数据模式观测)\n", | ||
"<font color=blue>**时间序列的几大模式**</font>\n", | ||
"- **趋势**:明显的趋势是当数据存在长期的增长或者下降,不仅仅局限于线性的增长或者下降.也可能是周期的上升等.例如下图:\n", | ||
"\n", | ||
"![](./pic/trend_exp1.png)\n", | ||
"\n", | ||
"- **季节性**: 时序被周期性的模式影响例如年份,或者每周(例如周末超市消费就会较高,周中就偏低),**季节性或者周期性往往是一种固定的已知的频率**.<br /> <br />\n", | ||
"\n", | ||
"- **循环性**:循环性的发生往往指数据不以固定的频率展现出上升以及下降的趋势.这些波动往往受到经济的影响,并且经常和商业周期相关,波动的时间往往较长.<br /> <br />\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<font color=blue>**例子**</font>\n", | ||
"![](./pic/trend_seasonal_cyclic_exp1.png)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"①. 最左上角的图展示了较强的**季节性**,以年为单位,同时还有一些较强的**循环性**(6-10年),但是却没有很强的趋势性. <br />\n", | ||
"②. 最右上角的图片展示了较强的**趋势**,但是却没有展现出**季节性和循环性**. <br />\n", | ||
"③. 最左下角的图片展示了很强的**上升趋势**,同时也伴有**较强的季节性**,循环性则没有显现出来. <br />\n", | ||
"④. 最右下角的图片没有很强的趋势,也没有很强的季节性以及循环性,更像是随机波动,此类数据时最难建模预测的. <br />\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 季节性绘图\n", | ||
"因为我不常用R语言的包,所以就不阐述R语言中的包的使用.此处仅仅阐述怎么看季节性的时序数据.\n", | ||
"\n", | ||
"<font color=blue>**季节性时序数据绘制的特点**</font>:\n", | ||
"因为季节性的数据的周期较为固定,所以我们可以设定周期,然后在同一张表上进行多个周期的绘制.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 季节性子图绘制\n", | ||
"当数据是在每个周期的单独的时间段内收集,针对此类数据的一种绘图方式.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 散点图\n", | ||
"\n", | ||
"\n", | ||
"<font color=red>**上述的图在对单个时间序列进行可视化的时候非常好,但是如果我们需要分析几个时间特征在时间序列上的关系的时候就不太方便**</font>.\n", | ||
"\n", | ||
"<font color=blue>**双变量之间的关系探索**</font>:\n", | ||
"\n", | ||
"- **绘图方式一**:上下绘制两张图片用来近距离比较.例子如下:\n", | ||
"\n", | ||
"![](./pic/two_timeseries_analysis1.png)\n", | ||
"\n", | ||
"\n", | ||
"- **绘图方式二**:在同一张图片上采用散点图的形式进行绘制:\n", | ||
"![](./pic/two_timeseries_analysis2.png)\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"<font color=blue>**多变量(两两变量之间)的关系探索**</font>:当我们变量较多的时候,我们不太可能一一去绘制散点图,效率较为低下,此时我们可以通过成对绘制散点图的形式来进行观测比较,具体的例子如下:\n", | ||
"\n", | ||
"![](./pic/multiple_timeseries_analysis.png)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Lag图\n", | ||
"\n", | ||
"<font color=blue>**Lag图**</font>:Lag图显示了对于不同的$k$值,$y_t$和$y_{t-k}$之间的关系,具体的例子参考如下:\n", | ||
"\n", | ||
"![](./pic/lag_plot_exp.png)\n", | ||
"\n", | ||
"\n", | ||
"**由上图我们可以发现**:\n", | ||
"\n", | ||
"lags4和8的关系是严格正的,在数据上显示了较强的季节性,而lags2和lags6则显示出了较强的negative相关性." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 自相关性\n", | ||
"\n", | ||
"<font color=blue>**时间序列中的自相关系数数学定义**</font>\n", | ||
"\n", | ||
"在时间序列里面存在一些自相关系数,例如,$r_1$可以用来评估$y_t$和$y_{t-1}$之间的关系,$r_2$可以用来评估$y_t$和$y_{t-2}$之间的关系,其中$r_k$的计算公式如下:\n", | ||
"\n", | ||
"![](./pic/autocorrelation.png)\n", | ||
"\n", | ||
"其中$T$为时间序列的长度. <br /> <br />\n", | ||
"\n", | ||
"\n", | ||
"<font color=blue>**例子**</font>\n", | ||
"\n", | ||
"![](./pic/autocorrelation_r.png)\n", | ||
"![](./pic/autocorrelation_pic.png)\n", | ||
" \n", | ||
"在上表中,我们发现:\n", | ||
"\n", | ||
"> ① $r_4$相对较大,可能是因为周期为4的原因. <br />\n", | ||
"> ② $r_2$相对较小,可能是因为周期为4的,而2恰好在峰值之后的中间. <br />\n", | ||
"> ③ 蓝色的虚线表示相关性是否和0严格不同. <br /> <br />\n", | ||
"\n", | ||
" \n", | ||
"<font color=blue>**季节性和趋势给自相关系数带来的影响**</font>\n", | ||
"\n", | ||
"- 当数据是季节性的(周期性的),自相关系数会在季节性的位置(lag=k,2k,3k,...)获得较大的值; <font color=red>**如果我们的时间序列的以$k$为周期的话,那么我们的$r_{k}$值就会较大**</font>. \n", | ||
"\n", | ||
"- 当数据存在趋势,则对于小的lag(例如lag=1,2)的自相关系数就会变得较大,但是当把lag设置大一点的时候,就会得到缓解. <br /> <br />\n", | ||
"\n", | ||
"<font color=blue>**自相关系数的作用**</font> <br />\n", | ||
"自相关系数常常可以被用来回答如下两个问题:\n", | ||
"\n", | ||
"1. Was this sample data set generated from a random process? <br />\n", | ||
"\n", | ||
"2. Would a non-linear or time series model be a more appropriate model for these data than a simple constant plus error model?\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 白噪音\n", | ||
"\n", | ||
"<font color=blue>**白噪音**</font>:当时间序列没有显示出自相关性时,我们将这种时间序列称之为白噪音.\n", | ||
"\n", | ||
"![](./pic/white_noise.png) <br /> <br />\n", | ||
"\n", | ||
"<font color=blue>**白噪音检测**</font> <br />\n", | ||
"\n", | ||
"<font color=red>**关于白噪音的检测,我们认为ACF中有95%的spikes在±$\\frac{2}{\\sqrt{T}}$里面的时间序列称之为白噪声,其中$T$为时间序列的长度.但是如果有一个或者多个的spikes在这两个bounds(蓝色虚线)的外面,或者超过5%的spikes在这些bounds外面,那么这些时间序列可能就不是白噪音. **</font>\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 深入阅读\n", | ||
"1. https://www.youtube.com/watch?v=QDrmpphIfLE\n", | ||
"2. Autocorrelation:http://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm\n", | ||
"3. Cleveland (1993) is a classic book on the principles of visualization for data analysis. While it is more than 20 years old, the ideas are timeless.\n", | ||
"4. Unwin (2015) is a modern introduction to graphical data analysis using R. It does not have much information on time series graphics, but plenty of excellent general advice on using graphics for data analysis." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.4" | ||
}, | ||
"toc": { | ||
"nav_menu": {}, | ||
"number_sections": true, | ||
"sideBar": true, | ||
"skip_h1_title": false, | ||
"title_cell": "Table of Contents", | ||
"title_sidebar": "Contents", | ||
"toc_cell": true, | ||
"toc_position": { | ||
"height": "705px", | ||
"left": "0px", | ||
"right": "1634px", | ||
"top": "67px", | ||
"width": "212px" | ||
}, | ||
"toc_section_display": "block", | ||
"toc_window_display": true | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |