Merge pull request udacity#2 from udacity/develop

Develop
baikeyu · Jun 20, 2018 · 5733e34
2 parents de6b5c1 + 7f22ee4
commit 5733e34
Showing 1 changed file with 150 additions and 57 deletions.
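
The reworked Sections 1 and 2 of the notebook ask for null checks, cleaning, row and column selection, logical indexing, and grouped aggregation. A minimal sketch of those operations is given below. It assumes a small made-up DataFrame in place of `tmdb-movies.csv`: the values are invented and only the column names come from the notebook. It is one possible approach, not the reference solution.

```python
import pandas as pd

# Hypothetical stand-in for tmdb-movies.csv: invented values, real column names.
movie_data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "popularity": [6.2, 1.1, 7.5, 0.4, 5.9, 3.3],
    "revenue": [100.0, 20.0, 300.0, 5.0, 80.0, 40.0],
    "director": ["A", "B", "A", None, "C", "B"],
    "release_year": [1995, 2001, 2010, 2001, 2010, 1995],
})

# Task 1.2: table shape and a per-column null check.
n_rows, n_cols = movie_data.shape
has_null = movie_data.isnull().any()

# Task 1.3: one possible cleaning choice, dropping every row that has a null.
clean = movie_data.dropna().reset_index(drop=True)

# Task 2.1: select columns by name and rows by position.
subset = clean[["id", "popularity"]]
rows = clean.iloc[[0, 1, 4]]  # arbitrary positions chosen for this tiny table

# Task 2.2: logical indexing; each condition needs its own parentheses.
popular = clean[clean["popularity"] > 5]
popular_recent = clean[(clean["popularity"] > 5) & (clean["release_year"] > 1996)]

# Task 2.3: grouped aggregation with .agg.
revenue_by_year = clean.groupby("release_year")["revenue"].agg("mean")
popularity_by_director = (
    clean.groupby("director")["popularity"].agg("mean").sort_values(ascending=False)
)
```

With the real dataset, `movie_data` would come from `pd.read_csv("tmdb-movies.csv")`, and whether `.dropna()` or `.fillna()` is the better choice depends on which columns contain nulls.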
diff --git a/P2_Explore_Movie_Dataset/Explore Movie Dataset.ipynb b/P2_Explore_Movie_Dataset/Explore Movie Dataset.ipynb
@@ -9,11 +9,26 @@
     "In this project, you will apply what you have learned and use functions from the `NumPy`, `Pandas`, `matplotlib`, and `seaborn` libraries to explore a movie dataset.\n",
     "\n",
     "Download the dataset:\n",
-    "[TMDb movie data](https://s3.cn-north-1.amazonaws.com.cn/static-documents/nd101/explore+dataset/tmdb-movies.csv)\n",
-    "\n",
-    "---\n",
+    "[TMDb movie data](https://s3.cn-north-1.amazonaws.com.cn/static-documents/nd101/explore+dataset/tmdb-movies.csv)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "\n",
-    "---"
+    "What each column of the dataset means:\n",
+    "<table>\n",
+    "<thead><tr><th>Column name</th><th>id</th><th>imdb_id</th><th>popularity</th><th>budget</th><th>revenue</th><th>original_title</th><th>cast</th><th>homepage</th><th>director</th><th>tagline</th><th>keywords</th><th>overview</th><th>runtime</th><th>genres</th><th>production_companies</th><th>release_date</th><th>vote_count</th><th>vote_average</th><th>release_year</th><th>budget_adj</th><th>revenue_adj</th></tr></thead><tbody>\n",
+    " <tr><td>Meaning</td><td>ID</td><td>IMDb ID</td><td>Popularity</td><td>Budget</td><td>Revenue</td><td>Title</td><td>Cast</td><td>Homepage</td><td>Director</td><td>Tagline</td><td>Keywords</td><td>Overview</td><td>Runtime</td><td>Genres</td><td>Production companies</td><td>Release date</td><td>Vote count</td><td>Average vote</td><td>Release year</td><td>Budget (adjusted)</td><td>Revenue (adjusted)</td></tr>\n",
+    "</tbody></table>\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note: you need to submit the `.html`, `.ipynb`, and `.py` files exported from this report.**"
    ]
   },
   {
@@ -22,33 +37,36 @@
     "collapsed": true
    },
    "source": [
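-    "## Section 1: Loading Libraries and Importing Data\n",
     "\n",
-    "In this part, you need to write code to complete the following tasks:\n",
     "\n",
-    "1. Load the required libraries: `NumPy`, `Pandas`, `matplotlib`, and `seaborn`.\n",
-    "2. Use `Pandas` to read the data in `tmdb-movies.csv` and save it as `movie_data`.\n",
-    "3. Use the `.head()` method to get the first few rows of the data.\n",
-    "4. Based on the data, pose two questions to guide the rest of your exploration.\n",
+    "---\n",
+    "\n",
+    "---\n",
     "\n",
-    "Hints:\n",
-    "1. Remember to use the notebook magic command `%matplotlib inline`; otherwise your plots will not be displayed.\n",
-    "2. Each question should relate closely to **one** feature of the data, for example: how is the revenue of most movies distributed, or how is the popularity of most movies distributed?"
+    "## Section 1: Importing and Processing Data\n",
+    "\n",
+    "In this part, you need to write code that reads the data with Pandas and preprocesses it."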
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "---\n",
     "\n",
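-    "**Task 1:** Complete the code as required."
+    "**Task 1.1:** Import libraries and data\n",
+    "\n",
+    "1. Load the required libraries: `NumPy`, `Pandas`, `matplotlib`, and `seaborn`.\n",
+    "2. Use `Pandas` to read the data in `tmdb-movies.csv` and save it as `movie_data`.\n",
+    "\n",
+    "Hint: remember to use the notebook magic command `%matplotlib inline`; otherwise your plots will not be displayed."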
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
@@ -58,19 +76,49 @@
    "source": [
     "---\n",
     "\n",
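-    "**Task 2:** Based on the data above, pose two questions to guide the rest of your exploration."
+    "**Task 1.2:** Understand the data\n",
+    "\n",
+    "You will come across all kinds of data tables, so after reading one in, it is worth using a few simple methods to see what it looks like.\n",
+    "\n",
+    "1. Get the number of rows and columns of the table and print them.\n",
+    "2. Use the `.head()`, `.tail()`, and `.sample()` methods to inspect the table.\n",
+    "3. Use the `.dtypes` attribute to check the data type of each column.\n",
+    "4. Use `.isnull()` together with methods such as `.any()` to check whether each column contains null values.\n",
+    "5. Use the `.describe()` method to see how the numeric columns are distributed.\n",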
+    "\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "---\n",
+    "\n",
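+    "**Task 1.3:** Clean the data\n",
     "\n",
-    "- Question 1: (answer here)\n",
+    "In real-world work, data processing is often the most time-consuming and laborious step. Fortunately, the tmdb dataset we provide is very \"clean\" and does not need much cleaning or processing. In this step, your main job is to handle the null values in the table. You can use `.fillna()` to fill them, or `.dropna()` to drop the rows or columns that contain them.\n",
     "\n",
-    "- Question 2: (answer here)"
+    "Task: use an appropriate method to clean the null values, and save the resulting data."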
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -79,24 +127,34 @@
     "\n",
     "---\n",
     "\n",
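-    "## Section 2: Getting Summary Statistics\n",
+    "## Section 2: Reading Data by Specified Criteria\n",
+    "\n",
     "\n",
-    "After reading the data, we need some summary statistics, such as the maximum, minimum, mean, and median."
+    "Compared with data analysis software such as Excel, one of Pandas' major strengths is that it can easily select the right data based on complex logic. Retrieving the appropriate data from a table according to given requirements is therefore a very important Pandas skill, and it is the focus of this section.\n",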
+    "\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "---\n",
+    "\n",
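+    "**Task 2.1:** Simple reads\n",
     "\n",
+    "1. Read the columns named `id`, `popularity`, `budget`, `runtime`, and `vote_average`.\n",
+    "2. Read rows 1-20 together with rows 48 and 49.\n",
+    "3. Read the `popularity` column for rows 50-60.\n",
     "\n",
-    "**Task 3:** Write code to compute how many rows and columns the data has."
+    "Requirement: each statement must be written as a single line of code."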
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
@@ -106,13 +164,22 @@
    "source": [
     "---\n",
     "\n",
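-    "**Task 4:** Get some statistics for any two columns of the data, such as the maximum, minimum, mean, median, or standard deviation. You can use the `.describe` method to get statistics for the whole table."
+    "**Task 2.2:** Logical indexing\n",
+    "\n",
+    "1. Read all rows where **`popularity` is greater than 5**.\n",
+    "2. Read all rows where **`popularity` is greater than 5** and the **release year is after 1996**.\n",
+    "\n",
+    "Hint: the Pandas logical operators `&` and `|` mean `and` and `or`, respectively.\n",
+    "\n",
+    "Requirement: use logical indexing."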
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
@@ -122,19 +189,23 @@
    "source": [
     "---\n",
     "\n",
-    "**Task 5:** How do the statistics obtained above help you answer your two questions?"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "**Task 2.3:** Grouped reads\n",
     "\n",
-    "- Question 1: (answer here)\n",
+    "1. Group by `release_year` and use [`.agg`](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) to get the mean of `revenue`.\n",
+    "2. Group by `director` and use [`.agg`](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) to get the mean of `popularity`, sorted from high to low.\n",
     "\n",
-    "- Question 2: (answer here)"
+    "Requirement: use `groupby`."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
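@@ -145,88 +216,110 @@
     "\n",
     "## Section 3: Plotting and Visualization\n",
     "\n",
-    "Next, try plotting and visualizing your data. Based on what you learned in the course, you can draw the following charts for different data types:\n",
+    "Next, try plotting and visualizing your data. The most important thing in this section is choosing a suitable chart for a given visualization goal. A visualization goal is the information or change you hope to observe through the visualization, for example how revenue changes over time, or which director is the most popular.\n",
     "\n",
-    "1. Bar chart\n",
-    "2. Pie chart\n",
-    "3. Histogram\n",
-    "4. Scatter plot\n",
-    "5. Line chart\n",
-    "6. Box plot\n",
-    "7. Heat map\n",
-    "8. Violin plot\n",
-    "9. Rug plot\n",
-    "10. Strip plot\n",
-    "11. Stacked chart\n",
+    "<table>\n",
+    "<thead><tr><th>Visualization goal</th><th>Suitable charts</th></tr></thead><tbody>\n",
+    " <tr><td>Show the distribution of one attribute</td><td>Pie chart, histogram, scatter plot</td></tr>\n",
+    " <tr><td>Show how one attribute changes with another variable</td><td>Bar chart, line chart, heat map</td></tr>\n",
+    " <tr><td>Compare the relationships among several attributes</td><td>Scatter plot, violin plot, stacked bar chart, stacked line chart</td></tr>\n",
+    "</tbody></table>\n",
     "\n",
-    "Now it is your turn to use what you have learned to visualize our data!"
+    "In this part, choose an appropriate chart for each question, draw it, and analyze the result. The optional tasks are more difficult; feel free to take on the challenge."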
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Task 6:** Based on your Question 1, visualize a suitable data feature and try to answer your question."
+    "**Task 3.1:** Plot the `popularity` values of the 20 movies with the highest `popularity`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "(Answer)"
+    "---\n",
+    "**Task 3.2:** Analyze how movie net profit (revenue minus cost) changes by year, and briefly discuss the result."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "---\n",
     "\n",
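-    "**Task 7:** Based on your Question 2, visualize a suitable data feature and try to answer your question."
+    "**[Optional] Task 3.3:** Select the 10 most prolific directors (those with the most movies), plot the revenue of each director's top 3 movies, and briefly analyze the result."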
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "(Answer)"
+    "---\n",
+    "\n",
+    "**[Optional] Task 3.4:** Analyze how the number of movies released in June changed from 1968 to 2015."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "---\n",
     "\n",
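-    "**Task 8:** (Challenge) Try picking a set of features for a multivariate visualization. Multivariate visualization can help reveal relationships in the data, for example the relationship between a movie's revenue and its popularity."
+    "**[Optional] Task 3.5:** Analyze how the numbers of `Comedy` and `Drama` movies released in June changed from 1968 to 2015."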
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
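-    "> Note: Once you have written all the code and answered all the questions, you can export your iPython Notebook as an HTML file from the menu bar via **File -> Download as -> HTML (.html)**. Submit the HTML file together with this iPython notebook as your assignment."
+    "> Note: Once you have written all the code and answered all the questions, you can export your iPython Notebook from the menu bar via **File -> Download as -> HTML (.html)** and **Python (.py)**. Submit the exported HTML and Python files together with this iPython notebook to the reviewer."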
    ]
   }
  ],
@@ -246,7 +339,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.4"
+   "version": "3.6.1"
   }
  },
  "nbformat": 4,