From 39523c9afc9984b1d9438f150f62cf2a55dc9cc6 Mon Sep 17 00:00:00 2001
From: Colin Bischoff
Date: Fri, 25 Oct 2013 02:41:47 -0400
Subject: [PATCH] Added text and example for generators.

---
 labs/lab8/lab8_mapreduce.ipynb | 39 ++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/labs/lab8/lab8_mapreduce.ipynb b/labs/lab8/lab8_mapreduce.ipynb
index 8f1930e..56e962b 100644
--- a/labs/lab8/lab8_mapreduce.ipynb
+++ b/labs/lab8/lab8_mapreduce.ipynb
@@ -165,9 +165,44 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "Generators are necessary to understand all of those yield statements popping up in the mapper and reducer methods."
+     "Generators are necessary to understand all of those yield statements popping up in the mapper and reducer methods. The main issue, in the case of industrial-strength MapReduce, is that you don't have enough memory to store all of your data at once. This is true even after you have split your data between many compute nodes. So instead of getting an enormous list of data, the mapper and reducer functions both receive and emit generators.\n",
+     "\n",
+     "When you run a function, it chugs along until it hits a return statement, at which point it returns some results and then it is done. A generator does its specified calculations until it hits a yield statement. It passes along whatever values it was supposed to yield, and then it *pauses* and waits for someone to tell it to continue. It continues until it reaches another yield, and so on.\n",
+     "\n",
+     "Not only are the mapper and reducer generators; their (key, value) inputs are generators as well. This means that at each step the mapper pulls in one (key, value) pair, does some processing, and then emits one or more (key, value) pairs, which move along to a combiner or a shuffler or whatever. This is how MapReduce avoids ever having to load huge datasets into limited memory.\n",
+     "\n",
+     "A common stumbling block with generators is the fact that once you have iterated through an entire generator, it is done. You can see an example of this mistake by trying to run the code block below."
     ]
    },
+   {
+    "cell_type": "code",
+    "collapsed": false,
+    "input": [
+     "# This function converts a list into a generator.\n",
+     "def example_generator(items):\n",
+     "    for item in items:\n",
+     "        yield item\n",
+     "    \n",
+     "# Create a generator.\n",
+     "my_generator = example_generator([0, 1, 2, 3, 4])\n",
+     "\n",
+     "# Iterating over the generator works great the first time.\n",
+     "print \"generator iteration 1\"\n",
+     "print \"---------------------\"\n",
+     "for value in my_generator:\n",
+     "    print value\n",
+     "    \n",
+     "# ...but it doesn't work the second time.\n",
+     "print \"\\n\"\n",
+     "print \"generator iteration 2\"\n",
+     "print \"---------------------\"\n",
+     "for value in my_generator:\n",
+     "    print value"
+    ],
+    "language": "python",
+    "metadata": {},
+    "outputs": []
+   },
    {
     "cell_type": "heading",
     "level": 3,
@@ -241,7 +276,7 @@
     "\n",
     "These documents (also linked from HW4) are very useful: [Instructions for Amazon Setup notebook](http://nbviewer.ipython.org/urls/raw.github.com/cs109/content/master/InstructionsForAmazonEMR.ipynb), [Elastic MapReduce Quickstart](http://pythonhosted.org/mrjob/guides/emr-quickstart.html)\n",
     "\n",
-    "Once you have this all set up and working, then mrjob makes it *very easy* to run a MapReduce job on EC2. Using the same MRMostUsedWord example as above, the command line invokation to run on EC2 is:"
+    "Once you have this all set up and working, mrjob makes it *very easy* to run a MapReduce job with EMR. Using the same MRMostUsedWord example as above, the command-line invocation to run with EMR is:"
     ]
    },
   {
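The streaming behaviour the new markdown cell describes — mapper and reducer pulling in one (key, value) pair at a time and yielding results forward — can be sketched outside of MapReduce with plain generator functions. This is only an illustration, not the mrjob API: the `mapper`/`reducer` functions, the sample lines, and the `sorted` stand-in for the shuffle phase below are all invented for the sketch, and it uses Python 3 syntax rather than the notebook's Python 2.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Consume one input line at a time and yield (word, 1) pairs;
    # the whole dataset is never held in memory at once.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # `pairs` is assumed sorted by key (the shuffle phase's job),
    # so groupby can stream one key's values at a time.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = iter(["the quick brown fox", "the lazy dog"])
shuffled = sorted(mapper(lines))   # stand-in for the shuffle/sort phase
counts = dict(reducer(iter(shuffled)))
print(counts["the"])  # prints 2: the only repeated word
```

Note that `mapper(lines)` runs no code until something iterates over it; each yielded pair flows through `sorted` and `reducer` without the mapper ever materializing its full output itself.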
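The exhausted-generator pitfall shown in the new code cell has a simple remedy: each call to a generator function returns a fresh, independent iterator. A minimal sketch (Python 3 syntax; `example_generator` mirrors the notebook cell):

```python
def example_generator(items):
    # Each call to this function creates a brand-new generator object.
    for item in items:
        yield item

gen = example_generator([0, 1, 2])
first = list(gen)    # consumes the generator
second = list(gen)   # the same object is now exhausted

# The fix: call the generator function again to get a fresh iterator.
fresh = list(example_generator([0, 1, 2]))

print(first)   # [0, 1, 2]
print(second)  # []
print(fresh)   # [0, 1, 2]
```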