Commit

Added text and example for generators.
cbischoff committed Oct 25, 2013
1 parent af81b1c commit 39523c9
Showing 1 changed file with 37 additions and 2 deletions.
39 changes: 37 additions & 2 deletions labs/lab8/lab8_mapreduce.ipynb
@@ -165,9 +165,44 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To make sense of all those <span style=\"font-family: monospace; font-weight: bold; color: green;\">yield</span> statements popping up in the mapper and reducer methods, you need to understand generators."
"To make sense of all those <span style=\"font-family: monospace; font-weight: bold; color: green;\">yield</span> statements popping up in the mapper and reducer methods, you need to understand generators. The main issue with industrial-strength MapReduce is that you don't have enough memory to store all of your data at once, even after you have split the data across many compute nodes. So instead of receiving one enormous list of data and returning another, the mapper and reducer functions consume their input and emit their output one item at a time, through generators.\n",
"\n",
"When you call an ordinary function, it chugs along until it hits a <span style=\"font-family: monospace; font-weight: bold; color: green;\">return</span> statement, at which point it hands back its result and is done. A generator runs its calculations until it hits a <span style=\"font-family: monospace; font-weight: bold; color: green;\">yield</span> statement. It passes along whatever value it was supposed to yield and then it *pauses*, waiting for someone to ask for the next value. It then continues until it reaches another <span style=\"font-family: monospace; font-weight: bold; color: green;\">yield</span>, and so on.\n",
"\n",
"Not only are the mapper and reducer generators, their (key, value) inputs also arrive through generators. This means that at each step the mapper pulls in one (key, value) pair, does some processing, and then emits one or more (key, value) pairs, which move along to the next stage, such as a combiner or the shuffle step. This is how MapReduce avoids ever having to load a huge dataset into limited memory.\n",
"\n",
"A common stumbling block with generators is that once you have iterated through an entire generator, it is exhausted: iterating over it a second time produces nothing. You can see an example of this mistake by running the code block below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This generator function yields the items of a list one at a time.\n",
"def example_generator(items):\n",
"    for item in items:\n",
"        yield item\n",
"\n",
"# Create a generator.\n",
"my_generator = example_generator([0, 1, 2, 3, 4])\n",
"\n",
"# Iterating over the generator works great the first time.\n",
"print(\"generator iteration 1\")\n",
"print(\"---------------------\")\n",
"for value in my_generator:\n",
"    print(value)\n",
"\n",
"# ...but the second loop prints nothing, because the generator is exhausted.\n",
"print(\"\\n\")\n",
"print(\"generator iteration 2\")\n",
"print(\"---------------------\")\n",
"for value in my_generator:\n",
"    print(value)"
],
"language": "python",
"metadata": {},
"outputs": []
},
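The pause-and-resume behaviour and the exhaustion pitfall described in this change can be sketched in a few lines. This is only an illustrative sketch: `countdown` and `word_mapper` are hypothetical names made up for this example, not part of mrjob or of the lab notebook, and `word_mapper` only imitates the shape of a mapper by reading lines lazily and emitting (key, value) pairs.

```python
def countdown(n):
    """Hypothetical generator function: yields n, n-1, ..., 1."""
    while n > 0:
        yield n      # hand back one value, then pause here
        n -= 1       # execution resumes on this line at the next request


def word_mapper(lines):
    """Hypothetical mapper-style generator: emits (word, 1) for each word.

    Only one line is held in memory at a time, because `lines` is consumed
    lazily and each pair is yielded as soon as it is produced.
    """
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


gen = countdown(3)
first = next(gen)       # runs up to the first yield, then pauses
second = next(gen)      # resumes after the yield and runs to the next one
rest = list(gen)        # drains whatever is left
exhausted = list(gen)   # the generator is done, so this is empty

# To iterate again, call the generator function again to get a fresh generator.
fresh = list(countdown(3))

# The input here is itself an iterator, standing in for a stream of records.
pairs = list(word_mapper(iter(["the cat", "The dog"])))
```

If a sequence of values really has to be traversed more than once, the usual options are to call the generator function again, as with `fresh` above, or to store the values in a list, which is only reasonable when they fit in memory.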
{
"cell_type": "heading",
"level": 3,
@@ -241,7 +276,7 @@
"\n",
"These documents (also linked from HW4) are very useful: [Instructions for Amazon Setup notebook](http://nbviewer.ipython.org/urls/raw.github.com/cs109/content/master/InstructionsForAmazonEMR.ipynb), [Elastic MapReduce Quickstart](http://pythonhosted.org/mrjob/guides/emr-quickstart.html)\n",
"\n",
"Once you have this all set up and working, mrjob makes it *very easy* to run a MapReduce job on EC2. Using the same MRMostUsedWord example as above, the command line invocation to run on EC2 is:"
"Once you have this all set up and working, mrjob makes it *very easy* to run a MapReduce job with EMR. Using the same MRMostUsedWord example as above, the command line invocation to run with EMR is:"
]
},
{