Update README.md

maltazor · Nov 14, 2013 · 7a41e88
1 parent 8231e2b
commit 7a41e88
Showing 1 changed file with 43 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -27,11 +27,53 @@ For maximum speed in the analysis this is the information you should pass to a s
 
 1. The raw data.
 2. A [tidy data set](http://vita.had.co.nz/papers/tidy-data.pdf) 
-3. An explicit and exact recipe you used to go from 1 -> 2 
+3. A code book describing each variable and its values in the tidy data set.  
+4. An explicit and exact recipe you used to go from 1 -> 2,3 
 
 Let's look at each part of the data package you will transfer. 
 
 
+### The raw data
+
+It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
+raw form of data:
+
+* The strange binary file your measurement machine spits out
+* The unformatted Excel file with 10 workbooks the company you contracted with sent you
+* The complicated JSON data you got from scraping the Twitter API
+* The hand-entered numbers you collected looking through a microscope
+
+You know the raw data is in the right format if you: 
+
+1. Ran no software on the data
+2. Did not manipulate any of the numbers in the data
+3. Did not remove any data from the data set
+4. Did not summarize the data in any way
+
+If you did any manipulation of the data at all, it is not the raw form of the data. Reporting manipulated data
+as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
+forensic study of your data to figure out why the raw data looks weird. 
+
+### The tidy data set
+
+The general principles of tidy data are laid out by Hadley Wickham in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
+and [this video](http://vimeo.com/33727555). The paper and the video are both focused on the R package, which you
+may or may not know how to use. Regardless, the four general principles you should pay attention to are:
+
+1. Each variable you measure should be in one column
+2. Each different observation of that variable should be in a different row
+3. There should be one table for each "kind" of variable
+4. If you have multiple tables, they should include a column in the table that allows them to be linked
+
+
+### The code book
+
+
+### The instruction list/script
+
+
+
+
 
 What you should expect from a statistician
 ====================