Update README.md
jtleek committed Nov 14, 2013
1 parent 8231e2b commit 7a41e88
Showing 1 changed file with 43 additions and 1 deletion.
44 changes: 43 additions & 1 deletion README.md
@@ -27,11 +27,53 @@ For maximum speed in the analysis this is the information you should pass to a s

1. The raw data.
2. A [tidy data set](http://vita.had.co.nz/papers/tidy-data.pdf)
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 -> 2,3

Let's look at each part of the data package you will transfer.


### The raw data

It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
raw form of data:

* The strange binary file your measurement machine spits out
* The unformatted Excel file with 10 worksheets the company you contracted with sent you
* The complicated JSON data you got from scraping the Twitter API
* The hand-entered numbers you collected looking through a microscope

You know the raw data is in the right format if you:

1. Ran no software on the data
2. Did not manipulate any of the numbers in the data
3. Did not remove any data from the data set
4. Did not summarize the data in any way

If you did any manipulation of the data at all, it is not the raw form of the data. Reporting manipulated data
as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
forensic study of your data to figure out why the raw data looks weird.
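
One way to let the analyst confirm that the file they received is the untouched raw file is to record a checksum at the moment you obtain it. The sketch below is only an illustration, not part of the recipe above: the function name `sha256_of_file` and the chunk size are assumptions chosen for the example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in binary chunks.

    The file is only read, never modified, so running this does not
    change the raw data itself.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If you send the digest along with the raw file, the analyst can recompute it and check that the two values match before starting work.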

### The tidy data set

The general principles of tidy data are laid out by Hadley Wickham in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
and [this video](http://vimeo.com/33727555). The paper and the video both describe tidy data using R, which you
may or may not know how to use. Regardless, the four general principles you should pay attention to are:

1. Each variable you measure should be in one column
2. Each different observation of that variable should be in a different row
3. There should be one table for each "kind" of variable
4. If you have multiple tables, they should include a column in the table that allows them to be linked
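
As a rough illustration of these principles, the sketch below reshapes a small made-up table into tidy form and then links it to a second table. Everything in it is hypothetical: the subjects, the column names (`subject_id`, `visit`, `weight_kg`, `age`), and the helper functions `to_tidy` and `link` are assumptions for the example, written in plain Python rather than R.

```python
# Hypothetical "wide" table: one row per subject, one column per visit.
wide_rows = [
    {"subject_id": "s1", "visit_1": 5.1, "visit_2": 5.4},
    {"subject_id": "s2", "visit_1": 4.8, "visit_2": 4.9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one row per observation: (id, variable, value)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


# Each variable is now a column and each observation is a row.
tidy_measurements = to_tidy(wide_rows, "subject_id", "visit", "weight_kg")

# A second "kind" of variable (subject-level facts) lives in its own table.
subjects = [
    {"subject_id": "s1", "age": 34},
    {"subject_id": "s2", "age": 51},
]


def link(left, right, key):
    """Attach the matching row of `right` to each row of `left` via `key`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left]


# The shared subject_id column is what allows the two tables to be linked.
linked = link(tidy_measurements, subjects, "subject_id")
```

Here `tidy_measurements` has four rows (two subjects times two visits), and `linked` adds each subject's `age` to every measurement row through the shared `subject_id` column.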


### The code book


### The instruction list/script





What you should expect from a statistician
====================