Commit: final version
bvonkonsky committed Apr 26, 2014
1 parent d634078 commit 7bdfa07
Showing 2 changed files with 6 additions and 6 deletions.
CodeBook.md: 4 changes (2 additions & 2 deletions)
@@ -14,13 +14,13 @@ Papers arising from the original or cleaned data should reference:
The original data were downloaded and subsequently cleaned using an [R script](http://www.r-project.org/) called [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningData/blob/master/run_analysis.R).

- The cleaned version merges the training and test data from the original study, and uses arguably more meaningful variable names. The mapping from original to modified variable names are shown in the table below. Occurrences of **Mag** were changed to **Magnitude**, **Acc** to **Accelerate**, **std** to **StandardDeviation** and **mean** to **Mean**. All parentheses, dots and hyphens were removed. The **t** prefixes for **time** and **f** prefixes for **frequency domain** at the beginning of variable names were retained to avoid making names more unwieldy. A small modification to the script could easily expand this if desired.
+ The cleaned version merges the training and test data from the original study, and uses arguably more meaningful variable names. The mapping from original to modified variable names is shown in the table below. Occurrences of **Mag** were changed to **Magnitude**, **Acc** to **Accelerate**, **std** to **StandardDeviation** and **mean** to **Mean**. All parentheses, dots and hyphens were removed. The **t** prefixes for **time** and **f** prefixes for **frequency domain** at the beginning of variable names were retained to avoid making names more unwieldy. A small modification to the script could easily expand this if desired.

The cleaned version of the data only retains original variables that include **mean** or **std**. Variables that did not contain **mean** or **std** were intentionally dropped from the cleaned data set.
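For illustration, a minimal R sketch of the filtering and renaming just described; the example feature names and the `cleanNames` helper are assumptions, not the actual code in run_analysis.R:

```r
# Keep only features whose names contain "mean" or "std", then rename.
features <- c("tBodyAcc-mean()-X", "tBodyAccMag-std()", "angle(X,gravityMean)")
kept <- features[grepl("mean|std", features)]

cleanNames <- function(x) {
  x <- gsub("Mag",  "Magnitude",         x, fixed = TRUE)
  x <- gsub("Acc",  "Accelerate",        x, fixed = TRUE)
  x <- gsub("std",  "StandardDeviation", x, fixed = TRUE)
  x <- gsub("mean", "Mean",              x, fixed = TRUE)
  gsub("[().-]", "", x)   # drop all parentheses, dots and hyphens
}

cleanNames(kept)
# "tBodyAccelerateMeanX"  "tBodyAccelerateMagnitudeStandardDeviation"
```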

The cleaned version of the data follows the four principles of a Tidy Dataset as described by [Jeff Leek](http://biostat.jhsph.edu/~jleek/) and the Leek Group in their document on [datasharing](https://github.com/jtleek/datasharing). Specifically:
* each variable is in its own column;
- * each observation is contained in a single row, in this case labeled by subject ID and the activity that the subject was engaged in at the time of the measurement;
+ * each observation (or mean of aggregated observations) is contained in a single row, in this case labeled by subject ID and the activity that the subject was engaged in at the time of the measurement;
* one table for each kind of measurement, in this case measurements taken using Samsung mobile devices recorded while subjects were engaged in various activities; and
* multiple tables are readily linked, in this case, by subject ID and activity.

README.md: 8 changes (4 additions & 4 deletions)
@@ -19,11 +19,11 @@ The original data format stored test and training data in different sub-directories
####Tidy Data Produced by run_analysis.R
The script [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningData/blob/master/run_analysis.R):
* merges the training and test data by Subject ID for a given Activity;
- * retains mean and standard deviation attributes, assumed to be those attributes that originally ended in **mean()** and **std()**, and drops other attributes;
+ * retains mean and standard deviation attributes, assumed to be those attributes that contain **mean** or **std** anywhere in the attribute name, and drops others;
* makes attribute names arguably more readable by changing abbreviations like **Mag** to **Magnitude**, **Acc** to **Acceleration**, and **std** to **StandardDeviation**, and removing parentheses, dots, and hyphens;
* adds **SubjectID** and **Activity** columns identifying the subject and the activity, where the activity is shown as a human-readable string rather than as an integer;
* generates a tidy dataset of the merged data in [Comma Separated Values (CSV)](http://en.wikipedia.org/wiki/Comma-separated_values) format in a file called **tidyMerged.csv**; and
- * generates a second tidy dataset in CSV format that contains the average of reported attributes by subject in a file called **tidyAveraged.csv**.
+ * generates a second tidy dataset in CSV format that contains the average of reported attributes by subject and activity in a file called **tidyAveraged.csv**.
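A minimal sketch of that flow, assuming `train` and `test` are already tidy data frames with **SubjectID** and **Activity** columns; this is an illustration, not the code in run_analysis.R:

```r
# Merge the tidied sets, write the merged tidy data,
# then average every measurement by subject and activity.
merged <- rbind(train, test)
write.csv(merged, "tidyMerged.csv", row.names = FALSE)

averaged <- aggregate(. ~ SubjectID + Activity, data = merged, FUN = mean)
write.csv(averaged, "tidyAveraged.csv", row.names = FALSE)
```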

**See Also:** [CodeBook.md](https://github.com/bvonkonsky/GettingAndCleaningData/blob/master/CodeBook.md)

@@ -33,7 +33,7 @@ To use [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningData/blo
1. Download and install [R](http://www.r-project.org/) and [R Studio](https://www.rstudio.com/).
2. Obtain a copy of [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningData/blob/master/run_analysis.R) from [Github](https://github.com/) and store it in your project directory.
3. Run [R Studio](https://www.rstudio.com/).
- 4. Use **setwd("\<project directory\>")** to set the working directory to your project directory.
+ 4. Use **setwd("\<project directory\>")** to set the working directory to your project directory containing the [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningData/blob/master/run_analysis.R) script.
5. Use **source("run_analysis.R")** to run the script. If necessary, the script will download and unzip the original data into the current working directory. The original dataset is large, so please be patient. Not including the initial download and unzip, the script takes around 30 seconds to run on a 2.3 GHz Intel Core i7 iMac running Mac OS X 10.9.2.
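For example, the complete console session might look like this (the path is a placeholder for your own project directory):

```r
setwd("~/projects/GettingAndCleaningData")  # placeholder; use your project directory
source("run_analysis.R")                    # downloads the data if needed, then writes the CSV files
list.files(pattern = "^tidy")               # should list tidyMerged.csv and tidyAveraged.csv
```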


@@ -43,7 +43,7 @@ Functions in [run_analysis.R](https://github.com/bvonkonsky/GettingAndCleaningDa

* **main <- function()** <br/> Creates two tidy [data frames](http://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html) for the training and test data and then merges these into a single [data frame](http://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html). The merged data frame is written to a CSV file called **tidyMerged.csv**. The merged data frame is then averaged by activity for each subject, and written to a second CSV file called **tidyAveraged.csv**. Paths to files and directories in the original data are coded to work on all operating systems supported by R using the file.path() function.
* **getAndClean <- function(subjectsFilename, labelsFilename, dataFilename)** <br/> Recovers raw data, subject ID numbers, and activities from the three files used to store information from one of the two original data sets (either test or training) and combines data for that set into a single tidy [data frame](http://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html).
- * **getData <- function(fileName)** <br/> Gets the feature set and the raw data. Keep rows that end in **mean()** or **std()**, and edit the remaining headers to make them more readable.
+ * **getData <- function(fileName)** <br/> Gets the feature set and the raw data. Keeps columns whose names contain **mean** or **std**, drops the rest, and edits the kept names to make them more readable.
* **getActivities <- function(fileName)** <br/> Reads the Activity ID for each observation and converts it to a meaningful English verb (e.g. WALKING, STANDING).
* **getSubjectIDs <- function(fileName)** <br/> Returns a list of SubjectIDs for each observation in the set.
* **getActivityLabels <- function(filename)** <br/> Returns an ordered list of sequential activity labels for use as a lookup table in other functions.
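As an illustration of the lookup that **getActivities** performs, a hedged sketch along these lines (the file names follow the original dataset layout; the actual function body may differ):

```r
# Convert numeric activity IDs to readable labels using the dataset's
# activity_labels.txt lookup table (ID, label).
getActivitiesSketch <- function(fileName, labelsFile = "activity_labels.txt") {
  labels <- read.table(labelsFile, col.names = c("id", "label"))
  ids    <- read.table(fileName,   col.names = "id")
  factor(ids$id, levels = labels$id, labels = labels$label)
}
```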
