This script:
1. Merges the training and the test sets to create one data set.
2. Extracts only the measurements on the mean and standard deviation for each measurement.
3. Uses descriptive activity names to name the activities in the data set.
4. Appropriately labels the data set with descriptive variable names.
5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
Each set (training and test) is stored in 3 tables: subject_*, X_* and y_*, where * is train or test. Variable names are contained in features.txt and activity names are contained in activity_labels.txt. The output of each step is used as the input to the following step. The tidy data set at the end of step 5 is called data5 and is in wide format (see the references below for the conditions a tidy data set must meet).
Step 1:
- Names for the columns of the X_* tables come from features.txt, so load that file first.
- Clean the column names from features.txt (this enables the select function in step 2 and also completes most of step 4).
- Column bind the 3 tables for training and test separately.
- Row bind the training and test tables to create one data set.
- OUTPUT: data1
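
The merge in step 1 can be sketched as follows. This is a minimal illustration in base R, not the script itself: the feature names, the numeric values and the regular expression used for cleaning are assumptions, and small in-memory data frames stand in for the tables that would normally be read from the text files.

```r
# Toy stand-in for the names in features.txt (hypothetical sample values).
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "angle(X,gravityMean)")

# Assumed cleaning rule: each run of '-', ',', '(' or ')' becomes one '_'.
clean_names <- gsub("[-,()]+", "_", features)

# Toy stand-ins for the X_* tables of each set (made-up values).
x_train <- data.frame(c(0.28, 0.27), c(-0.99, -0.98), c(-0.84, -0.83))
x_test  <- data.frame(0.25, -0.93, -0.71)
names(x_train) <- clean_names
names(x_test)  <- clean_names

# Column bind the subject, activity (y) and X tables of each set separately ...
train <- cbind(subject = c(1, 3), activity = c(5, 4), x_train)
test  <- cbind(subject = 2, activity = 5, x_test)

# ... then row bind the two sets into one data set.
data1 <- rbind(train, test)
```

With this cleaning rule a name such as `angle(X,gravityMean)` ends in an underscore, which is the case that step 4 tidies up.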
Step 2:
- Create a vector of the column names to include, by strict name match with '_mean_' and '_std_'. This corresponds to a strict search for '-mean()' and '-std()' in the original names; a looser search (say on 'mean' and 'std') would produce more variables.
- Create the extract using the select_ function, which allows selection based on a vector of column names.
- OUTPUT: data2
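
The strict name match in step 2 might look like the sketch below. The column names and values are invented for illustration. Because select_ is deprecated in recent dplyr releases, this sketch swaps in `select()` with `all_of()`, which also selects from a character vector of names; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for data1 with already-cleaned column names (made-up values).
data1 <- data.frame(
  subject = c(1, 2),
  activity = c(5, 4),
  tBodyAcc_mean_X = c(0.28, 0.27),
  tBodyAcc_std_X = c(-0.99, -0.98),
  fBodyAcc_meanFreq_X = c(0.10, 0.20),
  tBodyAccMag_mean_ = c(-0.95, -0.96)
)

# Strict match: keep only names containing '_mean_' or '_std_'.
# 'fBodyAcc_meanFreq_X' contains 'mean' but not '_mean_', so it is dropped.
keep <- grep("_mean_|_std_", names(data1), value = TRUE)

# Select the identifier columns plus the matched measurement columns.
data2 <- select(data1, all_of(c("subject", "activity", keep)))
```

A looser pattern such as `"mean|std"` would also pick up the `meanFreq` column, which is the difference the step describes.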
Step 3:
- Use activity_labels.txt for the activity names.
- Read in the activity names and merge them with the data set.
- OUTPUT: data3
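
As a rough sketch of step 3, the activity codes can be joined to their names with base R's `merge()`. The column names `activity` and `activity_name`, the two label rows and the measurement values are assumptions made for this example; the real labels would be read from activity_labels.txt.

```r
# Toy stand-in for data2 (made-up values).
data2 <- data.frame(
  subject = c(1, 1, 2),
  activity = c(5, 4, 5),
  tBodyAcc_mean_X = c(0.28, 0.27, 0.25)
)

# Toy stand-in for activity_labels.txt: one row per activity code.
activity_labels <- data.frame(
  activity = c(4, 5),
  activity_name = c("SITTING", "STANDING")
)

# Join on the activity code. Note that merge() sorts the result by the key,
# so the row order of data3 differs from that of data2.
data3 <- merge(data2, activity_labels, by = "activity")
```

Because `merge()` reorders rows, it is safest to merge only after the subject, activity and measurement columns have been bound together, as the step order here does.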
Step 4:
- The 'illegal' characters such as '-', ',', '(' and ')' were already replaced in step 1.
- Remove the last underscore '_' if it exists.
- The original names, after replacing the 'illegal' characters, are considered sufficiently descriptive in combination with CodeBook.md.
- OUTPUT: data3 (renamed)
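
Removing a trailing underscore, as in step 4, can be done with a single anchored substitution. The data frame below is a made-up stand-in for data3.

```r
# Toy stand-in for data3 with one name that ends in an underscore.
data3 <- data.frame(
  subject = c(1, 2),
  tBodyAcc_mean_X = c(0.28, 0.27),
  tBodyAccMag_mean_ = c(-0.95, -0.96)
)

# '_$' matches an underscore only at the very end of a name, so underscores
# inside a name are left alone and names without one are unchanged.
names(data3) <- sub("_$", "", names(data3))
```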
Step 5:
- Use dplyr pipe commands to group by the activity and subject variables and then summarise by mean.
- The data set is tidy as per the references below.
- OUTPUT: data5
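
The grouping and averaging in step 5 could be written along these lines. The input values and column names are invented, and the sketch assumes dplyr 1.0.0 or later for `across()` and the `.groups` argument; the script itself may use a different summarise variant.

```r
library(dplyr)

# Toy stand-in for the renamed data3 from step 4 (made-up values).
data3 <- data.frame(
  subject = c(1, 1, 2, 2),
  activity_name = c("SITTING", "SITTING", "SITTING", "STANDING"),
  tBodyAcc_mean_X = c(0.2, 0.4, 0.1, 0.3),
  tBodyAcc_std_X = c(-0.9, -0.7, -0.8, -0.6)
)

# One row per activity/subject pair, one column per averaged variable
# (wide format). across(everything(), mean) skips the grouping columns.
data5 <- data3 %>%
  group_by(activity_name, subject) %>%
  summarise(across(everything(), mean), .groups = "drop")
```

In this toy input, subject 1 has two SITTING rows, so they collapse into a single row holding the mean of each measurement.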
Tidy data set: