Course project deliverables for the Coursera course Getting and Cleaning Data
- Source the script `run_analysis.R`. When sourced, the script checks whether the required R packages are available and installs any that are missing.
- Calling `download.data()` downloads the zipped dataset and unarchives it.
- Calling `run.analysis()` starts the actual data processing, which proceeds as follows:
  - Feature vector label data is loaded from `features.txt`.
  - Using a regex with `grepl()`, a subset of the label data is created to select the desired data columns.
  - Activity labels are loaded from `activity_labels.txt`.
  - The activity labels (id -> label) and the selected features (id -> label) are passed to a function that loads either the training or the test dataset, depending on a type parameter also passed in:
    - Paths to the data files are built from the type parameter.
    - The data files are loaded, and the feature vector data is filtered using the ids of the selected features.
    - Activity and subject id data are loaded.
    - The feature vector columns are renamed using the names of the selected features.
    - Activities and subjects are given labels using factor levels of the activity and subject id data.
    - Finally, the processed dataset is returned.
  - The processing above is applied to both the training and test datasets.
  - The training and test datasets are merged using `rbind()` and converted to a `data.table` to make the group-wise operations in the following step easier.
  - A new, independent tidy dataset is created by calculating the mean of every variable for each activity and subject.
  - Variable names are loaded into a separate vector and modified to follow the CamelCase convention.
  - The new names are applied to the tidy dataset.
  - Both the raw and tidy datasets are written to disk.
  - The tidy dataset is returned as the output of the function.
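The `grepl()` selection step above can be sketched as follows. The sample feature names and the regex (keeping only `mean()` and `std()` measurements) are assumptions for illustration; the README does not state the exact pattern used:

```r
# Hypothetical sample of the id -> name mapping found in features.txt
features <- data.frame(
  id   = 1:4,
  name = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
           "tBodyAcc-mad()-X", "tBodyAcc-max()-X"),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector; keep only mean() and std() columns.
# The pattern below is an assumption, not the script's actual regex.
selected <- features[grepl("-(mean|std)\\(\\)", features$name), ]
selected$name
# "tBodyAcc-mean()-X" "tBodyAcc-std()-X"
```

The retained `id` column can then be used to filter the measurement columns when the raw data files are read.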
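The merge and group-wise averaging steps might look like this minimal sketch; the column names and values are hypothetical, and the `data.table` package is assumed to be installed:

```r
library(data.table)

# Tiny stand-ins for the processed training and test datasets
train <- data.frame(subject = c(1, 1), activity = c("WALKING", "WALKING"),
                    TimeBodyAccMeanX = c(0.2, 0.4))
test  <- data.frame(subject = 2, activity = "SITTING",
                    TimeBodyAccMeanX = 0.1)

# Merge with rbind() and convert to data.table for group-wise operations
merged <- data.table(rbind(train, test))

# Mean of every measurement column per (activity, subject) pair
tidy <- merged[, lapply(.SD, mean), by = .(activity, subject)]
```

Here `.SD` stands for the remaining (measurement) columns, so each one is averaged within every activity/subject group in a single expression.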
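The CamelCase renaming could be done with a chain of `gsub()` substitutions, as sketched below; the exact rules applied in `run_analysis.R` may differ:

```r
# Hypothetical raw names as they appear in features.txt
raw <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")

# Expand prefixes, spell out the statistic, and drop separators
nice <- gsub("^t", "Time", raw)
nice <- gsub("^f", "Frequency", nice)
nice <- gsub("-mean\\(\\)", "Mean", nice)
nice <- gsub("-std\\(\\)", "Std", nice)
nice <- gsub("-", "", nice)
nice
# "TimeBodyAccMeanX" "FrequencyBodyGyroStdZ"
```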
If the Samsung data is already unzipped and the dataset directory is available as the `UCI HAR Dataset` subdirectory of the current directory, the processing function `run.analysis()` can be called straight away; there is no need to call `download.data()` first.
At the end of processing, both the raw and tidy datasets are written to disk as `raw-dataset.txt` and `tidy-dataset.txt`, respectively, under the current working directory.
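The final write step can be reproduced with `write.table()`, as in the sketch below; the one-row tidy frame is a placeholder for the real result:

```r
# Placeholder tidy dataset standing in for the real output
tidy <- data.frame(activity = "WALKING", subject = 1,
                   TimeBodyAccMeanX = 0.3)

# Write without row names, matching the file name used by the script
write.table(tidy, "tidy-dataset.txt", row.names = FALSE)

# The file can be read back for inspection
check <- read.table("tidy-dataset.txt", header = TRUE)
```

Omitting row names keeps the file a plain space-separated table that `read.table(..., header = TRUE)` can load directly.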