This repo contains the submission for the Course Project of Getting and Cleaning Data, part of Coursera's Data Science Specialization.
- Run the R script file run_analysis.R using RStudio.
The Tidy Dataset is written to the file tidyData.txt.
The Code Book for the Tidy Dataset can be found in CodeBook.md, along with the methodology for run_analysis.R.
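As a rough sketch of how the output might be inspected after the script has run: `load_tidy` below is a hypothetical helper, not part of this repo, and it assumes tidyData.txt was written with a header row and whitespace-separated fields (the `write.table()` defaults).

```r
# Run the analysis first (it downloads the data and writes tidyData.txt):
# source("run_analysis.R")

# Hypothetical helper: read the tidy output back into a data frame.
# Assumes a header row and whitespace-separated fields.
load_tidy <- function(path = "tidyData.txt") {
  if (!file.exists(path)) stop("File not found: ", path)
  read.table(path, header = TRUE, stringsAsFactors = FALSE)
}
```

Calling `str(load_tidy())` from the repo directory would then list the columns of the tidy dataset.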
The run_analysis.R script takes the following steps to transform the Raw Dataset into the Tidy Dataset:
- Downloads the Raw Dataset as a ZIP archive, using the URL provided.
- Unzips the archive file using the utility function unzip().
- Loads the 'UCI HAR Dataset/features.txt' file into the dtFeatures DataTable and extracts the measurements with mean() and std() in the name. These names are cleaned up so that they are descriptive:
  - The prefixes t and f are substituted with Time and Frequency.
  - () is removed.
  - Mag is substituted with Magnitude.
- Loads the 'UCI HAR Dataset/activity_labels.txt' file into dtActivityLabels and renames the columns to activityID and activityLabel.
- The following files in the 'train' and 'test' folders are loaded into individual DataTables:
  - subject_train.txt
  - X_train.txt
  - y_train.txt
  - subject_test.txt
  - X_test.txt
  - y_test.txt
- The train and test datatables for subject, X and y are each merged using a row bind.
- The Merged subject dataset's column is renamed as subjectID
- The Merged Y dataset's column is renamed as activityID.
- These two datatables are column bound together, and the result is then column bound with the merged X dataset, resulting in the single dataset dtTrainingSet.
- subjectID and activityID from the dtTrainingSet datatable are set as Identifiers.
- Using the dtFeatures datatable, we subset dtTrainingSet to only the columns of interest, i.e., the columns with mean() and std().
- The columns of the subsetted datatable are assigned the descriptive column names from the dtFeatures datatable. This gives us our Tidy Data Set.
- This new datatable with Descriptive Column Names is joined with dtActivityLabels on the activityID columns, to get a dataTable with descriptive Activity Names and descriptive measurements. This is our Tidy Dataset.
- The project required that we extract a second Tidy Dataset, where each variable is aggregated for each activity and each subject. This is accomplished using the lapply() function and grouping the data by the activityName and subjectID columns. The resultant dataset is saved to the tidyData.txt file.
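
The steps above can be sketched on a few made-up rows. This is a minimal illustration, not the contents of run_analysis.R: it uses base R data frames instead of the DataTables described above, skips the download and unzip steps, invents the toy values, anchors the t/f substitution to the start of each name, and assumes the aggregation is a mean.

```r
# Toy stand-ins for features.txt and activity_labels.txt (values are made up).
features <- data.frame(
  featureID = 1:4,
  featureName = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                  "fBodyAccMag-mean()", "tBodyAcc-max()-X"),
  stringsAsFactors = FALSE
)
activity_labels <- data.frame(
  activityID = 1:2,
  activityLabel = c("WALKING", "SITTING"),
  stringsAsFactors = FALSE
)

# Keep only the mean() and std() measurements (parentheses matched literally).
wanted <- grep("mean\\(\\)|std\\(\\)", features$featureName)

# Clean the names: t/f prefix -> Time/Frequency, drop "()", Mag -> Magnitude.
# Anchoring on the first character is an assumption of this sketch.
clean_names <- features$featureName[wanted]
clean_names <- sub("^t", "Time", clean_names)
clean_names <- sub("^f", "Frequency", clean_names)
clean_names <- gsub("\\(\\)", "", clean_names)
clean_names <- gsub("Mag", "Magnitude", clean_names)

# Toy train/test pieces: subject IDs, activity IDs, and four feature columns.
subject_train <- data.frame(V1 = c(1, 1))
subject_test  <- data.frame(V1 = c(2, 2))
y_train <- data.frame(V1 = c(1, 2))
y_test  <- data.frame(V1 = c(1, 1))
x_train <- data.frame(V1 = c(0.1, 0.2), V2 = c(1, 2), V3 = c(5, 6), V4 = c(9, 9))
x_test  <- data.frame(V1 = c(0.3, 0.5), V2 = c(3, 5), V3 = c(7, 9), V4 = c(9, 9))

# Row-bind train and test, then rename the identifier columns.
subject <- rbind(subject_train, subject_test)
names(subject) <- "subjectID"
activity <- rbind(y_train, y_test)
names(activity) <- "activityID"
x <- rbind(x_train, x_test)

# Column-bind everything, keep the columns of interest, apply descriptive names.
combined <- cbind(subject, activity, x)
combined <- combined[, c("subjectID", "activityID", paste0("V", wanted))]
names(combined)[-(1:2)] <- clean_names

# Attach the descriptive activity labels.
labelled <- merge(combined, activity_labels, by = "activityID")

# Second dataset: one row per subject/activity pair (mean is an assumption).
tidy <- aggregate(labelled[clean_names],
                  by = list(subjectID = labelled$subjectID,
                            activityLabel = labelled$activityLabel),
                  FUN = mean)

# Written to a temporary directory here so the sketch never overwrites real output.
out_file <- file.path(tempdir(), "tidyData.txt")
write.table(tidy, out_file, row.names = FALSE)
```

With these toy rows, `tidy` has one row per subject/activity pair that actually occurs, and each measurement column holds the mean over that pair's rows.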