Practical Machine Learning
Writeup
1. Summary
This document is the final report of the Peer Assessment project from the Practical Machine
Learning course, which is part of the Data Science Specialization. It was written and coded in
RStudio, using its knitr functions, and published in HTML format. The purpose of this analysis is to
predict the manner in which the six participants performed the exercises described below and to
answer the questions of the associated course quiz. The machine learning algorithm, which is trained
to predict the classe variable in the training set, is applied to the 20 test cases available in the
test data. The predictions are submitted to the Course Project Prediction Quiz for grading.
2. Introduction
Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of
data about someone’s physical activity. These devices are used by enthusiasts who take
measurements of themselves regularly to improve their health, to find patterns in their behavior,
or because they are tech geeks. However, even though these enthusiasts regularly quantify how
much of a particular activity they do, they rarely quantify how well they do it. In this project, the goal
is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. They
were asked to perform barbell lifts correctly and incorrectly in five different ways.
More information is available from the following website: http://groupware.les.inf.puc-rio.br/har (see
the section on the Weight Lifting Exercise Dataset).
3. Source of Data
The data for this project can be found on the following website:
http://groupware.les.inf.puc-rio.br/har.
The training data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The full reference is as follows:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of
Weight Lifting Exercises.” Proceedings of 4th International Conference in Cooperation with
SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
library(lattice)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(rattle)
library(randomForest)
library(RColorBrewer)
set.seed(1813)
Create two partitions (75 % and 25 %) within the original training dataset.
The two resulting datasets (train_set and test_set) contain a large number of NA values as well as
near-zero-variance (NZV) variables. Variables of both kinds are removed.
Columns 1 to 5 are identification variables only, so they are removed as well.
These steps reduce the number of variables for the analysis from the original 160 to 54.
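The code for these cleaning steps is not shown above, so the following is only a minimal sketch of how they might look. The helper clean_columns, its arguments id_cols and na_cutoff, the 95 % NA threshold, the order of the steps, and the toy data frame are all assumptions made for illustration. Only createDataPartition and nearZeroVar are actual caret functions.

```r
library(caret)

# Hypothetical helper (not from the report): applies the three cleaning steps
# to both partitions, deciding which columns to drop on the training part only.
clean_columns <- function(train_df, test_df, id_cols = 1:5, na_cutoff = 0.95) {
  # Step 1: drop the identification columns (columns 1 to 5 in the source data).
  train_df <- train_df[, -id_cols, drop = FALSE]
  test_df  <- test_df[, -id_cols, drop = FALSE]

  # Step 2: drop near-zero-variance predictors.
  nzv <- nearZeroVar(train_df)
  if (length(nzv) > 0) {
    train_df <- train_df[, -nzv, drop = FALSE]
    test_df  <- test_df[, -nzv, drop = FALSE]
  }

  # Step 3: drop columns whose share of NA values exceeds the assumed cutoff.
  mostly_na <- colMeans(is.na(train_df)) > na_cutoff
  list(train = train_df[, !mostly_na, drop = FALSE],
       test  = test_df[, !mostly_na, drop = FALSE])
}

# Toy stand-in for the downloaded training data (assumption, for illustration).
set.seed(1813)
n <- 200
toy <- data.frame(
  id1 = seq_len(n), id2 = "a", id3 = 0, id4 = 0, id5 = 0,  # ID-like columns
  const  = 1,                                # constant, hence zero variance
  sparse = c(rnorm(4), rep(NA, n - 4)),      # almost entirely NA
  x1 = rnorm(n), x2 = rnorm(n),
  classe = factor(sample(LETTERS[1:5], n, replace = TRUE))
)

# 75 % / 25 % split, stratified on the outcome.
in_train <- createDataPartition(toy$classe, p = 0.75, list = FALSE)
parts <- clean_columns(toy[in_train, ], toy[-in_train, ])
names(parts$train)
```

Choosing the columns on the training partition and then dropping the same columns from the test partition keeps both data frames aligned for later calls to predict().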
5. Correlation Analysis
Perform a correlation analysis between the variables before the modeling work itself is done. Select
“FPC” for the first principal component order.
corr_matrix <- cor(train_set[ , -54])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
If two variables are highly correlated, their colors are either dark blue (for a positive correlation) or
dark red (for a negative correlation). To further reduce the number of variables, a Principal
Components Analysis (PCA) could be performed as the next step. However, since there are only
a few strong correlations among the input variables, the PCA will not be performed. Instead, a
few different prediction models will be built next.
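The statement that strong correlations are rare can also be checked numerically rather than by eye. The sketch below lists every pair of variables whose absolute correlation exceeds a cutoff. The helper strong_pairs and the 0.8 cutoff are assumptions, not values from this report, and the toy data frame stands in for train_set[ , -54].

```r
# Hypothetical helper (not from the report): returns each variable pair whose
# absolute correlation exceeds the cutoff, counting every pair only once.
strong_pairs <- function(corr_matrix, cutoff = 0.8) {
  # upper.tri() keeps one triangle, so the diagonal and mirrored pairs are skipped.
  idx <- which(abs(corr_matrix) > cutoff & upper.tri(corr_matrix), arr.ind = TRUE)
  data.frame(
    var1 = rownames(corr_matrix)[idx[, "row"]],
    var2 = colnames(corr_matrix)[idx[, "col"]],
    r    = corr_matrix[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data (assumption): b is built from a, c is independent of both.
set.seed(1813)
a   <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

strong_pairs(cor(toy))
```

Applied to corr_matrix from the chunk above, a short result would support skipping the PCA, while a long one would argue for it.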
6. Prediction Models
6.1. Decision Tree Model
set.seed(1813)
fit_decision_tree <- rpart(classe ~ ., data = train_set, method="class")
fancyRpartPlot(fit_decision_tree)
Predictions of the decision tree model on test_set.
The predictive accuracy of the decision tree model is relatively low at 74.9 %.
Plot the predictive accuracy of the decision tree model.
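The chunks for the prediction and the accuracy plot are not shown, so the following is a sketch of one way to carry out these steps. It uses the built-in iris data as a stand-in for train_set and test_set (Species plays the role of classe), and a plain table() instead of a caret confusion matrix. All object names here are assumptions.

```r
library(rpart)

# Stand-in data (assumption): iris replaces the cleaned partitions so that the
# block runs without downloading anything.
set.seed(1813)
in_train  <- sample(nrow(iris), size = round(0.75 * nrow(iris)))
toy_train <- iris[in_train, ]
toy_test  <- iris[-in_train, ]

fit_tree <- rpart(Species ~ ., data = toy_train, method = "class")

# Predict class labels for the held-out rows and cross-tabulate them
# against the true labels.
pred_tree  <- predict(fit_tree, newdata = toy_test, type = "class")
conf_table <- table(predicted = pred_tree, actual = toy_test$Species)

# Accuracy is the share of held-out rows on the diagonal of the table.
accuracy <- sum(diag(conf_table)) / sum(conf_table)

# Mosaic plot of the confusion table with the accuracy in the title.
plot(conf_table,
     main = paste("Decision Tree Model: Predictive Accuracy =", round(accuracy, 4)))
```

With the real data, the same steps would use fit_decision_tree, test_set, and test_set$classe in place of the stand-ins.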
The Random Forest model is selected and applied to make predictions on the 20 data points from
the original testing dataset (data_quiz).
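How the Random Forest model was fitted is not shown in this section, so the sketch below only illustrates the final prediction step. The direct call to randomForest(), the value ntree = 200, and the iris stand-ins for train_set and data_quiz are assumptions. The original analysis may have fitted the model differently, for example through caret.

```r
library(randomForest)

# Stand-in data (assumption): iris replaces train_set, and 20 held-out rows
# without the outcome column play the role of data_quiz.
set.seed(1813)
in_train  <- sample(nrow(iris), size = round(0.75 * nrow(iris)))
toy_train <- iris[in_train, ]
toy_quiz  <- iris[-in_train, names(iris) != "Species"][1:20, ]

# Fit the forest on the training rows; ntree = 200 is an arbitrary choice.
fit_rf <- randomForest(Species ~ ., data = toy_train, ntree = 200)

# One predicted class label per quiz row.
predict_quiz <- predict(fit_rf, newdata = toy_quiz)
predict_quiz
```

The quiz data must contain the same predictor columns as the training data, which is why the cleaning steps have to be applied to it as well.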