Practical Machine Learning

This document summarizes a machine learning project that predicts the manner in which participants performed weight lifting exercises. The project uses data from accelerometers worn by the participants to build prediction models using decision trees, generalized boosted models, and random forests. The random forest model achieved the highest predictive accuracy, 99.8 % on the held-out test partition, ahead of the generalized boosted model at 98.45 %.

Uploaded by

minhajur rahman

Practical Machine Learning - Prediction Assignment Writeup

1. Summary
This document is the final report for the Peer Assessment project of the Practical Machine
Learning course, which is part of the Data Science Specialization. It was written and coded in
RStudio using knitr and published in HTML format. The purpose of the analysis is to predict the
manner in which six participants performed the exercises described below and to answer the
questions of the associated course quiz. The machine learning algorithm is trained on the classe
variable of the training set and then applied to the 20 test cases in the test data. The
predictions are submitted to the Course Project Prediction Quiz for grading.

2. Introduction
Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of
data about someone’s physical activity. These devices are used by enthusiasts who take
measurements about themselves regularly to improve their health, to find patterns in their behavior,
or because they are tech geeks. However, even though these enthusiasts regularly quantify how
much of a particular activity they do, they rarely quantify how well they do it. In this project, the goal
is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants, who
were asked to perform barbell lifts correctly and incorrectly in five different ways.
More information is available from the following website: http://groupware.les.inf.puc-rio.br/har (see
the section on the Weight Lifting Exercise Dataset).

3. Source of Data
The data for this project can be found on the following website:
http://groupware.les.inf.puc-rio.br/har.
The training data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data for this project:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The full reference is as follows:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of
Weight Lifting Exercises”. Proceedings of 4th International Conference in Cooperation with
SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

4. Data Loading and Cleaning

Set working directory.
setwd("C:/Open/Coursera/Machine_Learning")

Load the required R packages and set a seed.

library(lattice)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(rattle)
library(randomForest)
library(RColorBrewer)
set.seed(1813)

Load the training and test datasets.

url_train <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_quiz  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Treat both the literal "NA" and empty fields as missing values.
data_train <- read.csv(url(url_train), strip.white = TRUE,
                       na.strings = c("NA", ""))
data_quiz  <- read.csv(url(url_quiz), strip.white = TRUE,
                       na.strings = c("NA", ""))
dim(data_train)
## [1] 19622 160
dim(data_quiz)
## [1] 20 160
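
The effect of the na.strings argument used above can be seen on a small inline CSV. This is only a sketch: the rows are made-up values, not taken from the project files, and the column names are borrowed for illustration.

```r
# Toy CSV text (hypothetical values, not taken from the project data).
csv_text <- "id,roll_belt,new_window
1,1.41,no
2,NA,
3,,yes"

toy_csv <- read.csv(text = csv_text, strip.white = TRUE,
                    na.strings = c("NA", ""))

# Both the literal "NA" and the empty fields are read as missing values.
na_counts <- colSums(is.na(toy_csv))
print(na_counts)
```

Without "" in na.strings, the empty new_window field would be read as an empty string rather than NA, so it would not count toward the missing-value share of that column.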

Create two partitions (75 % and 25 %) within the original training dataset.

in_train <- createDataPartition(data_train$classe, p=0.75, list=FALSE)

train_set <- data_train[ in_train, ]
test_set <- data_train[-in_train, ]
dim(train_set)
## [1] 14718 160
dim(test_set)
## [1] 4904 160
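
The idea behind a class-stratified 75/25 split can be sketched in base R. This is an approximation for illustration, not the exact algorithm inside createDataPartition; the labels and the names classe_toy, in_train_toy, train_toy, and test_toy are hypothetical.

```r
set.seed(1813)

# Hypothetical class labels with unequal class sizes.
classe_toy <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75 % of the row indices within each class separately.
idx_by_class <- split(seq_along(classe_toy), classe_toy)
in_train_toy <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

train_toy <- classe_toy[in_train_toy]
test_toy  <- classe_toy[-in_train_toy]

# Class proportions are preserved in the training partition.
print(prop.table(table(classe_toy)))
print(prop.table(table(train_toy)))
```

Sampling within each class keeps the class mix of both partitions close to that of the full dataset, which matters when accuracy on the held-out partition is used to compare models.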

The two datasets (train_set and test_set) contain many NA values as well as near-zero-variance
(NZV) variables. Both kinds of variables will be removed, together with the ID variables.

nzv_var <- nearZeroVar(train_set)

train_set <- train_set[ , -nzv_var]
test_set <- test_set [ , -nzv_var]
dim(train_set)
## [1] 14718 121
dim(test_set)
## [1] 4904 121
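
What "near-zero variance" means can be illustrated with the two metrics that caret documents for nearZeroVar: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The following is a simplified base R sketch on a made-up data frame; the helper nzv_metrics is hypothetical, and the cutoffs are assumed to mirror the documented defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Frequency ratio and percentage of unique values for one column.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

toy_nzv <- data.frame(
  new_window = c(rep("no", 98), rep("yes", 2)),  # dominated by one value
  roll_belt  = seq(1, 100)                       # varies across rows
)

metrics <- sapply(toy_nzv, nzv_metrics)

# Flag a column when one value dominates AND there are few distinct values.
is_nzv <- metrics["freq_ratio", ] > 95 / 5 & metrics["pct_unique", ] < 10
print(metrics)
print(is_nzv)
```

Note that the columns to drop are determined on train_set only and the same indices are then applied to test_set, so both partitions keep identical columns.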

Remove variables that are mostly NA. A threshold of 95 % is selected.

na_var <- sapply(train_set, function(x) mean(is.na(x))) > 0.95

train_set <- train_set[ , na_var == FALSE]
test_set <- test_set [ , na_var == FALSE]
dim(train_set)
## [1] 14718 59
dim(test_set)
## [1] 4904 59
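
The same NA filter can be traced on a small made-up data frame to see which columns survive the 95 % threshold. The values and the names toy_na, na_share, and keep are illustrative assumptions.

```r
toy_na <- data.frame(
  roll_belt = c(1.4, 1.5, 1.6, 1.7),   # complete
  max_roll  = c(NA, NA, NA, NA),       # entirely missing
  yaw_arm   = c(0.1, NA, 0.3, 0.4)     # 25 % missing
)

na_share <- sapply(toy_na, function(x) mean(is.na(x)))  # fraction of NA per column
keep     <- na_share <= 0.95

# drop = FALSE keeps a data frame even if only one column survives.
toy_clean <- toy_na[, keep, drop = FALSE]
print(na_share)
print(names(toy_clean))
```

A column with some missing values (yaw_arm here) is kept; only columns that are almost entirely NA are dropped.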

Since columns 1 to 5 are identification variables only, they will be removed as well.

train_set <- train_set[ , -(1:5)]

test_set <- test_set [ , -(1:5)]
dim(train_set)
## [1] 14718 54
dim(test_set)
## [1] 4904 54

The number of variables for the analysis has been reduced from the original 160 down to 54.

5. Correlation Analysis
Perform a correlation analysis between the variables before the modeling work itself is done. Select
“FPC” for the first principal component order.
corr_matrix <- cor(train_set[ , -54])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))

If two variables are highly correlated, their colors are either dark blue (for a positive correlation) or
dark red (for a negative correlation). To further reduce the number of variables, a Principal
Components Analysis (PCA) could be performed as the next step. However, since there are only
a few strong correlations among the input variables, the PCA will not be performed. Instead, a
few different prediction models will be built next.
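
As a complement to reading the plot, strongly correlated pairs can also be listed numerically. The sketch below uses simulated predictors rather than the project data, and the 0.8 cutoff is an arbitrary assumption chosen for illustration.

```r
set.seed(1813)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is independent noise.
x1 <- rnorm(200)
toy_corr <- data.frame(x1 = x1,
                       x2 = x1 + rnorm(200, sd = 0.05),
                       x3 = rnorm(200))

corr_toy <- cor(toy_corr)

# Look only above the diagonal so that each pair is reported once.
strong <- which(abs(corr_toy) > 0.8 & upper.tri(corr_toy), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr_toy)[strong[, "row"]],
  var2 = colnames(corr_toy)[strong[, "col"]],
  r    = corr_toy[strong]
)
print(strong_pairs)
```

Applied to corr_matrix, a short list of this kind would support the decision to skip the PCA, while a long list would argue for it.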

6. Prediction Models
6.1. Decision Tree Model
set.seed(1813)
fit_decision_tree <- rpart(classe ~ ., data = train_set, method="class")
fancyRpartPlot(fit_decision_tree)

Predictions of the decision tree model on test_set.

predict_decision_tree <- predict(fit_decision_tree, newdata = test_set,

type="class")
conf_matrix_decision_tree <- confusionMatrix(predict_decision_tree,
test_set$classe)
conf_matrix_decision_tree
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1248  173   48   42   39
##          B   49  589   29   92   40
##          C    3   54  667  107   49
##          D   76   89   51  494   98
##          E   19   44   60   69  675
##
## Overall Statistics
##
## Accuracy : 0.749
## 95% CI : (0.7366, 0.7611)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6814
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8946   0.6207   0.7801   0.6144   0.7492
## Specificity            0.9139   0.9469   0.9474   0.9234   0.9520
## Pos Pred Value         0.8052   0.7372   0.7580   0.6114   0.7785
## Neg Pred Value         0.9562   0.9123   0.9533   0.9243   0.9440
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2545   0.1201   0.1360   0.1007   0.1376
## Detection Prevalence   0.3161   0.1629   0.1794   0.1648   0.1768
## Balanced Accuracy      0.9043   0.7838   0.8638   0.7689   0.8506
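
The headline statistics in this output follow directly from the confusion table. As a check, the sketch below re-enters the decision tree counts by hand and recomputes the accuracy, Cohen's kappa, and the sensitivity for class A; the variable names are illustrative.

```r
# Confusion matrix of the decision tree (rows = prediction, columns = reference).
cm <- matrix(
  c(1248, 173,  48,  42,  39,
      49, 589,  29,  92,  40,
       3,  54, 667, 107,  49,
      76,  89,  51, 494,  98,
      19,  44,  60,  69, 675),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n                     # share of correct predictions
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)
sens_A      <- cm["A", "A"] / sum(cm[, "A"])         # correct A / all true A

print(round(c(accuracy = accuracy, kappa = cohen_kappa,
              sensitivity_A = sens_A), 4))
```

Rounded to four decimals, these values agree with the reported Accuracy (0.749), Kappa (0.6814), and class A Sensitivity (0.8946).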

The predictive accuracy of the decision tree model is relatively low at 74.9 %.
Plot the predictive accuracy of the decision tree model.

plot(conf_matrix_decision_tree$table, col = conf_matrix_decision_tree$byClass,

main = paste("Decision Tree Model: Predictive Accuracy =",
round(conf_matrix_decision_tree$overall['Accuracy'], 4)))

6.2. Generalized Boosted Model (GBM)

set.seed(1813)
ctrl_GBM <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit_GBM <- train(classe ~ ., data = train_set, method = "gbm",
trControl = ctrl_GBM, verbose = FALSE)
fit_GBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 42 had non-zero influence.
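
The resampling scheme requested by trainControl(method = "repeatedcv", number = 5, repeats = 2) can be pictured as two independent 5-fold splits, giving ten hold-out sets in total. The base R sketch below illustrates that structure on 20 hypothetical rows; it does not stratify by class and is not caret's implementation.

```r
set.seed(1813)

n_rows  <- 20   # hypothetical number of training rows
k       <- 5    # folds per repeat    (number = 5)
repeats <- 2    # independent repeats (repeats = 2)

# For each repeat, shuffle the fold labels and collect the rows of each fold.
holdouts <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(k), length.out = n_rows))
  for (f in seq_len(k)) {
    holdouts[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold_id == f)
  }
}

# Each row is held out exactly once per repeat.
times_held_out <- table(factor(unlist(holdouts), levels = seq_len(n_rows)))
print(length(holdouts))
print(all(times_held_out == repeats))
```

Each candidate tuning setting is therefore fitted ten times, which is why training with repeated cross-validation takes noticeably longer than a single fit.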

Predictions of the GBM on test_set.

predict_GBM <- predict(fit_GBM, newdata = test_set)

conf_matrix_GBM <- confusionMatrix(predict_GBM, test_set$classe)
conf_matrix_GBM
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    7    0    0    0
##          B    3  932   16    6    3
##          C    0   10  836   14    1
##          D    1    0    3  784   12
##          E    0    0    0    0  885
##
## Overall Statistics
##
## Accuracy : 0.9845
## 95% CI : (0.9806, 0.9878)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9804
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9971   0.9821   0.9778   0.9751   0.9822
## Specificity            0.9980   0.9929   0.9938   0.9961   1.0000
## Pos Pred Value         0.9950   0.9708   0.9710   0.9800   1.0000
## Neg Pred Value         0.9989   0.9957   0.9953   0.9951   0.9960
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2836   0.1900   0.1705   0.1599   0.1805
## Detection Prevalence   0.2851   0.1958   0.1756   0.1631   0.1805
## Balanced Accuracy      0.9976   0.9875   0.9858   0.9856   0.9911

The predictive accuracy of the GBM is relatively high at 98.45 %.

6.3. Random Forest Model

set.seed(1813)
ctrl_RF <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit_RF <- train(classe ~ ., data = train_set, method = "rf",
trControl = ctrl_RF, verbose = FALSE)
fit_RF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.22%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4184    1    0    0    0 0.0002389486
## B    5 2839    3    1    0 0.0031601124
## C    0    4 2563    0    0 0.0015582392
## D    0    0   10 2401    1 0.0045605307
## E    0    1    0    7 2698 0.0029563932
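
The out-of-bag (OOB) error estimate and the class.error column can be recomputed from the counts in this matrix. The sketch below re-enters the counts by hand and assumes, as in the randomForest output, that rows are the true classes and columns the OOB predictions.

```r
# OOB confusion matrix of the random forest (rows = true class).
oob <- matrix(
  c(4184,    1,    0,    0,    0,
       5, 2839,    3,    1,    0,
       0,    4, 2563,    0,    0,
       0,    0,   10, 2401,    1,
       0,    1,    0,    7, 2698),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

class_error <- 1 - diag(oob) / rowSums(oob)   # misclassified share per true class
oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall misclassified share

print(class_error)
print(round(100 * oob_error, 2))              # in percent
```

Only 33 of the 14718 training rows are misclassified out of bag, which corresponds to the reported 0.22 %.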

Predictions of the Random Forest model on test_set.

predict_RF <- predict(fit_RF, newdata = test_set)

conf_matrix_RF <- confusionMatrix(predict_RF, test_set$classe)
conf_matrix_RF
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    1    0    0    0
##          B    0  946    2    0    0
##          C    0    2  853    3    0
##          D    0    0    0  801    1
##          E    1    0    0    0  900
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.9963, 0.999)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9974
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9968   0.9977   0.9963   0.9989
## Specificity            0.9997   0.9995   0.9988   0.9998   0.9998
## Pos Pred Value         0.9993   0.9979   0.9942   0.9988   0.9989
## Neg Pred Value         0.9997   0.9992   0.9995   0.9993   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1929   0.1739   0.1633   0.1835
## Detection Prevalence   0.2845   0.1933   0.1750   0.1635   0.1837
## Balanced Accuracy      0.9995   0.9982   0.9982   0.9980   0.9993

The predictive accuracy of the Random Forest model is excellent at 99.8 %.
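
The accuracy and its 95 % confidence interval can be reproduced from the counts alone. The caret documentation states that confusionMatrix obtains the interval with binom.test, so the sketch below applies the same base R function to the diagonal of the Random Forest table; the variable names are illustrative.

```r
correct <- 1394 + 946 + 853 + 801 + 900   # diagonal of the Random Forest table
total   <- 4904                           # number of rows in test_set

acc_RF <- correct / total
ci_RF  <- binom.test(correct, total, conf.level = 0.95)$conf.int

print(round(acc_RF, 4))
print(round(as.numeric(ci_RF), 4))
```

The interval is narrow because the test partition is large; with only 10 errors in 4904 predictions, the lower bound remains well above the GBM's accuracy of 98.45 %.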

7. Applying the Best Predictive Model to the Test Data

To summarize, the predictive accuracy of the three models evaluated is as follows:

• Decision Tree Model: 74.90 %
• Generalized Boosted Model: 98.45 %
• Random Forest Model: 99.80 %

The Random Forest model is selected and applied to make predictions on the 20 data points from
the original testing dataset (data_quiz).

predict_quiz <- predict(fit_RF, newdata = data_quiz)

predict_quiz
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
