
Machine Learning Sampler

Ma. Sheila A. Magboo


What is Machine Learning?
• Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task effectively without explicit instructions, relying on patterns and inference instead.
• It is seen as a subset of artificial intelligence.
• It builds a mathematical model from sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
• Machine learning algorithms are used in a wide variety of applications where it is infeasible to develop an algorithm of explicit instructions for the task, such as
• email filtering
• computer vision
• etc.
• Machine learning is closely related to computational statistics, which focuses on making predictions using computers.
How are machine learning algorithms grouped?
1. By learning style: forces you to think about the roles of the input data and the model preparation process, and to select the style most appropriate for your problem in order to get the best result
a) Supervised
b) Unsupervised
c) Semi-supervised
2. By similarity: algorithms are grouped in terms of their function (how they work), e.g. tree-based methods and neural-network-inspired methods
Machine Learning Algorithms Grouped by Learning Style
1. Supervised Learning
An outcome variable (or dependent variable) is to be predicted from a given set of predictors (independent variables). The model must undergo a training process until it achieves the desired level of accuracy. Example problems are classification and regression.
• Regression
• Back-Propagation Neural Networks
• Decision Trees
• Random Forest
• KNN
2. Unsupervised Learning
There is no target or outcome variable to predict; the aim is to extract structures or general rules present in the input data. Used for clustering a population into different groups.
• Apriori algorithm
• K-means
3. Semi-supervised Learning
The input data is a mixture of labeled and unlabeled examples. There is a desired prediction problem, but the model must learn the structures to organize the data as well as make predictions. Applied to image classification, where there are large datasets with very few labeled examples.
• Markov Decision Process
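To make the first two styles concrete, here is a minimal, illustrative R sketch (using only built-in data and functions; it is not part of the original slides): supervised learning fits a model to a labeled outcome, while unsupervised learning looks for structure without using any labels.

# Supervised: predict a known outcome (Petal.Length) from a predictor
sup <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(sup)

# Unsupervised: no outcome is given; k-means searches for 3 clusters on its own
unsup <- kmeans(iris[, 1:4], centers = 3)
table(unsup$cluster, iris$Species)   # compare the discovered clusters to the true labels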
Machine Learning Algorithms Grouped by Similarity
• a useful grouping method, but not a perfect one
• there are still algorithms that could just as easily fit into multiple categories, e.g. Learning Vector Quantization, which is both a neural-network-inspired method and an instance-based method
Why R?
• R is a free software environment for statistical computing and graphics.
• R can be easily extended with 4,728 packages available on CRAN (as of Sept 6, 2013).
• Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
• R manuals on CRAN:
• An Introduction to R
• The R Language Definition
• R Data Import/Export
Why R?
• R is widely used in both academia and industry.
• The CRAN Task Views provide collections of packages for different
tasks.
• Machine learning & Statistical learning
• Cluster analysis & finite mixture models
• Time series analysis
• Multivariate statistics
• Analysis of spatial data
Regression
• A predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent variables (predictors)
• Here, we fit a curve/line to the data points in such a manner that the sum of the squared distances of the data points from the curve or line is minimized
• Used for forecasting, time-series modelling and finding the causal-effect relationship between variables, e.g.
o the relationship between rash driving and the number of road accidents by a driver
o predicting the future sales of a company based on current and past information
Types of Regression
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Stepwise Regression
5. Ridge Regression
6. Lasso Regression
7. ElasticNet Regression
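As a quick, hedged taste of the penalized types in the list above (ridge, lasso, ElasticNet), here is a small sketch using the glmnet package. glmnet is an assumption here, not a package these slides introduce, and mtcars is just a convenient built-in dataset.

library(glmnet)   # assumed package, not covered in these slides

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty

cv <- cv.glmnet(x, y, alpha = 1)   # cross-validation picks the penalty strength lambda
coef(cv, s = "lambda.min")         # lasso may shrink some coefficients exactly to 0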
Linear Regression
• Use Linear Regression if the relationship between the dependent variable (Y) and one or more independent variables (X) is linear, i.e. can be modelled using a best-fit straight line (also known as the regression line)
o the dependent variable is continuous
o the independent variable(s) can be continuous or discrete
o the nature of the regression is linear
• Y = a + bX + e
o a is the y-intercept
o b is the slope, computed as b = r × (Sy / Sx)
o e is the error term
o r is the Pearson Correlation Coefficient; Sx and Sy are the standard deviations of x and y respectively
• The Pearson Correlation Coefficient measures the strength of the linear relationship between 2 variables
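As a quick numeric check of the slope formula above, this small sketch (on R's built-in cars data, which the slides do not use) verifies that the slope reported by lm() equals r × (Sy / Sx):

x <- cars$speed
y <- cars$dist

b_from_r <- cor(x, y) * sd(y) / sd(x)   # slope computed from r, Sy and Sx
b_from_lm <- coef(lm(y ~ x))[2]         # slope computed by lm()
c(b_from_r, b_from_lm)                  # the two values agree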
Linear Regression
• Use Linear Regression if there is only 1 independent variable that can predict the dependent variable
Linear Regression
• Linear Regression is very sensitive to outliers. They can terribly affect the regression line and, eventually, the forecasted values.
• So outliers must be removed (a small sketch of this effect follows)
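Here is a small illustrative sketch of that sensitivity on simulated data (the data and numbers are made up for illustration, not taken from the slides): a single extreme point visibly pulls the fitted line.

set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)

y_out <- y
y_out[20] <- 100                     # inject a single outlier

plot(x, y_out, pch = 19)
abline(lm(y ~ x), col = "blue")      # line fitted without the outlier
abline(lm(y_out ~ x), col = "red")   # line fitted with the outlier is pulled toward it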
How do we analyze whether there is a linear correlation between variables in our dataset?
We are interested in determining if a relationship exists between blood
pressure and age, weight, body surface area, duration, pulse rate
and/or stress level.
• blood pressure (y = BP, in mm Hg)
• age (x1 = Age, in years)
• weight (x2 = Weight, in kg)
• body surface area (x3 = BSA, in sq m)
• duration of hypertension (x4 = Dur, in years)
• basal pulse (x5 = Pulse, in beats per minute)
• stress index (x6 = Stress)
How do we know if there is a linear correlation between variables in our dataset?
Pt BP Age Weight BSA Dur Pulse Stress
1 105 47 85.4 1.75 5.1 63 33
2 115 49 94.2 2.1 3.8 70 14
3 116 49 95.3 1.98 8.2 72 10
4 117 50 94.7 2.01 5.8 73 99
5 112 51 89.4 1.89 7 72 95
6 121 48 99.5 2.25 9.3 71 10
7 121 49 99.8 2.25 2.5 69 42
8 110 47 90.9 1.9 6.2 66 8
9 110 49 89.2 1.83 7.1 69 62
10 114 48 92.7 2.07 5.6 64 35
11 114 47 94.4 2.07 5.3 74 90
12 115 49 94.1 1.98 5.6 71 21
13 114 50 91.6 2.05 10.2 68 47
14 106 45 87.1 1.92 5.6 67 80
15 125 52 101.3 2.19 10 76 98
16 114 46 94.5 1.98 7.4 69 95
17 106 46 87 1.87 3.6 62 18
18 113 46 94.5 1.9 4.3 70 12
19 110 48 90.5 1.88 9 71 99
20 122 56 95.7 2.09 7 75 99
To know if there is a correlation between each pair of variables

bp <- read.csv("Blood Pressure vs other Vital Stats.csv", sep=",")
cor(bp[, 2:8])

upper.panel <- function(x, y) {
  points(x, y, pch = 19)
  r <- round(cor(x, y), digits = 2)
  txt <- paste0("R = ", r)
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
pairs(bp[, 2:8], lower.panel = NULL, upper.panel = upper.panel)

library(sjPlot)
bpdata <- data.frame(bp$BP, bp$Age, bp$Weight, bp$BSA, bp$Dur, bp$Pulse, bp$Stress)
sjt.corr(bpdata, triangle = "lower")
#read the csv file and load it into the user-defined variable bp
bp <- read.csv("Blood Pressure vs other Vital Stats.csv", sep=",")

#cor() prints a correlation table showing the correlation between each pair of variables
cor(bp[, 2:8])

#pairs() outputs a correlation graph;
#pch = 19 means use filled circles, pch = 21 means use open circles;
#lower.panel = NULL removes the correlation graph in the lower triangle
pairs(bp[, 2:8], pch = 19, lower.panel = NULL)

#we want to display the Pearson correlation coefficient in the graph,
#so we need to customize what will appear in the upper triangle
upper.panel <- function(x, y) {
  points(x, y, pch = 19)               #plot the points being correlated
  r <- round(cor(x, y), digits = 2)    #cor(x, y) computes the Pearson correlation between x and y
                                       #and the rounded value is assigned to variable r
  displayRtxt <- paste0("R = ", r)     #builds the label text "R = <r>"
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, displayRtxt)          #draws displayRtxt at position (0.5, 0.9) in the panel
}

#to display the customized correlation graph
pairs(bp[, 2:8], lower.panel = NULL, upper.panel = upper.panel)
Another way to show the correlation table

bp <- read.csv("Blood Pressure vs other Vital Stats.csv", sep=",")

#here we use sjPlot to show the correlation between each pair of variables
library(sjPlot)
bpdata <- data.frame(bp$BP, bp$Age, bp$Weight, bp$BSA, bp$Dur, bp$Pulse, bp$Stress)
sjt.corr(bpdata, triangle = "lower")
Other useful methods

bp <- read.csv("Blood Pressure vs other Vital Stats.csv", sep=",")

#dim(bp) displays the dimensions of our data (rows and columns)
dim(bp)

#names(bp) displays the column names of our data
names(bp)

#str(bp) compactly displays the internal structure of our R object, bp
str(bp)

#summary(bp) displays summary statistics per variable in bp, such as the minimum and maximum values,
#mean, median, 1st-quartile value and 3rd-quartile value
summary(bp)

#get the linear model using lm(Y ~ X, data = dataframe)
bplm <- lm(BP ~ Weight, data = bp)
plot(bp$Weight, bp$BP, main = "Scatterplot of BP vs Weight", xlab = "Weight in kg", ylab = "BP in mm Hg")
#to display the linear regression line we use the abline() function with the fitted model bplm
abline(bplm, col = "red")
Findings
• There is a high correlation between
1. BP and Weight (correlation coef = 0.950)
2. BP and BSA (correlation coef = 0.866)
3. Weight and BSA (correlation coef = 0.875)
• There seems to be little or no correlation between
1. BSA and Stress (correlation coef = 0.018)
2. Stress and Weight (correlation coef = 0.034)
3. Duration and BSA (correlation coef = 0.131)
4. BP and Stress (correlation coef = 0.164)

Can we say BSA is a predictor of Weight?
Can we remove either BSA or Weight as a predictor of BP?
#Example 1: Blood pressure. Determine if there is correlation among the 7 variables in the blood pressure data
bp <- read.csv("Blood Pressure vs other Vital Stats.csv", sep=",")

#dim(bp) displays the dimensions of our data (rows and columns)
dim(bp)

#names(bp) displays the column names of our data
names(bp)

#str(bp) compactly displays the internal structure of our R object, bp
str(bp)

#summary(bp) displays summary statistics per variable in bp, such as the minimum and maximum values,
#mean, median, 1st-quartile value and 3rd-quartile value
summary(bp)

#cor() prints a correlation table showing the correlation between each pair of variables
cor(bp[, 2:8])

#pairs() outputs a correlation graph in both the upper and lower triangle
pairs(bp[, 2:8])
R Codes for the Linear Regression Example

#attach() enables us to directly access the column names of the .csv file
attach(bp)

#let's use the linear model function lm(Y ~ X, data = dataframe) to compute whether there is
#a linear correlation between the variables BP and Weight
bplm <- lm(BP ~ Weight, data = bp)
summary(bplm)

#let's generate the scatterplot of the observations
with(bp, plot(Weight, BP, main = "Scatterplot of BP vs Weight", xlab = "Weight in kg", ylab = "BP in mm Hg"))

#let's generate the linear regression line of BP vs Weight;
#since the points are near the line, we say there is a high correlation between BP and Weight
abline(bplm, col = "red")

#you can do the same for the other variables

#compute again the linear correlation, this time between BP and BSA
bplm <- lm(BP ~ BSA, data = bp)

#indicate the parameters of your scatterplot
with(bp, plot(BSA, BP, main = "Scatterplot of BP vs Body Surface Area BSA", xlab = "BSA in sqm", ylab = "BP in mm Hg"))

#generate the linear regression line between BP and BSA
abline(bplm, col = "blue")
#display the correlation graph
#pairs() outputs a correlation graph;
#pch = 19 means use filled circles, pch = 21 means use open circles;
#lower.panel = NULL removes the correlation graph in the lower triangle
pairs(bp[, 2:8], pch = 21, lower.panel = NULL)

#since we want to display the Pearson correlation coefficient r in the graph,
#we need to customize what will appear in the upper triangle, hence we set the
#config parameters of our upper panel
upper.panel <- function(x, y) {
  points(x, y, pch = 19)               #plot the points being correlated
  r <- round(cor(x, y), digits = 2)    #cor(x, y) computes the Pearson correlation between x and y
                                       #and the rounded value is assigned to variable r
  displayRtxt <- paste0("R = ", r)     #builds the label text "R = <r>"
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, displayRtxt)          #draws displayRtxt at position (0.5, 0.9) in the panel
}

pairs(bp[, 2:8], lower.panel = NULL, upper.panel = upper.panel)
#another way to show the correlation table, using sjPlot
#here we'll use sjPlot to show the correlation between each pair of variables
library(emmeans)
library(sjPlot)

bpdata <- data.frame(bp$BP, bp$Age, bp$Weight, bp$BSA, bp$Dur, bp$Pulse, bp$Stress)

sjt.corr(bpdata, triangle = "lower")
Notes about Correlation
• Correlation is a measure of the linear relationship between two variables; it does not necessarily state that one variable is caused by another. For example, a third variable, or a combination of other things, may be causing the two correlated variables to relate as they do.
• When there is no linear relationship between two variables, the correlation coefficient is 0 or near zero. It is important to remember that a correlation coefficient of 0 indicates that there is no linear relationship, but there may still be a strong relationship between the two variables. For example, there could be a quadratic relationship between them, as the sketch below shows.
• Therefore, the correlation coefficient is not always the best statistic to use to understand the relationship between variables.
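A tiny sketch of the quadratic case just mentioned (simulated data, not from the slides): the linear correlation is essentially zero even though y is completely determined by x.

x <- seq(-3, 3, by = 0.1)
y <- x^2          # a perfect quadratic relationship

cor(x, y)         # approximately 0: no linear relationship
plot(x, y)        # ...yet the relationship is clearly strong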
Multicollinearity
• Multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated
• Correlations of 0.8 and above suggest a strong relationship, and only one of the two variables is needed in the regression analysis
• Going back to our example, let's look at BP, Weight, BSA, and Stress. It seems that
• BP is strongly related to Weight and BSA
• BP is not related to Stress
• Weight and BSA are strongly related, so possible multicollinearity exists!
• BSA is not related to Stress
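One common way to quantify multicollinearity is the variance inflation factor (VIF). Here is a hedged sketch using vif() from the car package (an assumption; these slides do not introduce car), applied to the bp data loaded earlier:

library(car)   # assumed package, not covered in these slides

fit <- lm(BP ~ Weight + BSA + Stress, data = bp)
vif(fit)       # VIFs well above 5-10 for Weight and BSA would flag multicollinearity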
Characteristics of uncorrelated predictors
Given 2 predictors x1 and x2 of a response variable y:
• if the estimated slope coefficients b1 and b2 are the same regardless of the model used,
• if the standard errors se(b1) and se(b2) don't change much at all from model to model, and
• if the sum of squares SSR(x1) is the same as the sequential sum of squares SSR(x1|x2),
then the marginal contribution that x1 has in reducing the variability in the response y doesn't depend on the predictor x2.

Likewise, if SSR(x2) is the same as the sequential sum of squares SSR(x2|x1), then the marginal contribution that x2 has in reducing the variability in the response y doesn't depend on the predictor x1.

This means x1 and x2 are uncorrelated! (A short anova() sketch for checking this follows.)
Are Stress and BSA Uncorrelated?
• From the table, r = 0.018 for Stress and BSA, so they are uncorrelated
• What is the effect of this lack of correlation on our regression analysis?
Classification
Classification with R
• Decision trees: rpart, party
• Random forest: randomForest, party
• SVM: e1071, kernlab
• Neural networks: nnet, neuralnet, RSNNS
• Performance evaluation: ROCR
The Iris Dataset
# iris data
str(iris)

## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..:

# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
iris.train <- iris[ind == 1, ]
iris.test <- iris[ind == 2, ]
Build a Decision Tree
# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data = iris.train)

# NOTE: you must click Zoom to see the other labels, otherwise you will only see setosa
plot(iris.ctree)

# to print the tree logic
print(iris.ctree)
Prediction
#predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
#check prediction result
table(pred, iris.test$Species)
The Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It outputs a number of metrics, such as:
1. Accuracy
2. Kappa
3. Sensitivity
4. Specificity
5. Positive Predictive Value
6. Negative Predictive Value
7. Prevalence
etc...
More info about the Confusion Matrix and its corresponding metrics
• Confusion Matrix
https://www.youtube.com/watch?v=Kdsp6soqA7o
• Sensitivity and Specificity
https://www.youtube.com/watch?v=sunUKFXMHGk
• ROC and AUC
https://www.youtube.com/watch?v=xugjARegisk
What we have done so far…
• We used the iris dataset to demonstrate how to do classification with a decision tree, specifically using the ctree() function of the party package
• We split the data 70% for training and 30% for testing using random sampling
• We trained the model on the training dataset using all columns as predictors of our outcome variable, Species
• The trained model resulted in a tree with 4 terminal nodes
• The trained model was tested on our test dataset
• The Confusion Matrix gave very good test results (watch the videos to understand what a confusion matrix is):
• Accuracy = 94.74%
• Kappa = 92.02%
Area Under the Curve
• When we need to check or visualize the performance of a multi-class classification problem, we use the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristics) curve. It is also written as AUROC (Area Under the Receiver Operating Characteristics).
• It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with disease and without disease.
• The ROC curve is plotted with the TPR (True Positive Rate) against the FPR (False Positive Rate), where TPR is on the y-axis and FPR is on the x-axis.
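As a hedged sketch (not part of the original slides), here is one way to plot a ROC curve and compute the AUC with the ROCR package listed earlier, reducing iris to a two-class problem since ROC is defined for binary labels:

library(ROCR)

iris2 <- iris[iris$Species != "setosa", ]
iris2$Species <- factor(iris2$Species)   # two classes: versicolor vs virginica
fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = iris2, family = binomial)

probs <- predict(fit, type = "response")   # predicted probability of the second class
pred <- prediction(probs, iris2$Species)

plot(performance(pred, "tpr", "fpr"))      # ROC curve: TPR (y-axis) vs FPR (x-axis)
performance(pred, "auc")@y.values[[1]]     # the area under the curve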
Complete R Code for our iris dataset using ctree()

str(iris)

# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
iris.train <- iris[ind == 1, ]
iris.test <- iris[ind == 2, ]

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data = iris.train)
plot(iris.ctree)
print(iris.ctree)

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)

# check the prediction result
table(pred, iris.test$Species)

# compute the confusion matrix and its metrics
# (caret expects the predictions first and the reference labels second)
library(caret)
confusionMatrix(pred, iris.test$Species)
