BDA Unit 4

R Language - A Language for Data Analytics and Visualisation
Regression
Regression is used to predict an unknown value from known values.
Regression analysis is a statistical, predictive
modelling technique used to study the
relationship between a dependent variable and
one or more independent variables.
X:  0   2   4   6
Y:  0   4   ?   12

Find the value of Y when X = 4.

Dependent variable - ?
Independent variable - ?
To study the relationship between two or more variables using regression:

e.g. 1: Relationship between advertising expenditure and sales - as advertising expenditure increases, sales also increase.

e.g. 2: Relationship between the number of hours of practice and the number of errors - as the hours of practice increase, the number of errors decreases.
Objective - to develop a model that shows how the variables are related and to use it for prediction, e.g. predict sales for a given level of advertising expenditure.
Dependent variable (y): the variable we are trying to predict.
Independent variable (x): the variable we use to predict the dependent variable.

Dependent variable (y) -> sales, number of errors
Independent variable (x) -> advertising expenditure, hours of practice
Input dataset - housing price data set of New York City.
This data set contains information such as size, locality, number of bedrooms in the house, etc.
Task - to predict the price of the house.

Independent (or) predictor variables
Values that do not depend on any other variable, e.g. the number of bedrooms, the size of the house and so on.
These predictor variables are used to predict the response variable.

Dependent (or) response variable
Values that depend on the values of the independent variables.
The price of the house is the dependent variable.
Types of Regression Analysis

Linear Regression
Logistic Regression
Polynomial Regression
Linear Regression
It is one of the most basic and widely used
machine learning algorithms.
It is a predictive modeling technique used to
predict a continuous dependent variable, given
one or more independent variables.
Simple linear regression
One independent and one dependent variable.
Multiple linear regression
More than one independent variable and one
dependent variable.
In linear regression, the relationship between the dependent and independent variables is always linear; thus, when we plot their relationship, we observe more of a straight line than a curve.
Equation used to represent a linear regression model:
y = b0 + b1x
where b0 is the y-intercept and b1 is the slope.
Multiple linear regression
An extension of linear regression to relationships among more than two variables.
In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
Logistic Regression

Logistic regression is a machine learning algorithm used to solve classification problems.
It is a predictive analysis technique used to predict a dependent variable, given a set of independent variables, such that the dependent variable is categorical.
Polynomial Regression

Polynomial regression is a method used to handle non-linear data.
Non-linearly separable data is data for which you cannot draw a straight line to study the relationship between the dependent and independent variables.
It is called 'polynomial' regression because the power of some independent variables is more than 1.
Simple Linear Regression

One independent variable and one output variable.
It is named linear because the relationship is approximated using a straight line, y = b0 + b1x, where:
b0 is the y-intercept
b1 is the slope - tells whether the line is increasing or decreasing, and how steep it is


Hours Studied (x)   Grade on Exam (y)
2                   69
9                   98
5                   82
5                   77
3                   71
7                   84
1                   55
8                   94
6                   84
2                   64
Regression equation: ŷ = b0 + b1x

To calculate b1 and b0:
b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
b0 = ȳ - b1·x̄

x    y    x - x̄   y - ȳ   (x - x̄)(y - ȳ)   (x - x̄)²
2    69   -2.8    -8.8        24.64           7.84
9    98    4.2    20.2        84.84          17.64
5    82    0.2     4.2         0.84           0.04
5    77    0.2    -0.8        -0.16           0.04
3    71   -1.8    -6.8        12.24           3.24
7    84    2.2     6.2        13.64           4.84
1    55   -3.8   -22.8        86.64          14.44
8    94    3.2    16.2        51.84          10.24
6    84    1.2     6.2         7.44           1.44
2    64   -2.8   -13.8        38.64           7.84
48   778                     320.60          67.60

x̄ = 48/10 = 4.8, ȳ = 778/10 = 77.8
b1 = 320.6 / 67.6 ≈ 4.74
b0 = 77.8 - 4.74 × 4.8 = 55.048

To predict the grade when the number of hours studied = 3:
ŷ = 55.048 + 4.74 × 3 ≈ 69.268
Steps to Establish a Regression

1. Carry out the experiment of gathering a sample of observed values of the number of hours studied and the corresponding grade.
2. Create a relationship model using the lm() function in R.
3. Find the coefficients from the model created and build the mathematical equation using them.
4. Get a summary of the relationship model to know the average error in prediction (the residuals).
5. To predict the grade for new persons, use the predict() function in R.
Input data: the hours studied / exam grade table shown above.
lm() function
This function creates the relationship model
between the predictor and the response
variable.
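A minimal sketch of this step in R, using the hours/grade data from the table above (the variable names hours and grade are illustrative):

```r
# Observed values from the table above
hours <- c(2, 9, 5, 5, 3, 7, 1, 8, 6, 2)
grade <- c(69, 98, 82, 77, 71, 84, 55, 94, 84, 64)

# lm() builds the relationship model grade = b0 + b1 * hours
model <- lm(grade ~ hours)

coef(model)     # b0 ≈ 55.04, b1 ≈ 4.74, matching the hand calculation
summary(model)  # residuals, p-values and R-squared
```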
Is this enough to actually use this model?
How do we ensure that the model
generated is statistically significant?
Solution: p-values.
We can consider a linear model to be statistically significant only when both of these p-values (the p-value for the overall model and the p-values of the individual coefficients, as reported by summary()) are less than the pre-determined statistical significance level of 0.05.
Whenever there is a p-value, there is always a
Null and Alternate Hypothesis associated.
Null Hypothesis (H0)
proposes that there is no difference between certain
characteristics of a population or data-generating
process.
Alternate Hypothesis (H1)
Proposes that there is a difference.
Get the Summary of the Relationship Model
R-squared
Statistical measure which shows how close the data are to the
fitted regression line.
Known as the coefficient of determination, or the coefficient of
multiple determination for multiple regression.
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of
the response data around its mean.
100% indicates that the model explains all the variability of the
response data around its mean.
predict() function
To predict the value for new records
To predict the grade when number of
hours studied = 3
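Continuing the sketch above, the value for a new record can be obtained as:

```r
# Predict the grade when the number of hours studied = 3
predict(model, data.frame(hours = 3))  # ≈ 69.26, close to the hand-computed 69.268
```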
How good is the prediction? How well does the regression line fit the data?
Solution: the coefficient of determination, R² = SSR / SST, where:

SST = total sum of squares = Σ(y - ȳ)²
SSE = sum of squared errors = Σ(y - ŷ)²
SSR = sum of squares due to regression = Σ(ŷ - ȳ)²

SST = SSR + SSE
x    y    ŷ        y - ŷ    (y - ŷ)²   y - ȳ   (y - ȳ)²
2    69   64.528    4.472   19.9988    -8.8      77.44
9    98   97.708    0.292    0.0852    20.2     408.04
5    82   78.748    3.252   10.5755     4.2      17.64
5    77   78.748   -1.748    3.0555    -0.8       0.64
3    71   69.268    1.732    2.9998    -6.8      46.24
7    84   88.228   -4.228   17.8759     6.2      38.44
1    55   59.788   -4.788   22.9249   -22.8     519.84
8    94   92.968    1.032    1.0650    16.2     262.44
6    84   83.488    0.512    0.2621     6.2      38.44
2    64   64.528   -0.528    0.2788   -13.8     190.44
                   SSE =    79.1215   SST =    1599.60
SST = SSR + SSE
SSR = SST - SSE = 1599.6 - 79.1215 = 1520.4785

R² = SSR / SST = 1520.4785 / 1599.6 ≈ 0.95, so the line fits the data very well.
Multiple Regression

Multiple regression is an extension of linear regression to relationships among more than two variables.
In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
Steps to apply multiple linear regression in R:
1. Collect the data
2. Capture the data in R
3. Apply the multiple linear regression in R
4. Make a prediction
Step 1: Collect the data.
The goal is to predict the Stock_Index_Price (the dependent variable) of a fictitious economy based on two independent/input variables:
Interest_Rate
Unemployment_Rate

Y = b0 + b1·X1 + b2·X2
where:
b0 - Y-intercept
X1 - Interest_Rate
X2 - Unemployment_Rate
Year Month Interest_Rate Unemployment_Rate Stock_Index_Price
2017 12 2.75 5.3 1464
2017 11 2.5 5.3 1394
2017 10 2.5 5.3 1357
2017 9 2.5 5.3 1293
2017 8 2.5 5.4 1256
2017 7 2.5 5.6 1254
2017 6 2.5 5.5 1234
2017 5 2.25 5.5 1195
2017 4 2.25 5.5 1159
2017 3 2.25 5.6 1167
2017 2 2 5.7 1130
2017 1 2 5.9 1075
2016 12 2 6 1047
2016 11 1.75 5.9 965
2016 10 1.75 5.8 943
2016 9 1.75 6.1 958
2016 8 1.75 6.2 971
2016 7 1.75 6.1 949
2016 6 1.75 6.1 884
2016 5 1.75 6.1 866
2016 4 1.75 5.9 876
2016 3 1.75 6.2 822
2016 2 1.75 6.2 704
2016 1 1.75 6.1 719
Step 2: Capture the data in R
Step 3: Apply multiple linear regression in R.
Multiple linear regression equation: Stock_Index_Price = b0 + b1·Interest_Rate + b2·Unemployment_Rate
Step 4: Make a prediction.
Predict the stock index price for the
following data:
Interest Rate = 1.5 (i.e., X1= 1.5)
Unemployment Rate = 5.8 (i.e., X2= 5.8)
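A hedged sketch of Steps 2 to 4, assuming the data are keyed in with the column names used in the table above:

```r
# Step 2: capture the data in R (values from the table above)
stock <- data.frame(
  Interest_Rate = c(2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25,
                    2, 2, 2, rep(1.75, 11)),
  Unemployment_Rate = c(5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6,
                        5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1,
                        5.9, 6.2, 6.2, 6.1),
  Stock_Index_Price = c(1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195,
                        1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971,
                        949, 884, 866, 876, 822, 704, 719)
)

# Step 3: fit Y = b0 + b1*X1 + b2*X2
model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate,
            data = stock)
summary(model)  # coefficients b0, b1, b2 and their p-values

# Step 4: predict the stock index price for X1 = 1.5, X2 = 5.8
predict(model, data.frame(Interest_Rate = 1.5, Unemployment_Rate = 5.8))
```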
Example - II

By using sample.split() we are actually creating a vector with two values, TRUE and FALSE.
By setting the SplitRatio to 0.7, you are splitting the original Revenue dataset of 1000 rows into 70% training and 30% testing data.
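A hedged sketch of the split, assuming a data frame revenue with 1000 rows and a Revenue column (sample.split() comes from the caTools package):

```r
library(caTools)

set.seed(123)  # reproducible split
# TRUE/FALSE vector: TRUE marks the 70% training portion
split <- sample.split(revenue$Revenue, SplitRatio = 0.7)

train <- subset(revenue, split == TRUE)   # ~700 rows
test  <- subset(revenue, split == FALSE)  # ~300 rows
```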
Logistic Regression
Use case - College Admission using Logistic Regression
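The slides do not include the code; as a hedged sketch, admission data with columns admit (0/1), gre and gpa (hypothetical names and file) could be modeled as:

```r
# Hypothetical admission data: admit (0/1), gre score, gpa
admission <- read.csv("college_admission.csv")  # assumed file

# family = binomial makes glm() fit a logistic regression
model <- glm(admit ~ gre + gpa, data = admission, family = binomial)
summary(model)

# Predicted probability of admission for a new applicant
predict(model, data.frame(gre = 700, gpa = 3.7), type = "response")
```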
Polynomial Regression

Polynomial regression is a special case of linear regression where the relationship between X and Y is modeled using a polynomial rather than a line.
It can be used when the relationship between X and Y is nonlinear, although this is still considered a special case of multiple linear regression.
But what if your linear regression model cannot model the relationship between the target variable and the predictor variable? In other words, what if they don't have a linear relationship?
Linear regression:
y = θ0 + θ1x

Polynomial regression:
y = θ0 + θ1x + θ2x² + ... + θnxⁿ
where:
θ0 is the bias,
θ1, θ2, …, θn are the weights in the equation of the polynomial regression, and
n is the degree of the polynomial.
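A minimal sketch of a degree-2 polynomial fit in R (the x and y vectors are illustrative):

```r
set.seed(1)
# Illustrative data with a curved (quadratic) relationship
x <- 1:10
y <- 3 + 2 * x + 0.5 * x^2 + rnorm(10, sd = 1)

# Fit y = theta0 + theta1*x + theta2*x^2
model <- lm(y ~ poly(x, 2, raw = TRUE))
coef(model)  # estimates of theta0, theta1, theta2
```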
Co-variance

Covariance is a measure of how much two random variables vary together.
Values lie between -infinity and +infinity.
A positive value shows that both variables vary in the same direction; a negative value shows that they vary in opposite directions.
It is a measure of correlation:

cov(x, y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
Correlation

Correlation is a statistical measure that indicates how strongly two variables are related.
Values lie between -1 and +1.
Correlation(x, y) = cov(x, y) / sqrt(var(x) × var(y))
= 112.33 / sqrt(331.28 × 48.78)
= 112.33 / sqrt(16159.8384)
= 112.33 / 127.121
= 0.88

A value of 0.88 shows that the strength of the correlation between temperature and the number of customers is very strong.
Pearson’s correlation is a parametric measure of the linear association between two numeric variables.
Spearman’s rank correlation is a non-parametric measure of the monotonic association (increase [or decrease] in the same direction, but not always at the same rate) between two numeric variables.
Kendall’s rank correlation is another non-parametric measure of association, based on concordance or discordance (comparing two pairs of data points to see if they "match") of x-y pairs.
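In R, cov() and cor() compute these directly; cor() takes a method argument for all three measures (the x and y vectors are illustrative stand-ins for the temperature/customers data):

```r
x <- c(25, 27, 30, 33, 36)       # e.g. temperature
y <- c(110, 115, 125, 130, 142)  # e.g. number of customers

cov(x, y)                       # covariance (sample version, divides by n - 1)
cor(x, y, method = "pearson")   # linear association
cor(x, y, method = "spearman")  # monotonic, rank-based association
cor(x, y, method = "kendall")   # concordance/discordance of x-y pairs
```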
Hypothesis Testing

It is a statistical method used to make statistical decisions from experimental data.
Hypothesis testing is basically an assumption that we make about a population parameter.

Null hypothesis (H0):
A null hypothesis proposes that no statistical significance exists in a set of given observations.
Alternative hypothesis (Ha or H1):
An alternative hypothesis proposes that statistical significance does exist in a set of given observations.
t-test

Consider a telecom company that has two service centers in the city.
The company wants to find whether the average time required to service a customer is the same in both stores.
The company measures the average time taken by 50 random customers in each store.
Store A takes 22 minutes while Store B averages 25 minutes.
Can we say that Store A is more efficient than Store B in terms of customer service?
This is where the t-test comes into play.
It helps us understand whether the difference between two sample means is actually real or simply due to chance.
Types of t-tests
One sample t-test
Independent two-sample t-test
Paired sample t-test
One sample t-test
In a one-sample t-test, we compare the
average (or mean) of one group against the
set average (or mean).
This set average can be any theoretical value
(or it can be the population mean).
Consider the following example – A research
scholar wants to determine if the average eating
time for a (standard size) burger differs from a
set value.
Let’s say this value is 10 minutes.
How do you think the research scholar can go
about determining this?
He/she can broadly follow the below
steps:
Select a group of people
Record the individual eating time of a
standard size burger
Calculate the average eating time for the
group
Finally, compare that average value with the
set value of 10.
t = (m - µ) / (s / √n)

where,
t = t-statistic
m = mean of the group
µ = theoretical value or population mean
s = standard deviation of the group
n = group size or sample size

Once we have calculated the t-statistic value, the next task is to compare it with the critical value of the t-test. We can find this in a t-table against the degrees of freedom (n - 1) and the level of significance.
Degrees of freedom
the number of values that are free to vary in a data
set
Implementing the One-Sample t-test in R

A mobile manufacturing company has taken a sample of mobiles of the same model from the previous month’s data.
They want to check whether the average screen size of the sample differs from the desired length of 10 cm.
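A hedged sketch; the real sample is not reproduced in the slides, so a simulated vector of 1000 screen sizes stands in:

```r
set.seed(100)
# Stand-in for the sample: 1000 screen sizes (cm)
screen_size <- rnorm(1000, mean = 10, sd = 0.5)

# One-sample t-test against the desired length of 10 cm
t.test(screen_size, mu = 10)
```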
t-statistic -> -0.39548.
The degrees of freedom here are 999 and the confidence level is 95%.
The t-critical value is 1.962.
Since the absolute value of the t-statistic is less than the t-critical value, we fail to reject the null hypothesis and can conclude that the average screen size of the sample does not differ from 10 cm.
We can also verify this from the p-value, which is greater than 0.05. Therefore, we fail to reject the null hypothesis at a 95% confidence level.
Independent Two-Sample t-test

The two-sample t-test is used to compare the means of two different samples.
Let’s say we want to compare the average height of the male employees to the average height of the female employees.
The two groups do not have to be the same size, but the pooled form of the test assumes they share a common variance.
This is where a two-sample t-test is used.
t = (mA - mB) / √(S²/nA + S²/nB)

where,
mA and mB are the means of the two samples,
nA and nB are the sample sizes, and
S² is an estimator of the common variance of the two samples:

S² = [Σ(x - mA)² + Σ(x - mB)²] / (nA + nB - 2)
For this section, we will work with data
about two samples of the various models
of a mobile phone.
We want to check whether the mean
screen size of sample 1 differs from the
mean screen size of sample 2.
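A hedged sketch with simulated stand-ins for the two samples:

```r
set.seed(100)
sample1 <- rnorm(1000, mean = 10, sd = 0.5)  # screen sizes, sample 1 (cm)
sample2 <- rnorm(1000, mean = 10, sd = 0.5)  # screen sizes, sample 2 (cm)

# Independent two-sample t-test with a pooled (common) variance
t.test(sample1, sample2, var.equal = TRUE)
```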
We can confirm that the absolute value of the t-statistic is again less than the t-critical value, so we fail to reject the null hypothesis.
Hence, we can conclude that there is no difference between the mean screen sizes of the two samples.
We can verify this again using the p-value. It comes out to be greater than 0.05; therefore we fail to reject the null hypothesis at a 95% confidence level.
There is no difference between the means of the two samples.
Paired t-test

Here, we measure one group at two different times.
We compare separate means for a group at two different times or under two different conditions.
A certain manager realized that the productivity level of his employees was trending significantly downwards. This manager decided to conduct a training program for all his employees, with the aim of increasing their productivity levels.
How will the manager measure whether the productivity levels increased? Just compare the productivity level of the employees before versus after the training program.
Here, we are comparing the same sample (the employees) at two different times (before and after the training).
t = m / (s / √n)

where,
t = t-statistic
m = mean of the differences (after - before)
s = standard deviation of the differences
n = group size or sample size
Degrees of freedom = n - 1
As an example, 20 mice received a treatment X during 3 months. We want to know whether treatment X has an impact on the weight of the mice.
To answer this question, the weight of the 20 mice was measured before and after the treatment. This gives us 20 values before treatment and 20 values after treatment, from measuring the weight of the same mice twice.
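A hedged sketch with simulated stand-ins for the before/after weights:

```r
set.seed(100)
before <- rnorm(20, mean = 20, sd = 2)          # weight before treatment (g)
after  <- before + rnorm(20, mean = 2, sd = 1)  # treatment adds weight on average

# Paired t-test: H0 is that the mean difference is zero
t.test(before, after, paired = TRUE)
```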
The p-value is less than 0.05.
We can reject the null hypothesis at a 95% confidence level and conclude that there is a significant difference between the mean weight before and after the treatment.
K-means Clustering in R

Can you distinguish between the 3 species of the iris flower using machine learning?

k-means Clustering
Features:
An initial set of clusters is randomly chosen.
Iteratively, items are moved among the sets of clusters until the desired set is reached.
A high degree of similarity among the elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
Strength
Time complexity O(tkn), where t is the number of iterations, k the number of clusters, and n the number of items.
Often terminates at a local optimum.
Weakness
Does not work on categorical data.
Only convex-shaped clusters are found.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.

K-modes
A variation of K-means that handles categorical data by using modes instead of means.
k-means Clustering - Example

{2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
Randomly assign means: m1 = 2, m2 = 4

m1     m2     K1                    K2
2      4      {2,3}                 {4,10,12,20,30,11,25}
2.5    16     {2,3,4}               {10,12,20,30,11,25}
3      18     {2,3,4,10}            {12,20,30,11,25}
4.75   19.6   {2,3,4,10,11,12}      {20,30,25}
7      25     {2,3,4,10,11,12}      {20,30,25}

Stop, as the clusters with these means are the same as in the previous iteration.
Our answer: K1 = {2,3,4,10,11,12}, K2 = {20,30,25}
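A sketch of the same example with R’s kmeans(), plus the iris question from the start of this section (cluster labels are arbitrary):

```r
# The worked example: k = 2, starting means 2 and 4
x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
km <- kmeans(x, centers = c(2, 4))
km$cluster  # cluster assignment of each point
km$centers  # final means, expected to be 7 and 25 as in the trace above

# k-means on the iris measurements with k = 3 (one cluster per species)
km_iris <- kmeans(iris[, 1:4], centers = 3)
table(km_iris$cluster, iris$Species)
```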
Association Rules

Association rules are used to show the relationships between data items.
Association rules detect common usage of data items.
E.g. the purchasing of one product when another product is purchased represents an association rule.
Association Rules
Association rules have their most direct application in retail businesses.
They are used to assist in marketing, advertising, floor placement and inventory control.
E.g., from the transaction history several association rules can be derived:
100% of the time that PeanutButter is purchased, so is Bread.
33% of the time that PeanutButter is purchased, Jelly is also purchased.
Association Rules - Example

The database in which an association rule is to be found can be viewed as a set of tuples, where each tuple contains a set of items.
Here, each tuple represents the list of items purchased at one time.
Support:
The Support of an item (or set of items) is
the percentage of transactions in which that
item (or items) occurs.
Association Rule - Definition

Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, an association rule is an implication of the form X ➔ Y, where X, Y ⊂ I are sets of items called itemsets and X ∩ Y = ∅.
The support (s) for an association rule X ➔ Y is the percentage of transactions in the database that contain X ∪ Y.
NOTE: the support of X ➔ Y is the same as the support of Y ➔ X.
Association Rule - Introduction

Confidence (or) Strength
The confidence (or strength, α) for an association rule X ➔ Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
Selecting Association Rules

The selection of association rules is based on support and confidence.
Confidence measures the strength of the rule, whereas support measures how often it occurs in the database.
Typically, a large confidence value and a smaller support are used.
Rules that satisfy both a minimum support and a minimum confidence are called strong rules.
Association Rule - Example
Apriori Algorithm
The most well-known association rule algorithm; it is used in most commercial products.
It uses the large itemset property:
Any subset of a large itemset must be large.
To perform Association Rule Mining in R,
we use the arules and
the arulesViz packages in R.
If you don’t have these packages installed
in your system, please use the following
commands to install them.
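A minimal sketch with the packages named above, using the Groceries transaction data that ships with arules (the support/confidence thresholds are illustrative):

```r
# install.packages(c("arules", "arulesViz"))  # run once if not installed
library(arules)
library(arulesViz)

data("Groceries")  # built-in transaction data

# Mine rules meeting minimum support and confidence (strong rules)
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

inspect(sort(rules, by = "confidence")[1:5])  # top 5 rules by confidence
plot(rules)                                   # support/confidence scatter plot
```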
Decision tree based algorithms

Given:
D = {t1, …, tn} where ti = <ti1, …, tih>
Database schema contains {A1, A2, …, Ah}
Classes C = {C1, …, Cm}
A decision (or classification) tree is a tree associated with D such that:
Each internal node is labeled with an attribute Ai.
Each arc is labeled with a predicate which can be applied to the attribute at its parent.
Each leaf node is labeled with a class Cj.

Solving the classification problem using decision trees is a two-step process:
1. Decision tree induction: construct a DT using training data.
2. For each ti ∈ D, apply the DT to determine its class.
DT Splits Area

[Figure: the data space is split first on Gender (M / F), then on Height.]
Comparing DTs

[Figure: comparing decision trees - balanced vs. deep.]
Random Forests

Random forest is a supervised learning algorithm which is used for both classification and regression, but mainly for classification problems.
A forest is made up of trees, and more trees mean a more robust forest.
Similarly, the random forest algorithm creates decision trees on data samples, gets the prediction from each of them, and finally selects the best solution by means of voting.
It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.
Random Forest

Random forest is an ensemble machine learning algorithm.
It operates by building multiple decision trees.
It works for both:
Classification
Regression
In banking, it is used to predict fraudulent customers.
It is used in analysing the symptoms of patients and detecting disease.
In e-commerce, recommendations are based on customer activity.
Stock market trends can be analysed to predict profit or loss.
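A hedged sketch using the randomForest package on the built-in iris data:

```r
library(randomForest)  # install.packages("randomForest") if needed

set.seed(100)
# 500 trees are grown on bootstrap samples; the forest votes on the class
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

print(rf)       # out-of-bag error estimate and confusion matrix
importance(rf)  # which predictors matter most
```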
ANOVA

Analysis of Variance (ANOVA) is a parametric statistical technique used to compare datasets.
This technique was invented by R.A. Fisher, and is thus often referred to as Fisher’s ANOVA as well.
It is similar in application to techniques such as the t-test, in that it is used to compare means and the relative variance between them.
However, ANOVA is best applied where more than 2 populations or samples are to be compared.
ANOVA

ANOVA is a statistical test for estimating how a quantitative dependent variable changes according to the levels of one or more categorical independent variables.
ANOVA tests whether there is a difference in the means of the groups at each level of the independent variable.
The null hypothesis (H0) of the ANOVA is that there is no difference in means, and the alternate hypothesis (Ha) is that the means are different from one another.
One-way analysis:
When we are comparing three or more groups based on one factor variable, it is said to be a one-way analysis of variance (ANOVA).
For example, if we want to compare whether or not the mean output of three workers is the same, based on the working hours of the three workers.

Two-way analysis:
When there are two factor variables, it is said to be a two-way analysis of variance (ANOVA).
For example, based on working condition and working hours, we can compare whether or not the mean output of three workers is the same.

K-way analysis:
When there are k factor variables, it is said to be a k-way analysis of variance (ANOVA).
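A hedged sketch of one-way and two-way ANOVA with aov(), assuming a data frame output_data with a numeric output column and factor columns hours and condition:

```r
# One-way ANOVA: does mean output differ across working-hours groups?
one_way <- aov(output ~ hours, data = output_data)
summary(one_way)  # F-statistic and p-value for the hours factor

# Two-way ANOVA: working hours and working condition as two factors
two_way <- aov(output ~ hours + condition, data = output_data)
summary(two_way)
```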
Dealing with Missing Values

A common task in data analysis is dealing with missing values.
In R, missing values are often represented by NA, or by some sentinel value that stands for missing (e.g. 99).
Test for missing values

To identify missing values, use is.na(), which returns a logical vector with TRUE in the element locations that contain missing values represented by NA.
is.na() works on vectors, lists, matrices, and data frames.
For data frames, a convenient shortcut to compute the total missing values in each column is colSums(), as shown below.
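A minimal sketch of both checks (the vector and data frame are illustrative):

```r
# is.na() flags each missing element
x <- c(1, 2, NA, 4, NA)
is.na(x)            # FALSE FALSE TRUE FALSE TRUE

# For data frames, colSums(is.na(...)) totals the NAs per column
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
colSums(is.na(df))  # a: 1, b: 2
```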
Recode missing values
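A short sketch of recoding a sentinel value to NA (the value 99 and the data frame df are illustrative):

```r
# Data frame where 99 was used to represent missing values
df <- data.frame(a = c(1, 99, 3), b = c(4, 5, 99))

# Recode the sentinel value to NA
df[df == 99] <- NA
df
```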
Exclude missing values

We can exclude missing values in a couple of different ways.
First, if we want to exclude missing values from mathematical operations, use the na.rm = TRUE argument; if you do not exclude these values, most functions will return NA.
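A minimal sketch (the vector and data frame are illustrative):

```r
x <- c(1, 2, NA, 4)

mean(x)                # returns NA because of the missing value
mean(x, na.rm = TRUE)  # 2.3333: NA excluded from the calculation

# na.omit() drops incomplete rows from a data frame
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
na.omit(df)            # keeps only the complete first row
```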
Non-linear Least Squares

When modeling real-world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph.
Most of the time, the plot of the model gives a curve rather than a line.
The goal of both linear and non-linear regression is to adjust the values of the model’s parameters to find the line or curve that comes closest to your data.
In least squares regression, we establish a regression model in which the sum of the squares of the vertical distances of the points from the regression curve is minimized.
We generally start with a defined model and assume some values for the coefficients.
We then apply the nls() function of R to get more accurate values along with the confidence intervals, as in the sketch below.
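A minimal sketch, with illustrative data generated from an assumed model y = a·x² + b:

```r
set.seed(100)
# Illustrative curved data around y = 2.5 * x^2 + 3
x <- 1:10
y <- 2.5 * x^2 + 3 + rnorm(10, sd = 2)

# Start from assumed coefficient values; nls() refines them
fit <- nls(y ~ a * x^2 + b, start = list(a = 1, b = 1))

summary(fit)  # more accurate values of a and b
confint(fit)  # confidence intervals for the coefficients
```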
