Data Analytics Unit 3
Unit 5
Data Analysis
R programming language

# Create a list.
list1 <- list(c(2,5,3), 21.3, sin)
# Print the list.
print(list1)

# Create a matrix.
M <- matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

# Create an array.
a <- array(c('green','yellow'), dim = c(3,3,2))
print(a)

# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78)
)
print(BMI)
Regression Versus Classification

Regression Modelling
Regression analysis is a set of statistical
processes for estimating the relationships
between a dependent variable (often called
the 'outcome' or 'response' variable) and
one or more independent variables (often
called 'predictors', 'covariates', 'explanatory
variables' or 'features').
Regression Modeling Steps
1. Define problem or question
2. Specify model
3. Collect data
4. Do descriptive data analysis
5. Estimate unknown parameters
6. Evaluate model
7. Use model for prediction
Simple vs. Multiple Regression
• Simple: β1 represents the unit change in Y per unit change in X. It does not take into account any other variable besides the single independent variable.
• Multiple: βi represents the unit change in Y per unit change in Xi, taking into account the effect of the other independent variables. It is the "net regression coefficient."
Linearity - the Y variable is linearly related to the
value of the X variable.
y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.
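A minimal R sketch of this model: the parameter values and error standard deviation below are illustrative assumptions, used only to show that lm() recovers β0 and β1 from data generated by y = β0 + β1x + ε.

# Simulate data from y = b0 + b1*x + e and fit it (assumed demo values).
set.seed(1)
b0 <- 10; b1 <- 5
x <- 1:30
e <- rnorm(30, mean = 0, sd = 2)   # random error term
y <- b0 + b1 * x + e
fit <- lm(y ~ x)                   # estimates b0 and b1 from the data
coef(fit)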
Simple Linear Regression Equation

E(y) = β0 + β1x

• Positive Linear Relationship: the regression line for E(y) has intercept β0 and a positive slope β1.
• Negative Linear Relationship: the regression line for E(y) has intercept β0 and a negative slope β1.
• No Relationship: the regression line for E(y) is horizontal, with intercept β0 and slope β1 = 0.
Least Squares Method
• Least Squares Criterion

min Σ(yi − ŷi)²

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
Least Squares Method
• Slope for the Estimated Regression Equation

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable

• y-Intercept for the Estimated Regression Equation

b0 = ȳ − b1x̄
Example data:
x: 2, 5, 3, 5, 1, 6
y: 4, 7, 6, 8, 4, 9
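As a quick illustration of these formulas, the slope and intercept for the small x/y table above can be computed directly in R (a sketch; lm() is included only as a cross-check):

# Apply the least-squares formulas to the data table above.
x <- c(2, 5, 3, 5, 1, 6)
y <- c(4, 7, 6, 8, 4, 9)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))   # should agree with b0 and b1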
Simple Linear Regression

Number of TV Ads (x)   Number of Cars Sold (y)
1                      14
3                      24
2                      18
1                      17
3                      27

Σx = 10, Σy = 100
x̄ = 2, ȳ = 20
Estimated Regression Equation
• Slope for the Estimated Regression Equation:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

• y-Intercept for the Estimated Regression Equation:

b0 = ȳ − b1x̄ = 20 − 5(2) = 10

• Estimated Regression Equation:

ŷ = 10 + 5x
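The same fitted line can be reproduced in R with lm(); a minimal sketch using the five observations above:

# TV ads example: verify that the fitted line is y-hat = 10 + 5x.
ads  <- c(1, 3, 2, 1, 3)        # number of TV ads (x)
cars <- c(14, 24, 18, 17, 27)   # number of cars sold (y)
fit <- lm(cars ~ ads)
coef(fit)                           # intercept 10, slope 5
predict(fit, data.frame(ads = 4))   # predicted cars sold for 4 ads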
Coefficient of
Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
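Continuing the TV ads example, the sums of squares and r² can be computed directly in R (a short sketch; here SSR = 100 and SST = 114, so r² ≈ 0.88):

# Decompose SST = SSR + SSE and compute the coefficient of determination.
ads  <- c(1, 3, 2, 1, 3)
cars <- c(14, 24, 18, 17, 27)
fit  <- lm(cars ~ ads)
yhat <- fitted(fit)
SST <- sum((cars - mean(cars))^2)   # total sum of squares
SSR <- sum((yhat - mean(cars))^2)   # sum of squares due to regression
SSE <- sum((cars - yhat)^2)         # sum of squares due to error
c(SST = SST, SSR = SSR, SSE = SSE, r2 = SSR / SST)
summary(fit)$r.squared              # same r2 reported by lm()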
Supervised vs. Unsupervised Learning
• Supervised: e.g. a fruit classifier trained with known labels such as orange, apples, bananas.
• Unsupervised: e.g. a fruit classifier that has seen lots of examples but no proper labels; it groups similar items, like clustering "red fruits" or "fruit with soft skin".
Supervised learning
• Classification: draw conclusions such as spam or not, red or blue. Algorithms: Naïve Bayes, Decision tree, SVM.
• Regression: the output variable is a real or continuous value, such as marks or weight. Algorithms: Linear regression, Polynomial regression, SVM Regression.
What Is Naive Bayes?
Medical Diagnosis
• Given a list of symptoms, predict whether a patient has disease X or not
Weather
• Based on temperature, humidity, etc… predict if it will rain tomorrow
Inputs: features x1 and x2; output: label Y.
NAÏVE BAYES EXAMPLE:
To predict days suitable for a football match based on weather conditions.
Smaller circle: low probability to play (P < 0.5)
Big circle: high probability to play (P > 0.5)
Combining both the conditions gives the prior probability of the class (play): 0.60.
Predict the likelihood to play football on (Season = Winter, Sunny = No, Windy = Yes).
What is the probability of the match not being played?
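A hedged sketch of such a prediction in R with the naiveBayes() function from the e1071 package; the tiny weather table below is made-up illustrative data, not the dataset behind the slide figures.

# Naive Bayes on a toy "play football" table (illustrative data only).
library(e1071)
weather <- data.frame(
  Season = c("Winter", "Summer", "Winter", "Rainy", "Summer"),
  Sunny  = c("No",     "Yes",    "Yes",    "No",    "Yes"),
  Windy  = c("Yes",    "No",     "No",     "Yes",   "No"),
  Play   = c("No",     "Yes",    "Yes",    "No",    "Yes"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(Play ~ Season + Sunny + Windy, data = weather)
# Posterior probabilities for (Season = Winter, Sunny = No, Windy = Yes)
predict(model,
        newdata = data.frame(Season = "Winter", Sunny = "No", Windy = "Yes"),
        type = "raw")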
Face recognition
Mail classification
Handwriting analysis
Salary prediction
Statistical Learning: Bayesian Network
A simple graphical representation for a joint probability distribution.
• Nodes are random variables
• Directed edges between nodes reflect dependence
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– if there is a link from x to y, x is said to be a parent of y
– a conditional distribution for each node given its parents:
P (Xi | Parents (Xi ))
Find the probability that John calls and Mary calls and the alarm
went off and there is no burglary and no earthquake.
P(J, M, A, ¬B, ¬E)
= P(J | A) * P(M | A) * P(A | ¬B, ¬E) * P(¬B) * P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
≈ 0.00063
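This arithmetic is easy to check in R using the conditional-probability values quoted above:

# P(J, M, A, not B, not E) for the burglar-alarm network
p_J_given_A    <- 0.90
p_M_given_A    <- 0.70
p_A_given_nBnE <- 0.001
p_notB         <- 0.999
p_notE         <- 0.998
p_J_given_A * p_M_given_A * p_A_given_nBnE * p_notB * p_notE  # about 0.00063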
Inference and Bayesian
Networks
• Bayesian networks are a type of probabilistic
graphical model that uses Bayesian inference for
probability computations.
• Bayesian networks aim to model conditional
dependence, and therefore causation, by
representing conditional dependence by edges in a
directed graph.
• Through these relationships, one can efficiently
conduct inference on the random variables in the
graph through the use of factors.
Inference and Bayesian
Networks
• A Bayesian network is a directed acyclic graph in
which each edge corresponds to a conditional
dependency, and each node corresponds to a unique
random variable.
• Formally, if an edge (A, B) exists in the graph
connecting random variables A and B, it means that
P(B|A) is a factor in the joint probability distribution,
so we must know P(B|A) for all values of B and A in
order to conduct inference.
• For example, in a network where Rain has an edge going into
WetGrass, P(WetGrass | Rain) will be a factor, whose probability values
are specified next to the WetGrass node in a conditional probability
table.
• Support Vector Machine, abbreviated as SVM, can be used for both regression and
classification tasks.
• In logistic regression, we take the output of the linear function and squash the value
into the range [0, 1] using the sigmoid function.
• If the squashed value is greater than a threshold value (0.5), we assign it the label 1;
otherwise we assign it the label 0.
• In SVM, we take the output of the linear function; if that output is greater than 1,
we identify it with one class, and if the output is less than −1, we identify it with the other class.
• Since the threshold values are changed to 1 and −1 in SVM, we obtain a reinforcement
range of values ([−1, 1]) which acts as the margin.
Maximum Margin:
Formalization
w: decision hyperplane normal vector
Margin
• Distance from an example x to the separator is r = y(wᵀx + b) / ||w||.
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between support vectors of the classes.
Derivation of finding r:
The dotted line x′ − x is perpendicular to the decision boundary, so it is parallel to w.
The unit vector is w/||w||, so the line is x′ = x − yrw/||w||.
x′ satisfies wᵀx′ + b = 0. So
wᵀ(x − yrw/||w||) + b = 0
Recall that ||w|| = sqrt(wᵀw).
So wᵀx − yr||w|| + b = 0
Solving for r gives:
r = y(wᵀx + b)/||w||
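A quick numerical check of this distance formula; the hyperplane (w, b), the point x, and the label y below are arbitrary illustrative choices:

# Signed distance from a point to the hyperplane w'x + b = 0, scaled by the label y.
w <- c(3, 4)          # assumed normal vector, ||w|| = 5
b <- -5               # assumed offset
x <- c(3, 4)          # assumed data point
y <- 1                # assumed class label (+1)
r <- y * (sum(w * x) + b) / sqrt(sum(w * w))
r                     # (9 + 16 - 5) / 5 = 4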
Linear SVM Mathematically: the linearly separable case
• Assume that all data is at least distance 1 from the hyperplane; then the following
two constraints follow for a training set {(xi, yi)}:
wᵀxi + b ≥ 1 if yi = 1
wᵀxi + b ≤ −1 if yi = −1
• For support vectors the inequalities hold with equality, and the margin is ρ = 2/||w||.
• The decision hyperplane is wᵀx + b = 0; the support vectors on either side satisfy
wᵀxa + b = 1 and wᵀxb + b = −1.
• Extra scale constraint:
min i=1,…,n |wᵀxi + b| = 1
• This implies:
wᵀ(xa − xb) = 2
ρ = ||xa − xb||2 = 2/||w||2
Solving the Optimization Problem
Find w and b such that
Φ(w) = ½ wᵀw is minimized,
and for all {(xi, yi)}: yi(wᵀxi + b) ≥ 1

The solution has the form f(x) = Σαiyixiᵀx + b.
• Notice that it relies on an inner product between the test point x and the support vectors xi.
• We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj
between all pairs of training points.
Classification with SVMs
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with
non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only
inside inner products:

Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiᵀxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

The classifier is then f(x) = Σαiyixiᵀx + b.
Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great:
Non-linear SVMs:
Feature spaces
• General idea: the original feature space can always be mapped
to some higher-dimensional feature space where
the training set is separable:
Φ: x → φ(x)
The “Kernel
Trick”
• The linear classifier relies on an inner product between vectors K(xi, xj) = xiᵀxj.
• If every datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the
inner product becomes:
K(xi, xj) = φ(xi)ᵀφ(xj)
• A kernel function is some function that corresponds to an inner product in some expanded feature
space.
• Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
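The identity can be verified numerically in R for any pair of 2-dimensional vectors; the two sample vectors below are arbitrary:

# Check that (1 + xi'xj)^2 equals phi(xi)'phi(xj) for the mapping above.
phi <- function(x) c(1, x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2,
                     sqrt(2) * x[1], sqrt(2) * x[2])
xi <- c(1, 2)                 # sample vector (assumed)
xj <- c(3, 4)                 # sample vector (assumed)
(1 + sum(xi * xj))^2          # kernel value
sum(phi(xi) * phi(xj))        # inner product in the expanded feature space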
Kernels
Common kernels
• Linear: K(x, z) = xᵀz
• Polynomial: K(x, z) = (1 + xᵀz)^d
• Gives feature conjunctions
• Radial basis function (infinite dimensional space)
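As a sketch of how these kernels are chosen in practice, the svm() function in the e1071 package exposes them directly; the built-in iris data and the tuning values below are illustrative assumptions, not part of the slides.

# Fitting SVMs with different kernels on the built-in iris data.
library(e1071)
svm_linear <- svm(Species ~ ., data = iris, kernel = "linear")
svm_poly   <- svm(Species ~ ., data = iris, kernel = "polynomial", degree = 2)
svm_rbf    <- svm(Species ~ ., data = iris, kernel = "radial")
# Training accuracy of the RBF model (for illustration only)
mean(predict(svm_rbf, iris) == iris$Species)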
Objectives:
• To understand how a time series works and what factors affect a certain variable (or variables) at different points in time.
• Time series analysis provides insight into the features of a given dataset that change over time.
• It supports predicting future values of the time-series variable.
• Assumption: there is one and only one assumption, "stationarity", which means that shifting the origin of time does not affect the statistical properties of the process.
How to analyze Time Series?
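One common starting point in R, shown here on the built-in AirPassengers series (an illustrative sketch, not the course dataset):

# Decompose a monthly series into trend, seasonal and random components.
data(AirPassengers)
class(AirPassengers)              # already a "ts" object with frequency 12
parts <- decompose(AirPassengers) # classical moving-average decomposition
plot(parts)
acf(AirPassengers)                # autocorrelation: a quick stationarity check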
Classical set
1. A classical set is a collection of distinct objects, for example, the set of students with passing grades.
2. Each individual entity in a set is called a member or an element of the set.
3. A classical set is defined in such a way that the universe of discourse is split into two groups, members and non-members. Hence, in the case of classical sets, no partial membership exists.
4. Let A be a given set. The membership function used to define A is given by:
μA(x) = 1 if x ∈ A, and μA(x) = 0 if x ∉ A.
Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy set:
1. A fuzzy set is a set whose members have degrees of membership between 0 and 1. Fuzzy sets are represented with the tilde character (~). For example, the number of cars following traffic signals at a particular time, out of all cars present, will have a membership value in [0, 1].
2. Partial membership exists when a member of one fuzzy set can also be a part of other fuzzy sets in the same universe.
3. The degree of membership or truth is not the same as probability; fuzzy truth represents membership in vaguely defined sets.
4. A fuzzy set Ã in the universe of discourse U can be defined as a set of ordered pairs, given by:
Ã = {(x, μÃ(x)) | x ∈ U}, where μÃ(x) is the degree of membership of x in Ã.
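A small R sketch contrasting crisp and fuzzy membership for a set "tall"; the triangular membership function and its cut-off values are illustrative assumptions.

# Crisp vs. fuzzy membership for the set "tall" (heights in cm; cut-offs assumed).
crisp_tall <- function(h) as.numeric(h >= 180)               # 0 or 1 only
fuzzy_tall <- function(h) pmin(pmax((h - 160) / 30, 0), 1)   # degrees in [0, 1]
heights <- c(150, 165, 175, 185, 195)
crisp_tall(heights)   # classical set: full member or non-member
fuzzy_tall(heights)   # fuzzy set: partial membership allowed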
Fuzzy Logic: Extracting Fuzzy Models from Data