Unit 2

The document provides an overview of regression modeling and multivariate analysis, detailing the relationship between dependent and independent variables, as well as various regression techniques such as linear and logistic regression. It also introduces multivariate analysis techniques, distinguishing between dependence and interdependence methods, and explains the application of Bayes' theorem in machine learning for predicting probabilities. Key concepts such as outliers, multicollinearity, and overfitting are discussed, along with examples to illustrate the practical applications of these statistical methods.

School of Computing Science and Engineering

Course Code : R1UC402T Course Name: Data Analytics

Unit 2
Data Analysis I : Introduction to Regression Modeling ,
Techniques and Applications

Name of the Faculty: Dr. Saurabh Raj Sangwan Program Name:


B.Tech (IV Sem)
Regression Modeling
✓ The process of fitting a line to data.
✓ Sir Francis Galton (1822–1911), a British anthropologist and meteorologist, coined the term "regression".
✓ "Regression towards mediocrity in hereditary stature": the tendency of offspring to be smaller than large parents and larger than small parents. This is referred to as "regression towards the mean".

Galton's formula for the expected height of offspring:

Ŷ = Ȳ + (2/3)(X − X̄)

where Ŷ is the expected offspring height, Ȳ is the average offspring height, X is the parent's height, and X̄ is the mean height of parents. The factor 2/3 adjusts for how far the parent is from the mean: unusually tall or short parents tend to have closer-to-average-sized offspring.
• Regression Modeling is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• More specifically, Regression Modeling helps us to understand how the value of the
dependent variable is changing corresponding to an independent variable when other
independent variables are held fixed.
• It predicts continuous/real values such as temperature, age, salary, price, etc.
Terminology

• Dependent Variable:
The main factor in Regression analysis which we want to predict or understand is
called the dependent variable. It is also called target variable.

• Independent Variable:
The factors which affect the dependent variables or which are used to predict the
values of the dependent variables are called independent variable, also called as a
predictor.

• Outliers:
An outlier is an observation with a very low or very high value in comparison to the other observed values. An outlier may distort the results, so it should be identified and handled carefully.

• Multicollinearity:
If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting:
If our algorithm works well on the training dataset but poorly on the test dataset, the problem is called overfitting.
If our algorithm does not perform well even on the training dataset, the problem is called underfitting.
• We can understand the concept of regression analysis using the below example:

Example: Suppose a marketing company A runs various advertisements every year and earns sales from them. The list below shows the advertising spend by the company over the last 5 years and the corresponding sales:

• Now, the company wants to spend $200 on advertising in the year 2019 and wants to predict the sales for that year.

• So, to solve such prediction problems in machine learning, we need regression analysis.
• Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.

• It is mainly used for prediction, forecasting, time series modeling, and determining
the causal-effect relationship between variables.

• In regression, we plot the line or curve which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data.

• In simple words, regression finds a line or curve on the target–predictor graph such that the vertical distance between the datapoints and the regression line is minimized.

• Regression is a statistical procedure that determines the equation for the straight line that
best fits a specific set of data.

• The distance between the datapoints and the line indicates whether the model has captured a strong relationship or not.
Some examples of regression can be as:

• Prediction of rain using temperature and other factors

• Determining Market trends

• Prediction of road accidents due to rash driving.


Why do we use Regression Analysis?

• As mentioned earlier, regression analysis helps in the prediction of a continuous variable.

• There are various real-world scenarios where we need future predictions, such as weather conditions, sales, or market trends; for such cases we need a technique which can make predictions accurately.

• Regression analysis is such a technique: a statistical method used in machine learning and data science.

• Below are some other reasons for using Regression analysis:

• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the outcome.
Types of Regression

• There are various types of regressions which are used in data science and machine
learning.

• Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable.

• Here we are discussing some important types of regression which are given below:

▪ Linear Regression
▪ Logistic Regression
▪ Polynomial Regression
▪ Support Vector Regression
▪ Decision Tree Regression
▪ Random Forest Regression
▪ Ridge Regression
▪ Lasso Regression
Techniques for Regression Modeling

1. Ordinary Least Squares (OLS)

2. Maximum Likelihood Estimation (MLE)

3. Gradient Descent

4. Cross-Validation

5. Regularization Techniques (Ridge, Lasso)
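As a minimal sketch of the first technique, ordinary least squares for a single predictor can be written in a few lines of NumPy. The advertising-spend/sales figures below are invented for illustration (the document does not give the actual table), and the prediction step mirrors the "$200 advertisement" example above:

```python
import numpy as np

# Hypothetical advertising-spend vs. sales data (illustrative values only).
x = np.array([90.0, 120.0, 150.0, 100.0, 130.0])        # advertisement spend
y = np.array([1000.0, 1300.0, 1800.0, 1200.0, 1380.0])  # sales

# Ordinary least squares for the line y = b0 + b1*x:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predict sales for a $200 advertisement, as in the example above.
pred = b0 + b1 * 200.0
print(round(b1, 3), round(b0, 3), round(pred, 2))
```

The same closed-form solution is what `np.polyfit(x, y, 1)` computes internally for a degree-1 fit.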


School of Computing Science and Engineering
Course Code : …………….. Course Name: Data Analytics

UNIT 2
MULTIVARIATE ANALYSIS
AND
BAYESIAN MODELING

Name of the Faculty: Ms.KIRTI Program Name: BTech


Multivariate Analysis
There are many different techniques for multivariate analysis, and they can be divided into two
categories:
• Dependence techniques
• Interdependence techniques
Multivariate analysis techniques: Dependence vs. interdependence
When we use the terms “dependence” and “interdependence,” we’re referring to different types of
relationships within the data. To give a brief explanation:
• Dependence methods
• Dependence methods are used when one or some of the variables are dependent on others.
Dependence looks at cause and effect;
• In other words, can the values of two or more independent variables be used to explain, describe, or
predict the value of another, dependent variable? To give a simple example, the dependent variable
of “weight” might be predicted by independent variables such as “height” and “age.”
• In machine learning, dependence techniques are used to build predictive models. The analyst enters
input data into the model, specifying which variables are independent and which ones are
dependent—in other words, which variables they want the model to predict, and which variables
they want the model to use to make those predictions.
Cont…..
Interdependence methods
• Interdependence methods are used to understand the structural
makeup and underlying patterns within a dataset. In this case, no
variables are dependent on others, so you’re not looking for causal
relationships. Rather, interdependence methods seek to give meaning
to a set of variables or to group them together in meaningful ways.
• So: One is about the effect of certain variables on others, while the
other is all about the structure of the dataset.
Cont……..
Some useful multivariate analysis techniques are:
• Multiple linear regression
• Multiple logistic regression
• Multivariate analysis of variance (MANOVA)
• Factor analysis
• Cluster analysis
Multiple linear regression

Multiple linear regression is a dependence method which looks at the relationship between one dependent variable and two or more independent variables. A multiple regression model will tell you the extent to which each independent variable has a linear relationship with the dependent variable. This is useful as it helps you to understand which factors are likely to influence a certain outcome, allowing you to estimate future outcomes.
Example of multiple regression:
As a data analyst, you could use multiple regression to predict crop growth. In this
example, crop growth is your dependent variable and you want to see how
different factors affect it. Your independent variables could be rainfall,
temperature, amount of sunlight, and amount of fertilizer added to the soil. A
multiple regression model would show you the proportion of variance in crop
growth that each independent variable accounts for.
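The crop-growth example can be sketched with NumPy's least-squares solver. All the rainfall, temperature, sunlight, fertilizer, and growth values below are invented assumptions, not data from the text:

```python
import numpy as np

# Hypothetical crop-growth data (illustrative values only).
# Columns: rainfall (mm), temperature (°C), sunlight (h/day), fertilizer (kg)
X = np.array([
    [120.0, 22.0, 7.5, 3.0],
    [ 80.0, 25.0, 8.0, 2.0],
    [150.0, 20.0, 6.5, 4.0],
    [100.0, 23.0, 7.0, 2.5],
    [130.0, 21.0, 7.8, 3.5],
    [ 90.0, 24.0, 8.2, 2.2],
])
y = np.array([30.0, 24.0, 33.0, 26.0, 31.0, 25.0])  # crop growth (cm)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# coef[0] is the intercept; coef[1:] are the per-variable slopes.
print(np.round(coef, 3))
```

Each entry of `coef[1:]` is the estimated change in crop growth per unit change in that predictor, holding the others fixed.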
Multiple logistic regression

Logistic regression analysis is used to calculate (and predict) the probability of a binary event occurring. A binary outcome is one where there are only two possible outcomes; either the event occurs (1) or it doesn't (0). So, based on a set of independent variables, logistic regression can predict how likely it is that a certain scenario will arise. It is also used for classification.
Example of logistic regression:
Let’s imagine you work as an analyst within the insurance sector and you
need to predict how likely it is that each potential customer will make a
claim. You might enter a range of independent variables into your model,
such as age, whether or not they have a serious health condition, their
occupation, and so on. Using these variables, a logistic regression analysis
will calculate the probability of the event (making a claim) occurring.
Another cited example is the filters used to classify email as “spam” or “not
spam.”
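The claim-prediction example can be sketched as follows. The intercept and weights are hand-picked hypothetical values, not fitted coefficients; a real model would estimate them from historical claims data:

```python
import math

# A minimal logistic-regression sketch with hypothetical coefficients.
def claim_probability(age, has_condition, intercept=-4.0,
                      w_age=0.05, w_condition=1.5):
    """Probability that a customer makes a claim (binary outcome)."""
    z = intercept + w_age * age + w_condition * has_condition
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z into (0, 1)

p_young = claim_probability(age=25, has_condition=0)
p_older = claim_probability(age=60, has_condition=1)
print(round(p_young, 3), round(p_older, 3))
```

Classification follows by thresholding the probability, e.g. predicting "will claim" whenever the output exceeds 0.5.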
Multivariate analysis of variance (MANOVA)

• Multivariate analysis of variance (MANOVA) is used to measure the effect of multiple independent variables on two or more dependent variables.
With MANOVA, it’s important to note that the independent variables are
categorical, while the dependent variables are metric in nature. A
categorical variable is a variable that belongs to a distinct category—for
example, the variable “employment status” could be categorized into
certain units, such as “employed full-time,” “employed part-time,”
“unemployed,” and so on. A metric variable is measured quantitatively
and takes on a numerical value.
• In MANOVA analysis, you’re looking at various combinations of the
independent variables to compare how they differ in their effects on the
dependent variable.
Example of MANOVA:

Let's imagine you work for an engineering company that is on a mission to build a super-fast, eco-friendly rocket. You could use MANOVA to measure the effect that various design combinations have on both the speed of the rocket and the amount of carbon dioxide it emits. In this scenario, your categorical independent variables could be:
Engine type, categorized as E1, E2, or E3
Material used for the rocket exterior, categorized as M1, M2, or M3
Type of fuel used to power the rocket, categorized as F1, F2, or F3
Your metric dependent variables are speed in kilometers per hour, and
carbon dioxide measured in parts per million. Using MANOVA, you’d test
different combinations (e.g. E1, M1, and F1 vs. E1, M2, and F1, vs. E1, M3,
and F1, and so on) to calculate the effect of all the independent variables.
This should help you to find the optimal design solution for your rocket.
Factor analysis

• Factor analysis is an interdependence technique which seeks to reduce the number of variables in a dataset. If you have too many variables, it can be difficult to find patterns in your data. At the same time, models created using datasets with too many variables are susceptible to overfitting. Overfitting is a modeling error that occurs when a model fits too closely and specifically to a certain dataset, making it less generalizable to future datasets, and thus potentially less accurate in the predictions it makes.
• Factor analysis works by detecting sets of variables which correlate
highly with each other. These variables may then be condensed into a
single variable. Data analysts will often carry out factor analysis to
prepare the data for subsequent analyses.
Example:
• Let’s imagine you have a dataset containing data pertaining to a
person’s income, education level, and occupation. You might find a
high degree of correlation among each of these variables, and thus
reduce them to the single factor “socioeconomic status.” You might
also have data on how happy they were with customer service, how
much they like a certain product, and how likely they are to
recommend the product to a friend. Each of these variables could be
grouped into the single factor “customer satisfaction” (as long as they
are found to correlate strongly with one another). Even though you’ve
reduced several data points to just one factor, you’re not really losing
any information—these factors adequately capture and represent the
individual variables concerned. With your “streamlined” dataset,
you’re now ready to carry out further analyses.
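A rough illustration of the idea (not a full factor analysis, which would use an eigendecomposition of the correlation matrix): detect the high correlations among hypothetical income, education, and occupation values, then condense them into a single "socioeconomic status" score. All numbers are invented and deliberately correlated:

```python
import numpy as np

# Hypothetical data for eight people (illustrative, deliberately correlated).
income     = np.array([30, 45, 52, 60, 75, 80, 95, 110], dtype=float)
education  = np.array([10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
occupation = np.array([2, 3, 3, 4, 5, 5, 6, 7], dtype=float)

# Factor analysis starts from the correlation structure; highly correlated
# variables are candidates for a single factor ("socioeconomic status").
data = np.vstack([income, education, occupation])
corr = np.corrcoef(data)

# A crude stand-in for the factor: the average of the standardized variables.
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
ses_factor = z.mean(axis=0)
print(np.round(corr, 2))
```

When the off-diagonal correlations are near 1, the single score `ses_factor` retains most of the information carried by the three original variables.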
Cluster analysis

• Another interdependence technique, cluster analysis is used to group similar items within a dataset into clusters.
• When grouping data into clusters, the aim is for the variables in one
cluster to be more similar to each other than they are to variables in
other clusters. This is measured in terms of intracluster and
intercluster distance. Intracluster distance looks at the distance
between data points within one cluster. This should be small.
Intercluster distance looks at the distance between data points in
different clusters. This should ideally be large. Cluster analysis helps
you to understand how data in your sample is distributed, and to find
patterns.
Example:
• A prime example of cluster analysis is audience segmentation. If you
were working in marketing, you might use cluster analysis to define
different customer groups which could benefit from more targeted
campaigns. As a healthcare analyst, you might use cluster analysis to
explore whether certain lifestyle factors or geographical locations are
associated with higher or lower cases of certain illnesses. Because it’s
an interdependence technique, cluster analysis is often carried out in
the early stages of data analysis.
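The intracluster/intercluster idea can be sketched with a minimal k-means implementation in NumPy. The 2-D points are invented so that two well-separated groups exist; this is a sketch of the classic algorithm, not a production clustering routine (it has no handling for empty clusters, for instance):

```python
import numpy as np

# A minimal k-means sketch; data and k are illustrative assumptions.
def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (small intracluster distance).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated groups of 2-D points (large intercluster distance).
pts = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 5.0], [5.1, 4.9]])
labels, centers = kmeans(pts, k=2)
print(labels)
```

After convergence, points within one group share a label and the two group labels differ, which is exactly the small-intracluster / large-intercluster structure described above.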
Bayes Theorem in Machine Learning
Introduction:
❑Bayes' theorem is named after the English statistician, philosopher, and Presbyterian minister Thomas Bayes (18th century).
❑Bayes presented his ideas on decision theory, which are extensively used in important mathematical concepts such as probability.
❑Bayes' theorem is also widely used in machine learning, where we need to predict classes precisely and accurately.
❑The Bayesian method, built on Bayes' theorem, is used to calculate conditional probabilities in machine learning applications, including classification tasks.
❑Further, a simplified version of Bayes' theorem (Naïve Bayes classification) is used to reduce computation time and cost.
❑Bayes' theorem is also extensively applied in healthcare and medicine, the research and survey industry, the aeronautical sector, etc.
Cont……….
• Bayes' theorem is one of the most popular machine learning concepts; it helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.
• Bayes' theorem can be derived using the product rule and the conditional probability of event X with known event Y:
• According to the product rule, we can express the probability of events X and Y occurring together as follows:
P(X ∩ Y) = P(X|Y) P(Y)   {equation 1}
• Further, expressing the same joint probability through event X known:
P(X ∩ Y) = P(Y|X) P(X)   {equation 2}
• Since both left-hand sides are equal, equating the right-hand sides and dividing by P(Y) gives:

P(X|Y) = P(Y|X) P(X) / P(Y)

• Note that X and Y need not be independent; Bayes' theorem is most useful precisely when the occurrence of one event changes the probability of the other.
• The above equation is called Bayes' Rule or Bayes' Theorem.
Cont…….
• The Formula : P(X|Y) = P(Y|X) P(X) / P(Y)

• P(X|Y) is called the posterior, which we need to calculate. It is the updated probability after considering the evidence.
• P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
• P(X) is called the prior probability: the probability of the hypothesis before considering the evidence.
• P(Y) is called the marginal probability. It is the probability of the evidence under any consideration.
Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
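The posterior = likelihood × prior / evidence relation can be checked numerically. The diagnostic-test numbers below are hypothetical, chosen only to illustrate the calculation:

```python
# Bayes' rule: posterior = likelihood * prior / evidence.
# Hypothetical numbers for a diagnostic-test example (illustrative only).
prior = 0.01            # P(X): patient has the disease
likelihood = 0.95       # P(Y|X): test positive given disease
false_pos = 0.05        # P(Y|~X): test positive given no disease

# Marginal probability of the evidence, P(Y), via total probability:
evidence = likelihood * prior + false_pos * (1 - prior)

posterior = likelihood * prior / evidence  # P(X|Y)
print(round(posterior, 4))
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is the classic lesson of applying Bayes' rule.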
Prerequisites for Bayes Theorem

While studying the Bayes theorem, we need to understand few important concepts.
These are as follows:
1. Experiment
• An experiment is a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
• The results we get during an experiment are called possible outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:
• S1 = {1, 2, 3, 4, 5, 6}
• Similarly, if our experiment is tossing a coin and recording its outcome, the sample space will be:
• S2 = {Head, Tail}
Cont……….
3. Event
• An event is defined as a subset of the sample space of an experiment; it is a set of outcomes.
Assume in our experiment of rolling a die, there are two events A and B such that:
• A = Event that an even number is obtained = {2, 4, 6}
• B = Event that a number greater than 4 is obtained = {5, 6}
• Probability of event A: P(A) = Number of favourable outcomes / Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
• Similarly, probability of event B: P(B) = Number of favourable outcomes / Total number of possible outcomes
P(B) = 2/6 = 1/3 = 0.333
• Union of event A and B:
A∪B = {2, 4, 5, 6}
Cont……
Intersection of events A and B:
A∩B = {6}
4. Disjoint Event: If the intersection of events A and B is the empty set, the events are known as disjoint events, also called mutually exclusive events.
5. Exhaustive Event: As the name suggests, a set of events is exhaustive if at least one of them must occur in a trial of the experiment. For example, while tossing a coin, the outcome must be either Head or Tail, so {Head, Tail} is exhaustive (here the two events also happen to be mutually exclusive).
6. Independent Event:
• Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other; the probability of the outcome of one does not depend on the other.
Mathematically, two events A and B are independent if:
P(A ∩ B) = P(AB) = P(A)*P(B)
7. Conditional Probability: Conditional probability is defined as the probability of an event A, given that another event B has already occurred. This is represented by P(A|B) and defined as:
• P(A|B) = P(A ∩ B) / P(B)
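The dice events above can be verified directly with Python sets and exact fractions from the standard library:

```python
from fractions import Fraction

# The dice events from the text: A = even number, B = number greater than 4.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}

p = lambda event: Fraction(len(event), len(sample_space))

p_union = p(A | B)            # P(A ∪ B) = 4/6 = 2/3
p_inter = p(A & B)            # P(A ∩ B) = 1/6
p_a_given_b = p_inter / p(B)  # P(A|B) = (1/6) / (2/6) = 1/2
print(p_union, p_inter, p_a_given_b)
```

Using `Fraction` keeps the probabilities exact, so the results match the hand calculations in the text with no rounding.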
Cont……
8. Marginal Probability:
• Marginal probability is the probability of an event A occurring irrespective of any other event B; it is the probability of the evidence under any consideration. Here ~B represents the event that B does not occur.
• P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
How to apply Bayes Theorem or Bayes rule in
Machine Learning?
• Bayes' theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very helpful in scenarios where we know three of these quantities and need to determine the fourth.
• The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is used in classification algorithms and is valued for its accuracy and speed.
• Let's understand the use of Bayes' theorem in machine learning with the example below.
• Suppose we have a feature vector A with i attributes:
• A = A1, A2, A3, A4, …, Ai
• Further, we have n classes represented as C1, C2, C3, C4…………Cn.
Cont….
• Given these two conditions, our machine learning classifier has to predict the class of A, choosing the best possible class. With the help of Bayes' theorem, we can write:
P(Ci|A) = [ P(A|Ci) * P(Ci) ] / P(A)
Here:
• P(A) is the class-independent term.
• P(A) remains constant across the classes: it does not change its value with respect to a change in class. So, to maximize P(Ci|A), we only have to maximize the term P(A|Ci) * P(Ci).
• With n classes on the probability list, let's assume that each class is equally likely to be the right answer. Under this assumption, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = … = P(Cn)
This assumption reduces the computation cost as well as time. This is how Bayes' theorem plays a significant role in machine learning, and the Naïve Bayes classifier further simplifies the conditional-probability task, without greatly affecting precision, by assuming the attributes are independent given the class:
P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * … * P(Ai|Ci)
Hence, by using Bayes' theorem in machine learning we can describe the probability of a compound event through the probabilities of smaller events.
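A tiny Naïve Bayes classifier following this derivation might look as below. The weather-style training data is a hypothetical toy example, and for simplicity no smoothing is applied (a zero count zeroes out the whole product, which real implementations avoid with Laplace smoothing):

```python
from collections import Counter, defaultdict

# Toy training set: (attributes A1..Ai, class C). Hypothetical data.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "mild"), "yes"),
]

class_counts = Counter(c for _, c in train)
attr_counts = defaultdict(Counter)  # (position, class) -> value counts
for attrs, c in train:
    for i, v in enumerate(attrs):
        attr_counts[(i, c)][v] += 1

def predict(attrs):
    # Score each class by P(Ci) * product of P(Ak|Ci), as derived above.
    scores = {}
    for c, n in class_counts.items():
        score = n / len(train)  # prior P(Ci)
        for i, v in enumerate(attrs):
            score *= attr_counts[(i, c)][v] / n  # likelihood P(Ak|Ci)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(("rainy", "mild")))
```

Because P(A) is the same for every class, the code compares only the numerators P(A|Ci)·P(Ci), exactly as argued in the text.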
School of Computing Science and Engineering
Course Code : R1UC402T Course Name : DATA ANALYTICS

UNIT -2
UNDERSTANDING SUPPORT VECTOR MACHINE

Name of the Faculty: RAJ KUMAR PARIDA Program Name: B.Tech(CSE)

