Unit 2
Data Analysis I: Introduction to Regression Modeling, Techniques and Applications
Ŷ = Ȳ + (2/3)(X − X̄)
where Ŷ is the expected offspring height, Ȳ is the height of an average-sized offspring, and the term (2/3)(X − X̄) adjusts for how far the parent's height X is from the mean height of parents X̄.
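The regression-toward-the-mean formula above can be evaluated numerically; the heights used below (mean of 68 in, parents at 71 in) are illustrative values, not from the source.

```python
# Galton's regression toward the mean: the expected offspring height moves
# only 2/3 of the way along the parents' deviation from the mean.
def expected_offspring_height(parent_height, mean_height):
    return mean_height + (2 / 3) * (parent_height - mean_height)

# Illustrative values: mean height 68 in, parents 71 in (3 in above the mean).
print(expected_offspring_height(71, 68))  # 70.0 -> only 2 of the 3 extra inches carry over
```

Note that a parent exactly at the mean is predicted to have an offspring exactly at the mean, so the adjustment term vanishes there.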
• Regression Modeling is a statistical method for modeling the relationship between a
dependent (target) variable and one or more independent (predictor) variables.
• More specifically, Regression Modeling helps us understand how the value of the
dependent variable changes with respect to one independent variable while the other
independent variables are held fixed.
• It predicts continuous/real values such as temperature, age, salary, price, etc.
Terminology
• Dependent Variable:
The main factor in regression analysis that we want to predict or understand is
called the dependent variable. It is also called the target variable.
• Independent Variable:
The factors that affect the dependent variable, or that are used to predict its
values, are called independent variables, also known as predictors.
• Outliers:
An outlier is an observation with a very low or very high value in comparison to
the other observed values. An outlier can distort the fitted model, so it should be handled
carefully.
• Multicollinearity:
If the independent variables are highly correlated with each other, the condition is
called multicollinearity. It should not be present in the dataset, because it creates
problems when ranking the most influential variables.
• Underfitting and Overfitting:
If our algorithm works well on the training dataset but not on the test dataset,
the problem is called overfitting.
If our algorithm does not perform well even on the training dataset, the
problem is called underfitting.
• We can understand the concept of regression analysis using the following example:
Example: Suppose a marketing company A runs various advertisements every year and
earns sales from them. The list below shows the advertisement spend by the company over
the last 5 years and the corresponding sales.
• Now the company wants to spend $200 on advertisement in the year 2019 and wants to
predict the sales for that year.
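Since the company's actual advertisement/sales table is not reproduced here, the five (spend, sales) pairs below are made-up placeholders. The sketch shows how an ordinary least-squares line would turn a $200 advertisement spend into a sales prediction.

```python
# Minimal least-squares fit of sales = a + b * ad_spend.
# The five data points are illustrative placeholders, not the company's real figures.
ad_spend = [90, 120, 150, 100, 130]     # advertisement spend per year ($)
sales    = [1000, 1300, 1800, 1200, 1380]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n
# Slope: covariance of (x, y) divided by variance of x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) / \
    sum((x - mean_x) ** 2 for x in ad_spend)
a = mean_y - b * mean_x                  # intercept

predicted_sales = a + b * 200            # prediction for a $200 advertisement in 2019
```

With these placeholder figures the fitted slope is positive, so spending more than any past year predicts sales above any past year.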
• It is mainly used for prediction, forecasting, time-series modeling, and determining
cause-and-effect relationships between variables.
• In Regression, we fit a line or curve that best matches the given data points;
using this fit, the machine learning model can make predictions about the data.
• In simple words, "Regression shows a line or curve through the data points on the
target-predictor graph such that the vertical distance between the data points and
the regression line is minimized."
• Regression is a statistical procedure that determines the equation for the straight line that
best fits a specific set of data.
• The distance between the data points and the line indicates whether the model has
captured a strong relationship or not.
Some scenarios where regression is useful:
• There are various real-world situations where we need future predictions, such as
weather conditions, sales, and marketing trends; in such cases we need a technique
that can make predictions accurately.
• Regression analysis is such a statistical method, and it is widely used in
machine learning and data science.
• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing regression, we can determine the most important factor, the least
important factor, and how each factor affects the dependent variable.
Types of Regression
• There are various types of regressions which are used in data science and machine
learning.
• Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent
variables.
• Here we are discussing some important types of regression which are given below:
▪ Linear Regression
▪ Logistic Regression
▪ Polynomial Regression
▪ Support Vector Regression
▪ Decision Tree Regression
▪ Random Forest Regression
▪ Ridge Regression
▪ Lasso Regression
Techniques for Regression Modeling
3. Gradient Descent
4. Cross-Validation
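Gradient descent, named in the list above, can be sketched for simple linear regression. The data, learning rate, and iteration count below are illustrative choices, not values from the source.

```python
# Gradient descent for simple linear regression: minimize the mean squared
# error of y ≈ w * x + b by repeatedly stepping against the gradient.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]    # generated from y = 2x + 1, so w -> 2, b -> 1
w, b = 0.0, 0.0
lr = 0.05                    # learning rate (step size), an illustrative choice

for _ in range(2000):
    n = len(xs)
    # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b.
    grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b
```

After enough iterations the parameters converge to the line that generated the data; too large a learning rate would make the updates diverge instead.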
UNIT 2
MULTIVARIATE ANALYSIS
AND
BAYESIAN MODELING
While studying the Bayes theorem, we need to understand a few important concepts.
These are as follows:
1. Experiment
• An experiment is defined as a planned operation carried out under controlled
conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
• The results we get during an experiment are called possible outcomes, and the
set of all possible outcomes of an experiment is known as the sample space. For
example, if we are rolling a die, the sample space will be:
• S1 = {1, 2, 3, 4, 5, 6}
• Similarly, if our experiment is tossing a coin and recording its outcome, the
sample space will be:
• S2 = {Head, Tail}
3. Event
• An event is defined as a subset of the sample space of an experiment; in other
words, it is a set of outcomes.
Assume in our die-rolling experiment there are two events A and B such that:
• A = event that an even number is obtained = {2, 4, 6}
• B = event that a number greater than 4 is obtained = {5, 6}
• Probability of event A: P(A) = number of favourable outcomes / total number of
possible outcomes
P(A) = 3/6 = 1/2 = 0.5
• Similarly, probability of event B: P(B) = number of favourable outcomes / total
number of possible outcomes
P(B) = 2/6 = 1/3 ≈ 0.333
• Union of event A and B:
A∪B = {2, 4, 5, 6}
Intersection of event A and B:
A∩B= {6}
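The die-rolling events above map directly onto Python sets, with classical probability as favourable outcomes over total outcomes:

```python
# Die-rolling example expressed with Python sets and exact fractions.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}       # sample space of rolling a die
A = {2, 4, 6}                # event: an even number is obtained
B = {5, 6}                   # event: a number greater than 4 is obtained

def P(E):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(E), len(S))

union = A | B                # A ∪ B = {2, 4, 5, 6}
intersection = A & B         # A ∩ B = {6}
```

Here `P(A)` evaluates to 1/2 and `P(B)` to 1/3, matching the hand calculation above.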
4. Disjoint Event: If the intersection of events A and B is the empty set, the events are
known as disjoint events, also called mutually exclusive events.
5. Exhaustive Event: As the name suggests, a set of events is exhaustive if at least one of
them must occur in any trial, i.e., their union is the entire sample space. For example, while
tossing a coin, the outcome is definitely either a Head or a Tail, so these two events are
exhaustive (and, in this case, also mutually exclusive).
6. Independent Event:
• Two events are said to be independent when the occurrence of one event does not affect
the occurrence of the other. In simple words, the probability of the outcome of one event
does not depend on the other.
Mathematically, two events A and B are said to be independent if:
P(A ∩ B) = P(AB) = P(A) * P(B)
7. Conditional Probability: Conditional probability is defined as the probability of an event A,
given that another event B has already occurred (i.e., A conditioned on B). This is represented
by P(A|B), and for P(B) > 0 we define it as:
• P(A|B) = P(A ∩ B) / P(B)
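For the die events A = {2, 4, 6} and B = {5, 6}, the conditional probability works out exactly, and the independence condition from the previous point can be checked at the same time:

```python
# Conditional probability P(A|B) = P(A ∩ B) / P(B) for the die events.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                     # even number
B = {5, 6}                        # greater than 4

def P(E):
    return Fraction(len(E), len(S))

p_a_given_b = P(A & B) / P(B)     # (1/6) / (1/3) = 1/2

# Independence check: P(A ∩ B) = 1/6 and P(A) * P(B) = (1/2)*(1/3) = 1/6,
# so these two particular events happen to be independent.
independent = P(A & B) == P(A) * P(B)
```

Note that P(A|B) = 1/2 equals P(A) here, which is exactly what independence means: learning that B occurred does not change the probability of A.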
8. Marginal Probability:
• Marginal probability is defined as the probability of an event A occurring
irrespective of whether another event B occurs. It is also considered the probability
of the evidence under consideration. Writing ~B for the event that B does not occur,
the law of total probability gives:
• P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
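The marginal-probability formula feeds directly into Bayes' rule. The numbers below are illustrative (a diagnostic test with 90% sensitivity and a 5% false-positive rate, for a condition with 1% prevalence), not figures from the source:

```python
# Law of total probability and Bayes' rule with illustrative numbers.
p_b = 0.01                       # P(B): prior probability of the condition
p_a_given_b = 0.90               # P(A|B): probability of a positive test given the condition
p_a_given_not_b = 0.05           # P(A|~B): false-positive rate

# Marginal probability of a positive test: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
```

Even with a fairly accurate test, the posterior P(B|A) comes out to only about 15% because the condition is rare, which is why the marginal P(A) in the denominator matters.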
How to apply Bayes Theorem or Bayes rule in
Machine Learning?
• Bayes theorem helps us calculate the single term P(B|A) in terms of P(A|B),
P(B), and P(A). This rule is very helpful in scenarios where we have good estimates
of P(A|B), P(B), and P(A) and need to determine the fourth term.
• The Naïve Bayes classifier is one of the simplest applications of Bayes theorem;
it is used in classification algorithms and is valued for its accuracy, speed, and
simplicity.
• Let's understand the use of Bayes theorem in machine learning with the example
below.
• Suppose we have a vector A with i attributes. That is:
• A = A1, A2, A3, A4, …, Ai
• Further, we have n classes represented as C1, C2, C3, C4, …, Cn.
• These two pieces of information are given to us, and our classifier, built with machine
learning, has to predict the best possible class for A. With the help of Bayes theorem, we
can write this as:
P(Ci|A) = [ P(A|Ci) * P(Ci) ] / P(A)
Here:
• P(A) is the class-independent term (the evidence).
• P(A) remains constant across classes; its value does not change as the class changes.
Therefore, to maximize P(Ci|A), we only have to maximize the term P(A|Ci) * P(Ci).
• With n classes in play, let's assume every class is a priori equally likely to be
the right answer. Under this assumption:
P(C1) = P(C2) = P(C3) = P(C4) = … = P(Cn)
This assumption reduces both computation cost and time. This is how Bayes theorem plays a
significant role in machine learning, and the Naïve Bayes (conditional independence)
assumption simplifies the conditional probability computation without greatly affecting
precision. Under this assumption:
P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * … * P(An|Ci)
Hence, by using Bayes theorem in machine learning, we can express the probability of a
complex event in terms of the probabilities of smaller, simpler events.
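The derivation above can be sketched as a tiny classifier that picks the class Ci maximizing P(Ci) * ∏ P(Ak|Ci). The weather/play dataset and the fixed smoothing constant are illustrative simplifications, not from the source:

```python
# Minimal Naïve Bayes classifier over categorical attributes.
from collections import Counter, defaultdict

rows = [  # (attributes, class) — a toy, made-up dataset
    (("sunny", "hot"),  "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(label for _, label in rows)
# attr_counts[class][attribute position][value] = number of occurrences
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in rows:
    for k, value in enumerate(attrs):
        attr_counts[label][k][value] += 1

def predict(attrs):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(rows)                        # prior P(Ci)
        for k, value in enumerate(attrs):
            # Add-one smoothing with a fixed constant 2 in the denominator
            # (a simplification) so unseen values do not zero out the product.
            score *= (attr_counts[c][k][value] + 1) / (n_c + 2)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because P(A) is the same for every class, the classifier never computes it; comparing the numerators P(Ci) * ∏ P(Ak|Ci) is enough, exactly as argued above.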
School of Computing Science and Engineering
Course Code : R1UC402T Course Name : DATA ANALYTICS
UNIT -2
UNDERSTANDING SUPPORT VECTOR MACHINE