Regression Analysis in Machine Learning: Temperature, Age, Salary, Price
We can understand the concept of regression analysis using the example below:
Example: Suppose there is a marketing company A that runs various advertisements every year and earns sales from them. Records of the advertisements made by the company over the last five years and the corresponding sales are available.
Now the company wants to spend $200 on advertisement in the year 2019 and wants to predict the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining cause-and-effect relationships between variables.
In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
Why do we use Regression Analysis?
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the other factors.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Logistic Regression:
o Logistic regression is another supervised learning algorithm, used to solve classification problems. In classification problems, we have a dependent variable in a binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function, also called the logistic function, to model the data. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
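To make the sigmoid idea concrete, here is a minimal Python sketch of logistic regression. The scikit-learn library and the pass/fail toy data are assumptions for illustration, not part of the original notes.

```python
# Minimal logistic regression sketch (scikit-learn assumed; data is hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    """The logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical binary data: hours studied -> pass (1) or fail (0).
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The model applies the sigmoid to a linear score w*x + b.
z = model.decision_function([[4.5]])
print(sigmoid(z))                          # probability of class 1
print(model.predict_proba([[4.5]])[:, 1])  # same value via the library
```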
Polynomial Regression:
o Suppose there is a dataset in which the datapoints are arranged in a non-linear fashion; in such a case, linear regression will not fit those datapoints well. To cover such datapoints, we need polynomial regression.
o In polynomial regression, the original features are transformed into polynomial features of a given degree and then modeled using a linear model. This means the datapoints are best fitted using a polynomial curve.
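Below is a minimal sketch of the transform-then-fit idea described above; scikit-learn and the roughly quadratic toy data are assumptions for illustration.

```python
# Minimal polynomial regression sketch (scikit-learn assumed; data is made up).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical non-linear data: y roughly follows a quadratic curve.
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.9, 10.2, 17.1, 26.3, 37.0])

# Transform features into degree-2 polynomial features, then fit a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[7]]))  # prediction from the fitted polynomial curve
```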
Decision Tree Regression:
o Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own child nodes, themselves becoming the parent nodes of those nodes. A sketch of decision tree regression follows below.
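The following minimal sketch fits a shallow regression tree on toy data; scikit-learn and the step-like data are assumptions for illustration.

```python
# Minimal decision tree regression sketch (scikit-learn assumed; toy data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([3.0, 3.2, 7.8, 8.1, 8.0, 15.5, 15.9, 16.2])

# Each internal node tests a feature threshold; each leaf stores a predicted value.
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)
print(tree.predict([[4.5]]))  # prediction = mean of training targets in the matched leaf
```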
Univariate data:
Univariate data refers to a type of data in which each observation or data point corresponds
to a single variable. In other words, it involves the measurement or observation of a single
characteristic or attribute for each individual or item in the dataset. Analyzing univariate
data is the simplest form of analysis in statistics.
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 18
Suppose that the heights of seven students in a class are recorded (above). There is only one variable, height, and it does not deal with any cause or relationship.
Key points in Univariate analysis:
1. No Relationships: Univariate analysis focuses solely on describing and summarizing
the distribution of the single variable. It does not explore relationships between
variables or attempt to identify causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of central
tendency (mean, median, mode) and measures of dispersion (range, standard deviation),
are commonly used in the analysis of univariate data.
3. Visualization: Histograms, box plots, and other graphical representations are often used
to visually represent the distribution of the single variable.
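As a small illustration of point 2 above, the sketch below computes the usual descriptive statistics for the heights data with Python's standard statistics module (the truncated final value from the table is omitted here).

```python
# Descriptive statistics for the univariate heights data (one variable only).
import statistics

heights = [164, 167.3, 170, 174.2, 178, 180]

print("mean:  ", statistics.mean(heights))     # central tendency
print("median:", statistics.median(heights))
print("range: ", max(heights) - min(heights))  # dispersion
print("stdev: ", statistics.stdev(heights))    # sample standard deviation
```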
Bivariate data
Bivariate data involves two different variables, and the analysis of this type of data focuses
on understanding the relationship or association between these two variables. Example of
bivariate data can be temperature and ice cream sales in summer season.
Temperature    Ice cream sales
20             2000
25             2500
35             5000
Suppose the temperature and ice cream sales are the two variables of a bivariate dataset (above table). Here, the relationship is visible from the table: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase.
Key points in Bivariate analysis:
1. Relationship Analysis: The primary goal of analyzing bivariate data is to understand
the relationship between the two variables. This relationship could be positive (both
variables increase together), negative (one variable increases while the other decreases),
or show no clear pattern.
2. Scatterplots: A common visualization tool for bivariate data is a scatterplot, where
each data point represents a pair of values for the two variables. Scatterplots help
visualize patterns and trends in the data.
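The sketch below plots the temperature/sales table above as a scatterplot and computes the correlation; numpy and matplotlib are assumed to be available.

```python
# Scatterplot and correlation for the bivariate temperature/sales data above.
import numpy as np
import matplotlib.pyplot as plt

temperature = np.array([20, 25, 35])
sales = np.array([2000, 2500, 5000])

# Pearson correlation close to +1 indicates a strong positive relationship.
print(np.corrcoef(temperature, sales)[0, 1])

plt.scatter(temperature, sales)
plt.xlabel("Temperature")
plt.ylabel("Ice cream sales")
plt.show()
```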
Multivariate data
Multivariate data refers to datasets where each observation or sample point consists of
multiple variables or features. These variables can represent different aspects,
characteristics, or measurements related to the observed phenomenon. When dealing with
three or more variables, the data is specifically categorized as multivariate.
An example of this type of data: suppose an advertiser wants to compare the popularity of four advertisements on a website.
Advertisement Gender Click rate
Ad1 Male 80
Ad3 Female 55
Ad1 Male 66
Ad3 Male 35
The click rates could be measured for both men and women, and relationships between the variables can then be examined. Multivariate data is similar to bivariate data but contains more than one dependent variable.
1. Data Modelling
Data modelling is a very common terminology in software engineering and other IT disciplines. It has many interpretations and definitions depending on the field in question. In data science, data modelling is the process of finding the function by which the data was generated. In this context, data modelling is the goal of any data analysis task. For instance, if you have a 2D dataset and you find that the two variables are linearly correlated, you may decide to model it using linear regression, as in the sketch below.
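As a minimal sketch of this idea, the snippet below generates synthetic linearly correlated data and recovers the generating function with a least-squares line fit; the data and coefficients are made up for illustration.

```python
# Modelling linearly correlated 2D data with a least-squares line (numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)  # linear trend + noise

# Fit y = slope * x + intercept: the "function by which the data was generated".
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # should recover roughly 3.0 and 2.0
```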
2. Bayesian Data Modelling
Bayesian data modelling means modelling your data using Bayes' Theorem. Let us revisit Bayes' Rule:
P(H|E) = [P(E|H) × P(H)] / P(E)
In the above equation, H is the hypothesis and E is the evidence. In the real world however,
we understand Bayesian components differently! The evidence is usually expressed by data,
and the hypothesis reflects the expert’s prior estimation of the posterior. Therefore, we can
re-write the Bayes Rule to be:
P(posterior) = [P(data|θ) × P(prior)] / P(data)
In the above definition we learned about the prior, the posterior, and the data, but what about the θ parameter? θ is the set of coefficients that best define the data. You may think of θ as the slope and intercept of your linear regression equation, or the vector of coefficients w in your polynomial regression function. As you see in the above equation, θ is the single missing parameter, and the goal of Bayesian modelling is to find it.
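A minimal sketch of finding θ follows, using grid approximation for a single parameter (the bias of a coin). The flip counts and the flat prior are assumptions for illustration.

```python
# Grid-approximation sketch of a Bayesian posterior over a single parameter θ.
import numpy as np

heads, tails = 7, 3                      # hypothetical observed data
theta = np.linspace(0.001, 0.999, 999)   # candidate values of θ
prior = np.ones_like(theta)              # flat prior P(θ)

# Likelihood P(data|θ) for independent coin flips.
likelihood = theta**heads * (1 - theta)**tails

# Posterior ∝ likelihood × prior; normalizing plays the role of P(data).
posterior = likelihood * prior
posterior /= posterior.sum()

print("most probable θ:", theta[np.argmax(posterior)])  # ≈ 0.7
```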
One way to choose a prior distribution is to learn how the data has developed over time. If you have large portions of data, you may visualize them, try to detect patterns in how they evolve over time, and select your probability distribution accordingly.
Bayesian modeling is able to incorporate prior knowledge into the model. In environmental health, this can be used to inform the model with information from previous studies, such as the previously estimated toxicities of certain pollutants. This allows for predictions that incorporate previous work, all while taking into account the uncertainty of these associations.
A particularly powerful advantage is Bayesian modeling’s ability to incorporate uncertainty.
In environmental health, that may include uncertainty in the exposure, or prior knowledge
about the association with the outcome. This approach incorporates model uncertainty, which
can help estimate the probability of a hypothesis being correct. There are many other benefits,
too, such as its flexibility in dealing with missing data.
Finally, Bayesian modeling is a powerful tool for decision-making. It can be used to inform
policy decisions by providing a quantitative assessment of a variety of complex risks
associated with exposure to pollutants.
While Bayesian modeling has become ascendant in environmental health sciences, particularly in the last decade, the theory underlying it is anything but new. In fact, Bayes' theorem, which describes how to update the probability of a hypothesis as new evidence becomes available, is named for Reverend Thomas Bayes, an 18th-century statistician and theologian, who first described the theorem in a paper published posthumously way back in 1763.
What is the Bayesian Model Selection?
Bayesian Model Selection is a probabilistic approach used in statistics and machine learning
to compare and choose between different statistical models. This method is based on the
principles of Bayesian statistics, which provide a systematic framework for updating beliefs
in light of new evidence.
Bayesian Inference
Bayesian inference is a statistical method for updating beliefs about unknown parameters
using observed data and prior knowledge. It’s based on Bayes’ theorem:
P(θ|D) = [P(D|θ) × P(θ)] / P(D)
Here,
P(θ|D) is the posterior probability of the parameter θ given data D.
P(D|θ) is the likelihood of data D given θ.
P(θ) is the prior probability of θ.
P(D) is the marginal likelihood of the data.
So basically, we update our belief about θ based on new evidence, the data D. The likelihood P(D|θ) measures how probable the data is under certain parameter values. The prior P(θ) represents our initial belief about θ before seeing the data. We then combine this with the likelihood to get the posterior P(θ|D), our updated belief after observing the data.
Key Components of Bayesian Statistics
The key components of this framework are:
Prior Probability (Prior): This represents the belief about the model before seeing
the data.
Likelihood: The probability of the data given the model.
Posterior Probability: The probability of the model given the data, obtained by
updating the prior with the likelihood using Bayes’ theorem.
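To see these components at work, here is a minimal numeric sketch of the prior → likelihood → posterior update for two candidate models; all the probability values are made up for illustration.

```python
# Numeric sketch of Bayes' theorem for two candidate models M1 and M2.
prior = {"M1": 0.5, "M2": 0.5}        # P(model): belief before seeing the data
likelihood = {"M1": 0.8, "M2": 0.3}   # P(D | model) for the observed data D

# Marginal likelihood P(D) = sum of P(D | model) * P(model) over the models.
p_data = sum(likelihood[m] * prior[m] for m in prior)

# Posterior P(model | D): prior updated with the likelihood.
posterior = {m: likelihood[m] * prior[m] / p_data for m in prior}
print(posterior)  # {'M1': ~0.727, 'M2': ~0.273}
```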
Application of Bayesian Model Selection in Machine Learning
1. Model Comparison: Used to compare different machine learning models (e.g., linear
regression, neural networks, decision trees) to identify the model that best explains the
data.
2. Hyperparameter Tuning: Bayesian optimization can be used for hyperparameter
tuning by treating hyperparameters as random variables and optimizing their posterior
distribution.
3. Ensemble Methods: Bayesian model averaging combines multiple models by
weighting them according to their posterior probabilities, leading to more robust
predictions.
4. Feature Selection: Bayesian methods can be used for feature selection by comparing
models with different subsets of features.
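For model comparison (application 1 above), the quantity that gets compared is each model's marginal likelihood. Below is a toy sketch comparing a fair-coin model against a biased-coin model with a uniform prior; the flip data and both candidate models are assumptions for illustration.

```python
# Sketch of Bayesian model comparison via marginal likelihoods (toy coin example).
import numpy as np

heads, tails = 9, 1  # hypothetical observed coin flips

# Model 1: coin is fair (θ fixed at 0.5), so the marginal likelihood is direct.
m1 = 0.5 ** (heads + tails)

# Model 2: unknown bias θ with a uniform prior. The marginal likelihood
# integrates the likelihood over θ; we approximate the integral on a grid.
theta = np.linspace(0.001, 0.999, 999)
m2 = np.trapz(theta**heads * (1 - theta)**tails, theta)

# A Bayes factor above 1 means the data favor the biased-coin model.
print("Bayes factor M2/M1:", m2 / m1)  # ≈ 9.3 here
```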
Conclusion
Bayesian Model Selection offers a robust framework for dealing with the complexities
inherent in statistical model comparison. By effectively integrating prior knowledge and
assessing model plausibility through the lens of probability, it provides a powerful tool for
many scientific and engineering disciplines. As computational resources continue to improve,
its applicability and popularity are likely to grow, making it a cornerstone in the field of
statistical inference.
A Bayesian Belief Network is a graphical representation of the probabilistic relationships among random variables in a particular set. It can be used as a classifier, and it encodes conditional independence: each node depends directly only on its parent nodes. Because the network factorizes the joint probability, each probability in a Bayesian Belief Network is specified as a conditional probability P(attribute | parent), i.e. the probability of an attribute given its parent attribute(s).
(Note: A classifier assigns data in a collection to desired categories.)
Consider this example:
Consider an alarm ‘A’ – a node, say installed in the house of a person ‘gfg’ – which rings upon two events, burglary ‘B’ and fire ‘F’; these are the parent nodes of the alarm node. The alarm node, in turn, is the parent of two person nodes, ‘P1’ (P1 calls ‘gfg’) and ‘P2’ (P2 calls ‘gfg’).
Upon the instance of a burglary or fire, ‘P1’ and ‘P2’ call the person ‘gfg’. But there are a few drawbacks in this case: sometimes ‘P1’ may forget to call the person ‘gfg’ even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, ‘P2’ sometimes fails to call the person ‘gfg’, as he is only able to hear the alarm from a certain distance.
Q) Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true (P2 has called ‘gfg’) when the alarm ‘A’ rang, but no burglary ‘B’ and no fire ‘F’ occurred.
=> P(P1, P2, A, ~B, ~F) [where P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’ events]
[Note: The values mentioned below are neither calculated nor computed; they are observed values.]
Burglary ‘B’ –
P(B=T) = 0.001 (‘B’ is true, i.e. a burglary has occurred)
P(B=F) = 0.999 (‘B’ is false, i.e. no burglary has occurred)
Fire ‘F’ –
P(F=T) = 0.002 (‘F’ is true, i.e. a fire has occurred)
P(F=F) = 0.998 (‘F’ is false, i.e. no fire has occurred)
Alarm ‘A’ –
B   F   P(A=T)   P(A=F)
T   T   0.95     0.05
T   F   0.94     0.06
F   T   0.29     0.71
F   F   0.001    0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e. it may have rung or may not have rung). It has two parent nodes, burglary ‘B’ and fire ‘F’, which can be ‘true’ or ‘false’ (i.e. they may have occurred or may not have occurred) depending upon different conditions.
Person ‘P1’ –
A   P(P1=T)   P(P1=F)
T   0.95      0.05
F   0.05      0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e. P1 may have called the person ‘gfg’ or not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. it may have rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A   P(P2=T)   P(P2=F)
T   0.80      0.20
F   0.01      0.99
The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. P2 may have called the person ‘gfg’ or not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. it may have rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
Solution: Considering the observed probabilities above –
With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of ‘P1’, which we find with regard to its parent node, the alarm ‘A’. Likewise, to get the probability of ‘P2’, we find it with regard to its parent node, the alarm ‘A’.
We find the probability of the alarm ‘A’ node with regard to ‘~B’ & ‘~F’, since burglary ‘B’ and fire ‘F’ are the parent nodes of the alarm ‘A’.
From the observed probabilities, we can deduce –
P(P1, P2, A, ~B, ~F)
= P(P1|A) × P(P2|A) × P(A|~B, ~F) × P(~B) × P(~F)
= 0.95 × 0.80 × 0.001 × 0.999 × 0.998
≈ 0.00076
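To make the factorization concrete, the sketch below encodes the conditional probability tables above as plain Python dictionaries and multiplies out the same joint probability.

```python
# Joint-probability computation for the belief network above (CPT values from the tables).
p_b = {True: 0.001, False: 0.999}                  # P(B)
p_f = {True: 0.002, False: 0.998}                  # P(F)
p_a = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | B, F)
       (False, True): 0.29, (False, False): 0.001}
p_p1 = {True: 0.95, False: 0.05}                   # P(P1=T | A)
p_p2 = {True: 0.80, False: 0.01}                   # P(P2=T | A)

# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = p_p1[True] * p_p2[True] * p_a[(False, False)] * p_b[False] * p_f[False]
print(joint)  # ≈ 0.000758, matching the hand computation above
```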