Data Analytics and Visualization
IMBA III SEM
Prepared By
S Naresh Kumar
Department of Computer Science and Engineering
UNIT-III
Regression Equations: Linear models, ordinary least squares,
Gaussian Process regression.
Machine Learning
Machine Learning allows a machine to learn from examples and
experience without being explicitly programmed.
Machine Learning (ML) is a subset of Artificial Intelligence (AI), and Deep
Learning (DL) is a subset of ML. The main types of learning are:
Supervised
Unsupervised
Semi-supervised
Reinforcement
Some commonly used algorithms are:
Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
K-Means
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost
The machine learning process is mainly divided into two phases:
Training Phase
Testing Phase
Training Phase
You take a randomly selected sample of apples from the market (training
data) and make a table of the physical characteristics of each apple, such as color,
size, shape, where in the country it was grown, which vendor sold it, etc.
(features), along with the sweetness, juiciness and ripeness of that apple
(output variables).
You feed this data to the machine learning algorithm
(classification/regression), and it learns a model of the correlation
between an average apple's physical characteristics and its quality.
Testing Phase:
The next time you go shopping, you measure the characteristics of the
apples you are purchasing (test data) and feed them to the machine
learning algorithm.
It uses the model computed earlier to predict whether the apples are
sweet, ripe and/or juicy.
The algorithm may internally use rules similar to the one you manually
wrote earlier (e.g., a decision tree).
Finally, you can now shop for apples with great confidence, without worrying
about the details of how to choose the best apples.
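As a minimal sketch (not part of the original apple example), the training and testing
phases might look like this in scikit-learn; the feature order [color score, size in cm,
firmness], the values and the labels are invented purely for illustration:

from sklearn.tree import DecisionTreeClassifier

# Training phase: features of sampled apples and the observed quality label.
X_train = [[0.8, 7.1, 0.3],   # [color score, size in cm, firmness] - invented values
           [0.4, 6.0, 0.9],
           [0.9, 7.5, 0.2],
           [0.3, 5.5, 0.8]]
y_train = ["sweet", "not sweet", "sweet", "not sweet"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # the algorithm learns the model

# Testing phase: characteristics of apples seen while shopping (test data).
X_test = [[0.85, 7.0, 0.25]]
print(model.predict(X_test))         # predicted quality of the new apple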
Linear regression
•Linear regression is a linear model.
•Linear regression methods attempt to solve the regression problem
by assuming that the dependent variable is (at least to some
approximation) a linear function of the independent variables.
•i.e., a model that assumes a linear relationship between the input
variables (x) and the single output variable (y); more specifically,
that y can be calculated from a linear combination of the input
variables (x).
A linear regression line has an equation of the form
Y = a + bX
where X is the explanatory (independent) variable and Y is the dependent
variable. The slope of the line is b (the regression coefficient), and a is the
intercept (constant).
The dependent variable in regression may also be called an outcome variable,
criterion variable, endogenous variable, or regressand.
The independent variables may be called exogenous variables, predictor
variables, or regressors.
Regression can be simple linear regression or multiple linear regression.
• When a single input variable (x) is used to predict the value of the
dependent variable, the method is referred to as simple linear
regression:
Y = c + bX
• When multiple input variables are used to predict the value of the
dependent variable, the method is referred to as multiple linear
regression:
Y = c + b1X1 + b2X2 + … + bnXn
• The difference between the two is the number of independent
variables. In both cases there is only a single dependent
variable (see the sketch after this list).
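A small sketch of both cases with scikit-learn's LinearRegression; the numbers below
are made up, X_simple has one feature per observation and X_multi has two:

from sklearn.linear_model import LinearRegression

# Simple linear regression: a single input variable x.
X_simple = [[1], [2], [3], [4]]          # one feature per observation
y = [3, 5, 7, 9]                         # here y = 1 + 2x exactly
simple = LinearRegression().fit(X_simple, y)
print(simple.intercept_, simple.coef_)   # c and b in Y = c + bX

# Multiple linear regression: several input variables per observation.
X_multi = [[1, 10], [2, 12], [3, 9], [4, 15]]   # two features per observation
multi = LinearRegression().fit(X_multi, y)
print(multi.intercept_, multi.coef_)     # one coefficient per feature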
The core idea is to obtain a line that best fits the data. The best-fit line is the one
for which the total prediction error (over all data points) is as small as possible.
Error is the distance from a point to the regression line (least-squares error).
Least-Squares Regression
The most common method for fitting a
regression line(best fit) is the method of
least-squares. This method calculates the
best-fitting line for the observed data by
minimizing the sum of the squares of the
vertical deviations from each data point
to the line
from sklearn import linear_model
regr = linear_model.LinearRegression()  # ordinary least-squares regression in scikit-learn
The least-squares regression line for a set of N data points is given by the equation of a line in slope-intercept
form:
y = mx + b
Steps To find the line of best fit for N points:
Step 1: For each (x, y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy
(Σ means "sum up")
Step 3: Calculate Slope m:
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
(N is the number of points.)
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line
y = mx + b
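A worked sketch of Steps 1-5 on a small made-up data set, checked against
scikit-learn's ordinary least-squares fit:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data points for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
N = len(x)

# Steps 1-2: the sums needed by the formulas.
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

# Step 3: slope m.  Step 4: intercept b.
m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / N
print("by hand:", m, b)                    # m = 0.6, b = 2.2

# Step 5 via scikit-learn: the same line.
regr = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", regr.coef_[0], regr.intercept_)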
Problems 1-3: worked examples.
Advantages of Linear Regression
1. Linear Regression performs well when the relationship in the data is
(approximately) linear. We can use it to find the nature of the relationship
among the variables.
2. Linear Regression is easier to implement, interpret and very efficient
to train.
3. Linear Regression is prone to over-fitting, but this can easily be avoided
using dimensionality reduction techniques, regularization (L1 and L2)
and cross-validation, as sketched below.
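A hedged sketch of point 3: comparing plain ordinary least squares with L2 (Ridge)
and L1 (Lasso) regularization under 5-fold cross-validation, on synthetic data
generated only for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: 30 observations, 20 features, only two of which matter.
rng = np.random.RandomState(0)
X = rng.randn(30, 20)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(30)

for name, model in [("plain OLS ", LinearRegression()),
                    ("L2 / Ridge", Ridge(alpha=1.0)),
                    ("L1 / Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5)        # 5-fold cross-validation
    print(name, "mean R^2:", round(scores.mean(), 3))  # higher is better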
Disadvantages of Linear Regression
1. Linear regression requires the dependent variable to be a continuous
numerical variable.
2. Linear Regression is limited to linear relationships between the dependent
variable and the independent variables. In the real world, the relationship is
rarely exactly linear; the assumed straight-line relationship between the
dependent and independent variables is often incorrect.
3. Non-linear data cannot be fitted well, so you first need to determine whether
the relationship between the variables is linear.
4. Prone to outliers: Linear regression is very sensitive to outliers (anomalies).
E.g. if most of your data lies in the range (20, 50) on the x-axis, but you have one
or two points out at x = 200, this can significantly swing your regression results
(see the sketch after this list).
5. Prone to multicollinearity: Before applying Linear regression,
multicollinearity should be removed (using dimensionality reduction
techniques) because it assumes that there is no relationship among independent
variables.
6. Prone to noise and overfitting: If the number of observations is smaller than
the number of features, Linear Regression should not be used; it may overfit,
because in this scenario it starts fitting the noise while building the model.
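A small sketch of point 4: the same least-squares fit with and without a single
outlier at x = 200 (the data is invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Points that follow y = 2x closely, with x in the range (20, 50).
x = np.arange(20.0, 51.0)
y = 2 * x + np.random.RandomState(1).randn(len(x))

# The same data plus one outlier at x = 200.
x_out = np.append(x, 200.0)
y_out = np.append(y, 0.0)

clean = LinearRegression().fit(x.reshape(-1, 1), y)
dirty = LinearRegression().fit(x_out.reshape(-1, 1), y_out)
print("slope without the outlier:", clean.coef_[0])   # close to 2
print("slope with the outlier:   ", dirty.coef_[0])   # swung far away from 2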
Gaussian Processes (GP)
Gaussian Processes (GP) are a generic supervised learning method
designed to solve regression and probabilistic classification problems.
•A Gaussian process is a probability distribution over possible functions.
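To illustrate "a distribution over possible functions", the sketch below (an
illustration only, not the library's internals) draws three random functions from a
zero-mean GP prior with a squared-exponential (RBF) kernel, using only NumPy:

import numpy as np

x = np.linspace(-5, 5, 100)               # points at which the functions are evaluated

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance between every pair of input points.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

K = rbf_kernel(x, x)

# Each draw from this multivariate normal is one "possible function"
# evaluated at the points in x (a small jitter keeps K positive definite).
samples = np.random.RandomState(0).multivariate_normal(
    mean=np.zeros(len(x)), cov=K + 1e-8 * np.eye(len(x)), size=3)
print(samples.shape)                      # (3, 100): three sampled functions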
The normal distribution, also known as
the Gaussian distribution, is a
probability distribution that is symmetric
about the mean, showing that data near
the mean are more frequent in occurrence
than data far from the mean. In graph
form, the normal distribution appears as a
bell curve.
In probability theory, a normal (or Gaussian or Gauss or Laplace–
Gauss) distribution is a type of continuous probability distribution for a
real-valued random variable. The general form of its
probability density function is
f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))
The parameter μ is the mean or expectation of the distribution (and also its
median and mode), while the parameter σ is its standard deviation. The
variance of the distribution is σ². A random variable with a Gaussian
distribution is said to be normally distributed, and is called a normal
deviate.
The normal distribution curve is drawn between the x values and the
corresponding y values, whereas the Gaussian distribution curve is drawn
between the random variable x and the corresponding PDF values.
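A small sketch evaluating the density formula given above; the default values of
mu and sigma are arbitrary choices for illustration:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(0.0))                    # highest density at the mean, about 0.399
print(normal_pdf(1.0), normal_pdf(-1.0))  # symmetric about the mean, about 0.242 each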
Gaussian processes are computationally expensive
The advantages of Gaussian processes are:
•The prediction interpolates the observations (at least for regular kernels).
•The prediction is probabilistic (Gaussian) so that one can compute
empirical confidence intervals and decide based on those if one should refit
(online fitting, adaptive fitting) the prediction in some region of interest.
•Versatile: different kernels can be specified. Common kernels are provided,
but it is also possible to specify custom kernels.
The disadvantages of Gaussian processes include:
•They are not sparse, i.e., they use the whole samples/features information
to perform the prediction.
•They lose efficiency in high-dimensional spaces, namely when the
number of features exceeds a few dozen.
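A minimal sketch of GP regression with scikit-learn, assuming an RBF kernel and
made-up one-dimensional data; return_std gives the per-point uncertainty from which
empirical confidence intervals can be formed:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up one-dimensional training data.
X_train = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# Probabilistic prediction: a mean and a standard deviation per query point.
X_new = np.array([[2.0], [4.0], [7.0]])
mean, std = gp.predict(X_new, return_std=True)
print(mean)
print(mean - 1.96 * std, mean + 1.96 * std)   # a ~95% confidence interval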
Gaussian processes are a non-parametric method.
Parametric approaches distill knowledge about the training data
into a set of numbers. For linear regression this is just two numbers, the
slope and the intercept, whereas other approaches like neural networks
may have tens of millions of parameters. This means that after they are trained,
the cost of making predictions depends only on the number of parameters.
However, as Gaussian processes are non-parametric (although
kernel hyperparameters blur the picture), they need to take into account
the whole training data each time they make a prediction. This means not
only that the training data has to be kept at inference time, but also
that the computational cost of predictions scales (cubically!) with the
number of training samples.
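A short illustration of this contrast, using attributes that scikit-learn's fitted
models expose: after fitting, the linear model is summarised by two numbers, while
the GP keeps the entire training set for use at prediction time:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor

X = np.random.RandomState(0).rand(50, 1)
y = 2 * X.ravel() + 1

lin = LinearRegression().fit(X, y)
gp = GaussianProcessRegressor().fit(X, y)

# Parametric: the fitted line is fully described by a slope and an intercept.
print(lin.coef_, lin.intercept_)

# Non-parametric: the fitted GP stores the whole training set, so the cost of
# its predictions grows with the number of training samples.
print(gp.X_train_.shape)                  # (50, 1)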