ML Unit 1
UNIT - I
Learning –
The Machine Learning Tutorial covers both the fundamentals and more complex ideas
of machine learning. Students and professionals in the workforce can benefit from our
machine learning tutorial.
You will learn about the many different methods of machine learning, including
reinforcement learning, supervised learning, and unsupervised learning, in this
machine learning tutorial. Regression and classification models, clustering techniques,
hidden Markov models, and various sequential models will all be covered.
previous experiences. Arthur Samuel first used the term "machine learning" in 1959. It
could be summarized as follows:
Based on the methods and way of learning, machine learning is divided into mainly
four types, which are:
The main goal of the supervised learning technique is to map the input
variable (x) to the output variable (y). Some real-world applications of supervised
learning are risk assessment, fraud detection, and spam filtering.
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems in which the
output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc.
The classification algorithms predict the categories present in the dataset. Some
real-world examples of classification algorithms are spam detection and email filtering.
b) Regression
Regression algorithms are used to solve regression problems in which the output
variable is continuous and depends on the input variables; linear regression is the
special case that assumes a linear relationship between input and output. These
algorithms are used to predict continuous output variables, such as market trends or
weather quantities.
o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
In unsupervised learning, the models are trained with the data that is neither classified
nor labelled, and the model acts on that data without any supervision.
So, now the machine will discover its patterns and differences, such as colour
difference, shape difference, and predict the output when it is tested with the test
dataset.
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by
their purchasing behaviour.
2) Association
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate because the dataset is not
labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult because it works with an unlabelled
dataset that is not mapped to any output.
Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning. It represents the
intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabeled datasets during the training period.
Disadvantages:
4. Reinforcement Learning
In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.
The reinforcement learning process is similar to the way a human being learns; for
example, a child learns various things through experience in day-to-day life. An example
of reinforcement learning is playing a game, where the game is the environment, the moves
of the agent at each step define states, and the goal of the agent is to get a high score.
The agent receives feedback in the form of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such
as Game theory, Operation Research, Information theory, multi-agent systems.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how RL can
be used to automatically learn to allocate and schedule computer resources among
waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.
o Text Mining
Text mining, one of the major applications of NLP, is now being implemented with the
help of reinforcement learning by Salesforce.
Disadvantage
Artificial neural network tutorial covers all the aspects related to the artificial neural
network. In this tutorial, we will discuss ANNs, Adaptive resonance theory, Kohonen
self-organizing map, Building blocks, unsupervised learning, Genetic algorithm, etc.
Biological neuron    Artificial neuron
Dendrites            Inputs
Synapse              Weights
Axon                 Output
The human brain contains roughly 86 billion neurons (a figure often rounded to 100
billion). Each neuron has somewhere between 1,000 and 100,000 connection points. In the
human brain, data is stored in a distributed manner, and we can retrieve more than one
piece of this data from memory in parallel when necessary. In this sense the human
brain is made up of remarkably capable parallel processors.
We can understand the artificial neural network with an example. Consider a digital
logic gate that takes inputs and gives an output, such as an "OR" gate, which takes
two inputs. If one or both of the inputs are "On," then the output is "On." If both
inputs are "Off," then the output is "Off." Here the output depends only on the inputs.
Our brain does not work this way: the relationship between outputs and inputs keeps
changing because the neurons in our brain are "learning."
Input Layer:
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs
and includes a bias. This computation is represented in the form of a transfer function.
Artificial neural networks have the numerical capacity to perform more than one task
simultaneously.
Unlike traditional programming, where data is stored in a database, an ANN stores its
information across the whole network. The disappearance of a few pieces of data in one
place doesn't prevent the network from working.
After training, an ANN may produce output even with incomplete data. The loss of
performance here depends on the significance of the missing data.
Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault tolerant.
This is the most significant issue with ANNs. When an ANN produces a solution, it gives
no insight into why or how that solution was reached, which decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, in keeping
with their structure. The realization of the network therefore depends on suitable hardware.
ANNs can work only with numerical data. Problems must be converted into numerical
values before being introduced to the ANN. The representation chosen here directly
impacts the performance of the network, and it relies on the user's abilities.
The network is reduced to a specific value of the error, and this value does not give us
optimum results.
Artificial neural networks, which entered the scientific world in the mid-20th century, are
developing rapidly. So far we have looked at the advantages of artificial neural networks and
the issues encountered in using them. It should not be overlooked that the disadvantages of
ANNs, a flourishing branch of science, are being eliminated one by one while their advantages
keep growing. This suggests that artificial neural networks will progressively become an
important and irreplaceable part of our lives.
If the weighted sum is equal to zero, a bias is added to make the output non-zero, or
otherwise to scale up the system's response. The bias can be treated as an extra input
that is always 1 and has its own weight. The total of the weighted inputs is unbounded
in general. To keep the response within the limits of the desired value, a certain
maximum value is benchmarked, and the total of the weighted inputs is passed through
the activation function.
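The computation described above can be sketched in a few lines. This is only an illustration: the sigmoid is one possible activation function, and the input values, weights, and function name are assumptions chosen for the example.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The sigmoid keeps the response inside the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# With all-zero inputs the weighted sum is 0, so only the bias shifts the output
no_bias = neuron_output([0.0, 0.0], [0.4, -0.2], bias=0.0)
with_bias = neuron_output([0.0, 0.0], [0.4, -0.2], bias=2.0)
print(no_bias)    # 0.5
print(with_bias)
```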
The first design choice we face is to choose the type of training experience from which our
system will learn. The type of training experience available can have a significant impact on
success or failure of the learner. One key attribute is whether the training experience provides
direct or indirect feedback regarding the choices made by the performance system. For
example, in learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each.
In order to complete the design of the learning system, we must now choose
3. a learning mechanism
The next design choice is to determine exactly what type of knowledge will be learned and
how this will be used by the performance program. Let us begin with a checkers-playing
program that can generate the legal moves from any board state. The program needs only to
learn how to choose the best move from among these legal moves.
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next
turn)
Thus, our learning program will represent V(b) as a linear function of the form
V(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning
algorithm.
In order to learn the target function f we require a set of training examples, each describing a
specific board state b and the training value Vtrain(b) for b.
The Performance System is the module that must solve the given performance task, in this
case playing checkers, by using the learned target function(s). It takes an instance of a new
problem (new game) as input and produces a trace of its solution (game history) as output.
The Generalizer takes as input the training examples and produces an output hypothesis that
is its estimate of the target function. It generalizes from the specific training examples,
hypothesizing a general function that covers these examples and other cases beyond the
training examples.
The Experiment Generator takes as input the current hypothesis (currently learned
function) and outputs a new problem (i.e., initial board state) for the Performance System to
explore. Its role is to pick new practice problems that will maximize the learning rate of the
overall system.
One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge
held by the learner. For example, consider the space of hypotheses that could in principle be
output by the above checkers learner. This hypothesis space consists of all evaluation
functions that can be represented by some choice of values for the weights w0 through w6.
The learner's task is thus to search through this vast space to locate the hypothesis that is most
consistent with the available training examples. The LMS algorithm for fitting weights
achieves this goal by iteratively tuning the weights, adding a correction to each weight each
time the hypothesized evaluation function predicts a value that differs from the training value.
This algorithm works well when the hypothesis representation considered by the learner
defines a continuously parameterized space of potential hypotheses.
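As a rough sketch of the LMS idea described above, each weight can be corrected in proportion to the prediction error and to the feature it multiplies. The feature values, training value, and learning rate below are hypothetical numbers chosen only for illustration.

```python
def evaluate(weights, features):
    """Linear evaluation V(b) = w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, learning_rate):
    """One LMS step: w_i <- w_i + rate * (Vtrain(b) - V(b)) * x_i."""
    error = v_train - evaluate(weights, features)
    updated = [weights[0] + learning_rate * error]  # x0 is implicitly 1
    updated += [w + learning_rate * error * x
                for w, x in zip(weights[1:], features)]
    return updated

# Hypothetical board features x1..x6 and a hypothetical training value
weights = [0.0] * 7
features = [3, 0, 1, 0, 0, 0]
for _ in range(50):
    weights = lms_update(weights, features, v_train=100.0, learning_rate=0.05)

prediction = evaluate(weights, features)
print(round(prediction, 3))
```

Repeating the update on the same example shrinks the error at every step, so the prediction approaches the training value.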
What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the
character of the learner's hypothesis space?
When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is only
approximately correct?
What is the best strategy for choosing a useful next training experience, and how does
the choice of this strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should the system
attempt to learn? Can this process itself be automated?
How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
Concept Learning:
Concept learning: Inferring a boolean-valued function from training examples of its input
and output.
What hypothesis representation shall we provide to the learner in this case? Let us begin by
considering a simple representation in which each hypothesis consists of a conjunction of
constraints on the instance attributes. In particular, let each hypothesis be a vector of six
constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water,
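and Forecast. For each attribute, the hypothesis will either
indicate by a "?" that any value is acceptable for this attribute,
specify a single required value (e.g., Warm) for the attribute, or
indicate by a "Ø" that no value is acceptable.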
The inductive learning hypothesis. Any hypothesis found to approximate the target
function well over a sufficiently large set of training examples will also approximate the
target function well over other unobserved examples.
General-to-Specific Ordering of Hypotheses. Many algorithms for concept learning
organize the search through the hypothesis space by relying on a very useful structure that
exists for any concept learning problem: a general-to-specific ordering of hypotheses. By
taking advantage of this naturally occurring structure over the hypothesis space, we can
design learning algorithms that exhaustively search even infinite hypothesis spaces without
explicitly enumerating every hypothesis.
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Now consider the sets of instances that are classified positive by h1 and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any
instance classified positive by h1 will also be classified positive by h2. Therefore, we say that
h2 is more general than h1.
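The comparison between h1 and h2 can be checked mechanically. The sketch below assumes hypotheses are tuples whose entries are either a required value or "?", and it does not handle the "no value is acceptable" constraint. The sample instance is a made-up day used only for illustration.

```python
def matches(hypothesis, instance):
    """True if the instance satisfies every constraint ('?' accepts any value)."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

def more_general_or_equal(h_a, h_b):
    """True if every instance classified positive by h_b is also positive under h_a."""
    return all(a == "?" or a == b for a, b in zip(h_a, h_b))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")

# Hypothetical instance: a sunny day with weak wind
day = ("Sunny", "Warm", "High", "Weak", "Warm", "Same")

print(more_general_or_equal(h2, h1))   # h2 covers everything h1 covers
print(more_general_or_equal(h1, h2))   # the reverse does not hold
print(matches(h1, day), matches(h2, day))
```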
FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS
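A minimal sketch of how FIND-S can proceed is given below: start from the most specific hypothesis and generalize it just enough to cover each positive example, ignoring negative examples. The four training examples are an assumption, taken from the standard EnjoySport illustration of these six attributes; they are not listed in this text.

```python
def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positive examples."""
    hypothesis = None  # None stands for the most specific hypothesis (accepts nothing)
    for instance, label in examples:
        if not label:
            continue  # FIND-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(instance)  # the first positive example is copied as-is
        else:
            # Replace any constraint the example violates with '?'
            hypothesis = [h if h == x else "?" for h, x in zip(hypothesis, instance)]
    return tuple(hypothesis) if hypothesis is not None else None

# Assumed training data: (Sky, AirTemp, Humidity, Wind, Water, Forecast), label
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

result = find_s(examples)
print(result)
```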
The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are
consistent with the observed training examples. In order to define this algorithm precisely, we
begin with a few basic definitions. First, let us say that a hypothesis is consistent with the
training examples if it correctly classifies these examples.
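The definition of consistency above can be checked directly in code. The sketch below is a brute-force enumeration, not the boundary-set method of CANDIDATE-ELIMINATION: it simply lists every conjunctive hypothesis that is consistent with a small training set. The attribute domains and the four training examples are assumptions taken from the standard EnjoySport illustration; they are not given in this text.

```python
from itertools import product

def matches(hypothesis, instance):
    """True if the instance satisfies every constraint ('?' accepts any value)."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

def consistent(hypothesis, examples):
    """A hypothesis is consistent if it classifies every training example correctly."""
    return all(matches(hypothesis, x) == label for x, label in examples)

def version_space(domains, examples):
    """Enumerate all value-or-'?' hypotheses and keep the consistent ones."""
    choices = [values + ["?"] for values in domains]
    return [h for h in product(*choices) if consistent(h, examples)]

# Assumed attribute domains: Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = [
    ["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
    ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"],
]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

vs = version_space(DOMAINS, EXAMPLES)
for h in vs:
    print(h)
```

Enumeration is only practical for tiny hypothesis spaces like this one, which is why a more compact representation of the consistent set is useful.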
CANDIDATE-ELIMINATION Learning Algorithm:
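The following is a simplified sketch of the boundary-set idea, assuming noise-free data and the conjunctive value-or-"?" representation, in which the specific boundary S is a single hypothesis and the general boundary G is a set. It omits the consistency checks a complete implementation would need. The domains and training examples are assumed from the standard EnjoySport illustration and are not given in this text.

```python
def matches(h, x):
    return all(a == "?" or a == b for a, b in zip(h, x))

def more_general_or_equal(h_a, h_b):
    return all(a == "?" or a == b for a, b in zip(h_a, h_b))

def candidate_elimination(domains, examples):
    S = None                          # None stands for the most specific hypothesis
    G = {("?",) * len(domains)}       # start with the most general hypothesis
    for x, label in examples:
        if label:
            # Positive example: drop members of G that reject it, generalize S minimally
            G = {g for g in G if matches(g, x)}
            S = tuple(x) if S is None else tuple(
                s if s == v else "?" for s, v in zip(S, x))
        else:
            # Negative example: minimally specialize each member of G that accepts it
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                    continue
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for v in values:
                        candidate = g[:i] + (v,) + g[i + 1:]
                        if v != x[i] and (S is None or more_general_or_equal(candidate, S)):
                            new_G.add(candidate)
            # Keep only the maximally general members
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G

DOMAINS = [
    ["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
    ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"],
]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

S, G = candidate_elimination(DOMAINS, EXAMPLES)
print("S:", S)
print("G:", sorted(G))
```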
What Training Example Should the Learner Request Next? Up to this point we have
assumed that training examples are provided to the learner by some external teacher. Suppose
instead that the learner is allowed to conduct experiments in which it chooses the next
instance, then obtains the correct classification for this instance from an external oracle (e.g.,
nature or a teacher).
How Can Partially Learned Concepts Be Used? Suppose that no additional training
examples are available beyond the four in our example above, but that the learner is now
required to classify new instances that it has not yet observed. Even though the version space
still contains multiple hypotheses, indicating that the target concept has not yet been fully
learned, it is possible to classify certain examples with the same degree of confidence as if
the target concept had been uniquely identified.
INDUCTIVE BIAS:
Linear Discriminant Analysis (LDA) can be used to project the features of a
higher-dimensional space into a lower-dimensional space in order to reduce resource
and dimensionality costs. In this topic, "Linear Discriminant Analysis (LDA) in machine
learning", we will discuss the LDA algorithm for classification predictive modeling
problems, the limitations of logistic regression, the representation of the linear
discriminant analysis model, how to make a prediction using LDA, how to prepare data
for LDA, extensions to LDA, and much more. So, let's start with a quick introduction to
Linear Discriminant Analysis (LDA) in machine learning.
To overcome the overlapping issue in the classification process, we must increase the
number of features regularly.
Example:
Let's assume we have to classify two different classes, each with its own set of data
points, in a 2-dimensional plane.
It may be impossible to draw a straight line in the 2-D plane that separates these
data points efficiently. Using linear discriminant analysis, however, we can reduce the
2-D plane to a 1-D line. Using this technique, we can also maximize the separability
between multiple classes.
Let's consider an example where we have two classes in a 2-D plane with X and Y axes,
and we need to classify them efficiently. As we have already seen in the above
example, LDA enables us to draw a straight line that can separate the two classes of
data points. Here, LDA uses the X and Y axes to create a new axis and projects the
data onto that new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
Using the above two conditions, LDA generates a new axis in such a way that it can
maximize the distance between the means of the two classes and minimizes the
variation within each class.
In other words, we can say that the new axis will increase the separation between the
data points of the two classes and plot them onto the new axis.
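The two conditions above correspond to Fisher's linear discriminant for two classes. The sketch below is a minimal NumPy illustration, not a full LDA implementation: it computes the new axis as the within-class scatter matrix solved against the difference of the class means, then projects the points onto it. The data points and the function name are made up for the example.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Unit vector that separates the class means while keeping each class compact."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of the two classes
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Solve Sw @ w = (m1 - m0) instead of inverting Sw explicitly
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Hypothetical 2-D data for two classes
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]])
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 6.0]])

w = fisher_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w          # 1-D projections onto the new axis
print("new axis:", w)
print("classes separated on the new axis:", bool(z0.max() < z1.min()))
```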
Why LDA?
o Logistic Regression is one of the most popular classification algorithms. It performs
well for binary classification but falls short on multi-class classification problems
with well-separated classes, whereas LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
But LDA also fails in some cases where the Mean of the distributions is shared. In this
case, LDA fails to create a new axis that makes both the classes linearly separable.
1. Quadratic Discriminant Analysis (QDA): For multiple input variables, each class
uses its own estimate of variance (or covariance).
2. Flexible Discriminant Analysis (FDA): This is used when non-linear combinations of
inputs, such as splines, are used.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate of
the variance (actually covariance) and hence moderates the influence of different
variables on LDA.
o Face Recognition
Face recognition is a popular application of computer vision, where each face is
represented as a combination of a large number of pixel values. In this case, LDA is
used to reduce the number of features to a manageable number before the
classification process. It generates a new template in which each dimension
consists of a linear combination of pixel values. If a linear combination is generated
using Fisher's linear discriminant, it is called a Fisherface.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on
the basis of various parameters of patient health and the medical treatment which is
going on. On such parameters, it classifies disease as mild, moderate, or severe. This
classification helps the doctors in either increasing or decreasing the pace of the
treatment.
o Customer Identification
LDA is currently being applied in customer identification. With the help of LDA, we
can identify and select the features that characterize the group of customers who
are likely to purchase a specific product in a shopping mall.
o For Predictions
LDA can also be used for making predictions, and so in decision making. For example,
"Will you buy this product?" gives a predicted result in one of two possible
classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human work,
and it can also be considered a classification problem. In this case, LDA builds similar
groups on the basis of different parameters, including pitches, frequencies, sound,
tunes, etc.
The Perceptron model is treated as one of the best and simplest types of artificial
neural networks. It is a supervised learning algorithm for binary classifiers.
Hence, we can consider it a single-layer neural network with four main parameters:
input values, weights and bias, net sum, and an activation function.
This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.
The weight parameter represents the strength of the connection between units and is
another important Perceptron component. A weight is directly proportional to the
strength of the associated input neuron in deciding the output. Further, the bias can
be considered as the intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the
neuron will fire or not. Activation Function can be considered primarily as a step
function.
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and
the desired outputs. The activation function may differ between perceptron models
(e.g., Sign, Step, or Sigmoid) depending on whether the learning process is slow or
suffers from vanishing or exploding gradients.
The step function or activation function plays a vital role in ensuring that the output
is mapped to the required range, such as (0,1) or (-1,1). It is important to note that the
weight of an input is indicative of the strength of a node. Similarly, the bias value gives
the ability to shift the activation function curve up or down.
Step-1
First, multiply all input values by their corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted
sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
An activation function f is applied to the above weighted sum, which gives the output:
Y = f(∑wi*xi + b)
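The two steps can be sketched as a small function. In this illustration the activation f is a step function, and the weights and bias are hand-picked (not learned) so that the perceptron reproduces an OR gate; all names are assumptions made for the example.

```python
def step(z):
    """Step activation: fire (1) when the net input is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    # Step-1: weighted sum of the inputs plus the bias
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step-2: apply the activation function to get the output Y
    return step(net)

# Hand-picked parameters that make the perceptron behave like an OR gate
weights, bias = [1.0, 1.0], -0.5

outputs = {x: perceptron(x, weights, bias)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for x, y in outputs.items():
    print(x, "->", y)
```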
Linear separability
Linear separability is an important concept in machine learning, particularly in the field
of supervised learning. It refers to the ability of a set of data points to be separated
into distinct categories using a linear decision boundary. In other words, if there exists
a straight line that can cleanly divide the data into two classes, then the data is said to
be linearly separable.
Linear separability is a concept in machine learning that refers to the ability to separate
data points in binary classification problems using a linear decision boundary. If the
data points can be separated using a line, linear function, or flat hyperplane, they are
considered linearly separable. Linear separability is an important concept in neural
networks, and it is introduced in the context of linear algebra and optimization theory.
Python provides several ways to determine whether data is linearly separable. One
method is linear programming, which defines an objective function subject to
constraints that encode linear separability. Another method is clustering: if two
clusters with a cluster purity of 100% can be found using a clustering method such
as k-means, then the data is linearly separable.
However, not all data sets are linearly separable. In some cases, it may be impossible
to draw a straight line that can separate the data into distinct categories. For example,
imagine a set of data points that are arranged in a circular pattern, with red and blue
points interspersed throughout. In this case, it is impossible to draw a straight line that
separates the data into two classes.
When faced with data that is not linearly separable, machine learning algorithms must
use more complex decision boundaries to accurately classify the data. For example, a
decision tree or a neural network may be able to accurately classify data that is not
linearly separable.
Linear separability is not only important in the context of machine learning, but it also
has applications in other fields such as physics, biology, and economics. For example,
in physics, linear separability can be used to analyze the relationship between two
physical quantities. In biology, it can be used to study the behavior of animals or to
analyze genetic data. In economics, it can be used to analyze the relationship between
two economic variables.
Example:
One way to test for linear separability is to use linear programming. Linear
programming defines an objective function subject to constraints that satisfy linear
separability. The scipy.optimize.linprog() function in Python can be used to solve linear
programming problems. Here's an example of using scipy.optimize.linprog() to test for
linear separability:
import numpy as np
from scipy.optimize import linprog

# Define the data points and their labels (+1 / -1)
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3]])
y = np.array([1, 1, -1, -1])

# Feasibility problem over the variables (w1, w2, b): the objective is all
# zeros because we only ask whether any separating line exists
n_samples, n_features = X.shape
c = np.zeros(n_features + 1)

# Encode y_i * (w . x_i + b) >= 1 in the form A @ [w, b] <= -1
A = np.zeros((n_samples, n_features + 1))
A[:, :-1] = -y[:, np.newaxis] * X
A[:, -1] = -y
b = -np.ones(n_samples)

# Solve the LP; linprog bounds variables to be non-negative by default,
# so the weights and bias are explicitly left unbounded
res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * (n_features + 1))

if res.success:
    print("The data is linearly separable.")
else:
    print("The data is not linearly separable.")
In this example, we define a set of data points X and their corresponding labels y. We
then define the objective function and constraints for the linear programming
problem. The objective function is a vector of zeros, because we only need to know
whether a feasible solution exists. The constraints require that the dot product of
each data point with the weight vector, plus the bias term, is greater than or equal
to 1 for positive examples and less than or equal to -1 for negative examples. The
weights and bias are left unbounded, since linprog otherwise restricts variables to
non-negative values. We then solve the linear programming problem using
scipy.optimize.linprog() and check if the solution is successful. If the solution is
successful, the data is linearly separable.
Another way to test for linear separability is to use a linear classifier or support vector
machine (SVM) with a linear kernel. For linearly separable datasets, a linear classifier or
SVM with a linear kernel can achieve 100% accuracy to classify data[4]. Here's an
example of using a linear SVM to test for linear separability:
In this example, we define a set of data points X and their corresponding labels y. We
then train a linear SVM using the svm.SVC() function from scikit-learn with a linear
kernel. We then check if the accuracy of the SVM on the training data is 100%.
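A minimal sketch of this check is shown below, reusing the data points from the linear programming example. It assumes scikit-learn is installed; the large value of C is an assumption made so that the soft-margin SVM behaves approximately like a hard-margin one, since with a small C the classifier may tolerate training errors even on separable data.

```python
import numpy as np
from sklearn import svm

# Same data points and labels as in the linear programming example
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3]])
y = np.array([1, 1, -1, -1])

# Linear-kernel SVM; a large C (assumed value) penalizes misclassification heavily
clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# 100% training accuracy indicates that a separating hyperplane was found
separable = clf.score(X, y) == 1.0
if separable:
    print("The data is linearly separable.")
else:
    print("The data is not linearly separable.")
```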
Moreover, it is important to note that linear separability is not the only criterion for the
effectiveness of a classification algorithm. In some cases, even if the data is linearly
separable, a linear classifier may not be the best choice. For example, if the data is
high-dimensional or contains complex nonlinear relationships, a nonlinear classifier
may be more effective. In such cases, machine learning algorithms such as decision
trees, random forests, or neural networks may be more appropriate.
or not it is linearly separable. This means that preprocessing techniques such as feature
selection,
The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Because the relationship is linear, the model finds how the value of the
dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Mathematically, we can represent a linear
regression as:
y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line
a1 = linear regression coefficient (slope of the line)
ε = random error
The values of the x and y variables are the training dataset for the linear regression
model representation.
Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find the
best-fit line. To do this, we use a cost function.
Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for
the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the line y = a0 + a1x, it can be written as:
MSE = (1/N) ∑ (yi − (a1xi + a0))^2
Where,
N = total number of observations
yi = actual value
(a1xi + a0) = predicted value
Residuals: The distance between an actual value and the predicted value is called the
residual. If the observed points are far from the regression line, the residuals will
be large, and so the cost function will be high. If the scatter points are close to the
regression line, the residuals will be small, and hence so will the cost function.
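As a small illustration of how the cost behaves, the sketch below computes the mean squared error of a candidate line on a made-up dataset. The data values and the function name are assumptions chosen so that the points lie exactly on y = 1 + 2x.

```python
def mean_squared_error(xs, ys, a0, a1):
    """Average squared residual between actual values and the line a0 + a1*x."""
    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(xs)

# Hypothetical observations lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

perfect_fit = mean_squared_error(xs, ys, a0=1, a1=2)   # every residual is 0
shifted_fit = mean_squared_error(xs, ys, a0=0, a1=2)   # every residual is 1
print(perfect_fit, shifted_fit)
```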
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts from randomly selected coefficient values and then iteratively updates
them to reach the minimum of the cost function.
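The update loop can be sketched as follows for a line y = a0 + a1x with the MSE cost. This is an illustrative sketch only: the coefficients start at zero rather than at random values so that the run is reproducible, and the dataset, learning rate, and step count are assumptions.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a0 + a1*x by repeatedly stepping against the gradient of the MSE."""
    a0, a1 = 0.0, 0.0   # fixed start (an assumption) instead of random values
    n = len(xs)
    for _ in range(steps):
        errors = [(a0 + a1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1
        grad_a0 = 2.0 * sum(errors) / n
        grad_a1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

# Hypothetical observations lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a0, a1 = gradient_descent(xs, ys)
print(round(a0, 4), round(a1, 4))
```

If the learning rate is set too high for the data, the updates can overshoot and diverge, so it usually has to be tuned.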
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations.
The process of finding the best model out of various models is called optimization. It
can be achieved by below method:
1. R-squared method:
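R-squared is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares, so that values near 1 indicate a line that explains most of the variation in the target. The sketch below computes it for a made-up dataset; the function name and values are assumptions for illustration.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual values and two sets of predictions
ys = [3, 5, 7, 9]
exact = r_squared(ys, [3, 5, 7, 9])        # predictions match the data
baseline = r_squared(ys, [6, 6, 6, 6])     # always predicting the mean
print(exact, baseline)
```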
multicollinearity, it may be difficult to find the true relationship between the
predictors and the target variable. In other words, it is difficult to determine which
predictor variable is affecting the target variable and which is not. So, the model
assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is the situation in which the variance of the error term is the same
for all values of the independent variables. With homoscedasticity, there should be no
clear pattern in the distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients.
This can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. Any
correlation in the error terms will drastically reduce the accuracy of the model.
Autocorrelation usually occurs when there is a dependency between residual errors.