Machine Learning Notes
Syllabus:
Unit 1- Introduction to Machine Learning
Basics of Statistics, Introduction of Machine learning, Examples of Machine Learning Problems,
Learning versus Designing, Training versus Testing, Characteristics of Machine learning tasks,
Predictive and descriptive tasks, database and data processing for ML.
Features: Feature types, Feature Construction and Transformation, Feature Selection.
Unit 2- Flavors of Machine Learning
Definition of learning systems, Types: Supervised, Unsupervised, Semi Supervised, Reinforcement
learning with examples, Introduction to Deep Learning, Deep learning vs Machine Learning.
Unit 3- Classification and Regression
Classification: Binary Classification- Assessing Classification performance, Class probability
Estimation- Assessing class probability Estimates, Multiclass Classification.
Regression: Assessing performance of Regression- Error measures, Overfitting- Catalysts for
Overfitting, Case study of Polynomial Regression.
Theory of Generalization: Effective number of hypothesis, Bounding the Growth function, VC
Dimensions, Regularization theory.
Unit 4- Neural Networks
Introduction, Neural Network Elements, Basic Perceptron, Feed Forward Network, Back Propagation
Algorithm, Introduction to Artificial Neural Network.
Unit 5- Machine Learning Models
Linear Models: Least Squares method, Multivariate Linear Regression, Regularized Regression,
Using Least Square regression for Classification.
Logic Based and Algebraic Models: Distance Based Models: Neighbours and Examples, Nearest
Neighbours Classification,
Rule Based Models: Rule learning for subgroup discovery, Association rule mining,
Tree Based Models: Decision Trees
Probabilistic Models: Normal Distribution and Its Geometric Interpretations, Naïve Bayes Classifier,
Discriminative learning with Maximum likelihood
Unit 6- Applications of Machine Learning
Email Spam and Malware Filtering, Image recognition, Speech Recognition, Traffic Prediction, Self-
driving Cars, Virtual Personal Assistant, Medical Diagnosis.
Unit No 01 Introduction to Machine Learning
Before defining machine learning, we should first understand the meaning of the two words machine and learning; then we can understand what machine learning is.
Learning: the ability to improve behavior based on experience is called learning.
Machine: a mechanically, electrically, or electronically operated device for performing a task is a machine.
Machine Learning:
Machine learning explores algorithms that learn from data and build models, and those models are used for prediction, decision making, and solving tasks.
Definition: A computer program is said to learn from experience E (data) with respect to some class of tasks T (prediction, classification, etc.) and performance measure P if its performance on tasks in T, as measured by P, improves with experience E.
Machine learning is a subset of artificial intelligence which focuses mainly on machines learning from experience and making predictions based on that experience.
It enables computers or machines to make data-driven decisions rather than being explicitly programmed to carry out a certain task. These programs or algorithms are designed in such a way that they learn and improve over time as they are exposed to new data.
Features contain information about the target. More features do not necessarily mean more information: irrelevant and redundant features may lead to wrong conclusions, especially when the training set and the computational resources are limited. This leads to the curse of dimensionality. To reduce this curse, feature reduction is applied. There are two types of feature reduction: feature extraction and feature selection. Both methods are used to improve or maintain classification accuracy while simplifying classifier complexity.
Feature Selection in machine learning
It is the method of reducing the data dimension while doing predictive analysis. One major reason is that machine learning follows the rule of "garbage in, garbage out", and that is why one needs to be very careful about the data that is fed to the model.
We will discuss various kinds of feature selection techniques in machine learning and why they play
an important role in machine learning tasks.
Feature selection techniques simplify machine learning models in order to make them easier for researchers to interpret. They mainly eliminate the effects of the curse of dimensionality. Besides, they reduce the problem of overfitting by enhancing the generalization of the model. Thus feature selection helps in better understanding of the data, improves prediction performance, and reduces the computational time and space required to run the algorithm.
1. Filter Method
This method uses a variable-ranking technique to order and select the variables; the selection of features is independent of the classifier used. The ranking indicates how useful and important each feature is expected to be for classification. The method selects subsets of variables as a pre-processing step, independently of the chosen predictor: in filtering, the ranking method is applied before classification to filter out the less relevant features, and the feature selection task involves no induction algorithm. Some examples of filter methods are mentioned below:
Chi-Square Test: In general terms, this method is used to test the independence of two events. Given a dataset for two events, we can obtain the observed count and the expected count, and this test measures how much the two counts deviate from each other.
Variance Threshold: This approach to feature selection removes all features whose variance does not meet some threshold. Generally, it removes all zero-variance features, i.e., features that have the same value in all samples.
Information Gain: Information gain (IG) measures how much information a feature gives about the class. Thus, we can determine which attribute in a given set of training features is the most meaningful for discriminating between the classes to be learned.
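A minimal scikit-learn sketch of these filter techniques, assuming the Iris dataset as a stand-in for any labeled dataset with non-negative features (the chi-square test requires non-negative values); the threshold and k values are illustrative choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance falls below 0.2 (illustrative value).
X_vt = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square test: keep the 2 features most dependent on the class label.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Information gain (mutual information) score of each feature with respect to the class.
ig_scores = mutual_info_classif(X, y)

print(X_vt.shape, X_chi2.shape, np.round(ig_scores, 3))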
2. Wrapper Method
This method utilizes the learning machine of interest as a black box to score subsets of variables according to their predictive power. In supervised machine learning, the induction algorithm is presented with a set of training instances, where each instance is described by a vector of feature values and a class label. The induction algorithm, treated as a black box, is used to induce a classifier that is then used for classification. In the wrapper approach, the feature subset selection algorithm exists as a wrapper around the induction algorithm. One of the main drawbacks of this technique is the mass of computation required to obtain the feature subset. Some examples of wrapper methods are mentioned below:
Genetic Algorithms: This algorithm can be used to find a subset of features. CHCGA is a modified version of the genetic algorithm which converges faster and renders a more effective search by maintaining diversity and avoiding stagnation of the population.
Recursive Feature Elimination: RFE is a feature selection method which fits a model and removes the weakest feature until the specified number of features is reached. Here, the features are ranked by the model's coefficients or feature importance attributes.
Sequential Feature Selection: This naive algorithm starts with an empty set and, in the first step, adds the single feature that gives the highest value of the objective function. From the second step onwards, the remaining features are added one at a time to the current subset and the new subset is evaluated. The process is repeated until the required number of features has been added.
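A hedged sketch of the wrapper approach using scikit-learn's RFE, with logistic regression standing in for the black-box induction algorithm; the dataset and the choice of 10 features are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # scale features so the model converges

# The "black box" induction algorithm whose coefficients are used to rank features.
model = LogisticRegression(max_iter=5000)

# Recursive Feature Elimination: repeatedly drop the weakest feature
# (smallest absolute coefficient) until 10 features remain.
rfe = RFE(estimator=model, n_features_to_select=10)
rfe.fit(X, y)

print("Selected feature indices:", rfe.support_.nonzero()[0])
print("Feature ranking:", rfe.ranking_)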
3. Embedded Method
This method tries to combine the efficiency of both the previous methods: it performs the selection of variables during the training process and is usually specific to a given learning machine. It essentially learns which features contribute the most to the accuracy of the model.
Some examples of Embedded Methods are mentioned below:
L1 Regularization Technique such as LASSO: Least Absolute Shrinkage and Selection
Operator (LASSO) is a linear model which estimates sparse coefficients and is useful in some
contexts due to its tendency to prefer solutions with fewer parameter values.
Ridge Regression (L2 Regularization): L2 regularization is also known as Ridge Regression or Tikhonov regularization. It solves a regression model where the loss function is the linear least squares function and the regularization is given by the L2-norm of the coefficients.
Elastic Net: This linear regression model is trained with both L1 and L2 regularization, which allows it to learn a sparse model where few of the weights are non-zero (like Lasso) while maintaining the regularization properties of Ridge.
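A small sketch comparing the three embedded regularizers in scikit-learn; the diabetes dataset and the alpha values are illustrative, and the point is only to show how the penalties act on the coefficients.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 (LASSO): the penalty can drive coefficients exactly to zero, acting as feature selection.
lasso = Lasso(alpha=2.0).fit(X, y)
# L2 (Ridge): coefficients are shrunk towards zero but usually stay non-zero.
ridge = Ridge(alpha=2.0).fit(X, y)
# Elastic Net: a mix of both penalties; l1_ratio controls the balance.
enet = ElasticNet(alpha=2.0, l1_ratio=0.5).fit(X, y)

print("Lasso non-zero coefficients :", np.flatnonzero(lasso.coef_))
print("Ridge coefficients          :", np.round(ridge.coef_, 1))
print("Elastic Net non-zero coeffs :", np.flatnonzero(enet.coef_))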
As with any method, there are different ways to train machine learning algorithms, each with their
own advantages and disadvantages. To understand the pros and cons of each type of machine
learning, we must first look at what kind of data they ingest. In ML, there are two kinds of data —
labeled data and unlabeled data.
Labeled data has both the input and output parameters in a completely machine-readable pattern, but
requires a lot of human labor to label the data, to begin with. Unlabeled data only has one or none of
the parameters in a machine-readable form. This negates the need for human labor but requires more
complex solutions. There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
1. Supervised Machine Learning
Supervised learning is one of the most basic types of machine learning. In this type, the machine
learning algorithm is trained on labeled data. Even though the data needs to be labeled accurately for
this method to work, supervised learning is extremely powerful when used in the right circumstances.
In supervised learning, the ML algorithm is given a small training dataset to work with. This training
dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the
problem, solution, and data points to be dealt with. The training dataset is also very similar to the final
dataset in its characteristics and provides the algorithm with the labeled parameters required for the
problem.
The algorithm then finds relationships between the parameters given, essentially establishing a cause
and effect relationship between the variables in the dataset. At the end of the training,
the algorithm has an idea of how the data works and the relationship between the input and the output.
This solution is then deployed for use with the final dataset, which it learns from in the same way as
the training dataset. This means that supervised machine learning algorithms will continue to improve
even after being deployed, discovering new patterns and relationships as they train themselves on new data.
Supervised learning is commonly used in real world applications, such as face and speech
recognition, products or movie recommendations, and sales forecasting.
In supervised learning, learning data comes with description, labels, targets or desired outputs and
the objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data. The learned rule is then used to label new data with unknown outputs.
Supervised learning involves building a machine learning model that is based on labeled samples.
For example, if we build a system to estimate the price of a plot of land or a house based on various
features, such as size, location, and so on, we first need to create a database and label it. We need to
teach the algorithm what features correspond to what prices. Based on this data, the algorithm will
learn how to calculate the price of real estate using the values of the input features.
Supervised learning deals with learning a function from available training data. Here, a learning
algorithm analyzes the training data and produces a derived function that can be used for mapping
new examples.
Supervised learning can be further classified into two types - Regression and Classification.
Regression trains on and predicts a continuous-valued response, for example predicting real estate
prices. When output Y is discrete valued, it is classification and when Y is continuous, then it is
Regression. Classification attempts to find the appropriate class label, such as analyzing
positive/negative sentiment, male and female persons, benign and malignant tumors, secure and
unsecure loans etc.
a. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
o Logistic Regression (despite its name, mainly used for classification)
b. Classification
Classification algorithms are used when the output variable is categorical, which means the output falls into discrete classes such as Yes-No, Male-Female, True-False, etc.
o Decision Trees
o Random Forest
o Support vector Machines
o Neural network
o Naïve Bayes
Common examples of supervised learning include classifying e-mails into spam and not-spam
categories, labeling web pages based on their content, and voice recognition.
2. Unsupervised Machine Learning
Unsupervised machine learning holds the advantage of being able to work with unlabeled data. This
means that human labor is not required to make the dataset machine-readable, allowing much larger
datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of the relationship
between any two data points. However, unsupervised learning does not have labels to work off of,
resulting in the creation of hidden structures. Relationships between data points are perceived by the
algorithm in an abstract manner, with no input required from human beings.
The creation of these hidden structures is what makes unsupervised learning algorithms versatile.
Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data
by dynamically changing hidden structures. This offers more post-deployment development than
supervised learning algorithms.
Unsupervised learning is used to detect anomalies, outliers, such as fraud or defective equipment, or
to group customers with similar behaviors for a sales campaign. It is the opposite of supervised
learning. There is no labeled data here.
When learning data contains only some indications without any description or labels, it is up to the
coder or to the algorithm to find the structure of the underlying data, to discover hidden patterns, or
to determine how to describe the data. This kind of learning data is called unlabeled data.
Suppose that we have a number of data points, and we want to classify them into several groups. We
may not exactly know what the criteria of classification would be. So, an unsupervised learning
algorithm tries to classify the given dataset into a certain number of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for identifying
patterns and trends. They are most commonly used for clustering similar input into logical groups.
a. Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in the same group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
b. Association:
An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule mining is Market Basket Analysis. Some popular unsupervised learning algorithms are listed below:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
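A minimal clustering sketch using scikit-learn's K-means on synthetic unlabeled data; the number of clusters and the toy dataset are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three clouds of unlabeled points; only the inputs are given to the algorithm.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # cluster index assigned to each point
print(kmeans.cluster_centers_)       # learned exemplars (centroids)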
The main differences between supervised and unsupervised learning are summarized below:
1. Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
2. A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
3. A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
4. In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
5. The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in the unknown dataset.
6. Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
7. Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
8. A supervised learning model produces accurate results; an unsupervised learning model may give less accurate results in comparison.
9. Supervised learning is not close to true artificial intelligence, as we first train the model for each kind of data and only then can it predict the correct output; unsupervised learning is closer to true artificial intelligence, as it learns in the way a child learns daily routine things from experience.
Intuitively, one may imagine the three types of learning algorithms as Supervised learning where a
student is under the supervision of a teacher at both home and school, Unsupervised learning
where a student has to figure out a concept himself and Semi-Supervised learning where a teacher
teaches a few concepts in class and gives questions as homework which are based on similar
concepts.
4. Reinforcement Machine Learning
Reinforcement learning directly takes inspiration from how human beings learn from data in their
lives. It features an algorithm that improves upon itself and learns from new situations using a trial-
and-error method. Favorable outputs are encouraged or 'reinforced', and non-favorable outputs are discouraged or 'punished'.
Based on the psychological concept of conditioning, reinforcement learning works by putting the
algorithm in a work environment with an interpreter and a reward system. In every iteration of the
algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable
or not.
In case of the program finding the correct solution, the interpreter reinforces the solution by providing
a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to reiterate until it
finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.
In typical reinforcement learning use-cases, such as finding the shortest route between two points on a
map, the solution is not an absolute value. Instead, it takes on a score of effectiveness, expressed in a
percentage value. The higher this percentage value is, the more reward is given to the algorithm.
Thus, the program is trained to give the best possible solution for the best possible reward.
Here learning data gives feedback so that the system adjusts to dynamic conditions in order to achieve
a certain objective. The system evaluates its performance based on the feedback responses and reacts
accordingly. The best-known instances include self-driving cars and the Go-playing program AlphaGo. Reinforcement learning is also used in games where the outcome may be decided only at the end of the game.
There are two important learning models in reinforcement learning:
Markov Decision Process
Q learning
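A toy tabular Q-learning sketch; the 5-state corridor environment, reward scheme, and hyperparameters are hypothetical choices used only to illustrate the reward-driven, trial-and-error update rule.

import numpy as np

# A tiny 5-state corridor: the agent is rewarded for reaching state 4.
# Actions: 0 = move left, 1 = move right. Hypothetical environment for illustration only.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = int(rng.integers(4))            # start each episode in a random non-terminal state
    while s != 4:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0  # reward ("reinforcement") only at the goal
        # Q-learning update rule.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

# After training, moving right should end up with the higher value in states 0-3.
print(np.round(Q, 2))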
Deep Learning
Deep learning is a branch of machine learning that is completely based on artificial neural networks; since a neural network mimics the human brain, deep learning is also a kind of mimic of the human brain. In deep learning, we don't need to explicitly program everything. A formal definition of deep learning is:
Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones.
The human brain contains approximately 100 billion neurons, and each neuron is connected to thousands of its neighbours.
The question is how we can recreate these neurons in a computer. We create an artificial structure called an artificial neural network, made up of nodes or neurons: some neurons hold input values, some hold output values, and in between there may be many interconnected neurons in the hidden layers.
Information is passed between network layers through a weighted-sum function of the inputs. The major points to note are the tunable weight and bias parameters, represented by w and b respectively. These are essential to the actual "learning" process of a deep learning algorithm.
After the neural network passes its inputs all the way to its outputs, the network evaluates how good its prediction was (relative to the expected output) through something called a loss function. As an example, the Mean Squared Error loss function is
MSE = (1/n) * Σ (ŷ_i - y_i)^2
where ŷ (y hat) represents the prediction and y represents the expected output. A mean is used when batches of inputs and outputs are processed simultaneously (n represents the sample count).
The goal of the network is ultimately to minimize this loss by adjusting the weights and biases of the network. Using something called "back propagation" through gradient descent, the network backtracks through all its layers to update the weights and biases of every node in the opposite direction of the gradient of the loss function; in other words, every iteration of back propagation should result in a smaller loss than before. Without going into the proof, the continuous updates of the weights and biases ultimately turn the network into a precise function approximator, one that models the relationship between inputs and expected outputs. The "deep" part of deep learning refers to creating deep neural networks, that is, neural networks with a large number of layers; with the addition of more weights and biases, the neural network improves its ability to approximate more complex functions.
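A minimal NumPy sketch of feedforward, mean squared error, and gradient-descent back propagation; the toy target function, layer sizes, and learning rate are illustrative assumptions, not part of the original notes.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: learn y = x1 + x2 from 200 random points (illustrative target).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1]).reshape(-1, 1)

# One hidden layer of 8 sigmoid units, one linear output unit.
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for epoch in range(2000):
    # Forward pass: weighted sums plus biases, passed through the activation.
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)          # Mean Squared Error

    # Backward pass: gradients of the loss w.r.t. every weight and bias.
    d_out = 2 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1 - h)        # derivative of the sigmoid
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

    # Gradient descent step: move every parameter against its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", round(float(loss), 4))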
Deep Learning versus Machine Learning
1. Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned; deep learning structures algorithms in layers to create an "artificial neural network" that can learn and make intelligent decisions on its own.
2. Machine learning works on a small amount of data for accuracy; deep learning works on a large amount of data.
3. Machine learning can run on low-end machines; deep learning is heavily dependent on high-end machines.
4. Machine learning divides the task into sub-tasks, solves them individually, and finally combines the results; deep learning solves the problem end to end.
5. Machine learning takes less time to train; deep learning takes a longer time to train.
6. Machine learning trains on a CPU; deep learning trains on a GPU for proper training.
7. Machine learning testing time may increase; deep learning takes less time to test the data.
8. Machine learning is about computers being able to think and act with less human intervention; deep learning is about computers learning to think using structures modeled on the human brain.
9. Machine learning requires less computing power; deep learning typically needs less ongoing human intervention.
10. Machine learning can't easily analyze images, videos, and unstructured data; deep learning can analyze images, videos, and unstructured data easily.
11. Machine learning programs tend to be less complex than deep learning algorithms and can often run on conventional computers; deep learning systems require far more powerful hardware and resources.
12. Machine learning systems can be set up and operate quickly but may be limited in the power of their results; deep learning systems take more time to set up but can generate results instantaneously (although the quality is likely to improve over time as more data becomes available).
13. Machine learning tends to require structured data and uses traditional algorithms like linear regression; deep learning employs neural networks and is built to accommodate large volumes of unstructured data.
14. Machine learning is already in use in your email inbox, bank, and doctor's office; deep learning technology enables more complex and autonomous programs, like self-driving cars or robots that perform advanced surgery.
15. In machine learning the output is in numerical form, for classification and scoring applications; in deep learning the output can be in any form, including free-form elements such as free text and sound.
16. Machine learning has limited capability for hyperparameter tuning; deep learning can be tuned in various ways.
17. Machine learning requires less data than deep learning to function properly; deep learning requires much more data than a traditional machine learning algorithm due to its complex multilayer structure.
Assessing Classification Performance
1. Accuracy
Accuracy is the simple ratio of the number of correctly classified points to the total number of points.
Accuracy is simple to calculate but has its own disadvantages.
Limitations of accuracy
If the data set is highly imbalanced and the model classifies all the data points as the majority class, the accuracy will still be high. This makes accuracy an unreliable performance metric for imbalanced data.
Accuracy says nothing about the predicted probabilities of the model, so from accuracy alone we cannot measure how good the model's probability estimates are.
2. Confusion Matrix
A confusion matrix is a summary of predicted results in a specific table layout that allows visualization of the performance of a machine learning model for a binary classification problem (2 classes) or a multi-class classification problem (more than 2 classes).
3. Precision, Recall and F1 Score
Precision helps us understand how useful the results are; recall helps us understand how complete the results are. To balance the two, the F1 score is used. The F1 score is the harmonic mean of precision and recall and is given as
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.
The F-score has also been widely used in the natural language processing literature, for example in the evaluation of named entity recognition and word segmentation.
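A short sketch of these classification metrics using scikit-learn; the label and prediction vectors are hypothetical.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical ground-truth labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))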
4. Log Loss
Logarithmic loss (or log loss) measures the performance of a classification model whose prediction is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. It is a widely used metric in Kaggle competitions.
Log loss = -(1/N) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
Here N is the total number of data points in the data set, y_i is the actual value of y, and p_i is the predicted probability of y belonging to the positive class.
The lower the log-loss value, the better the predictions of the model.
5. ROC AUC
A Receiver Operating Characteristic (ROC) curve is created by plotting the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis at various threshold settings; in other words, the curve traces the fraction of true positives versus the fraction of false positives as the decision threshold varies.
The area under the ROC curve (ROC AUC) is a single-valued metric used for evaluating the performance.
The higher the AUC, the better the performance of the model in distinguishing between the classes.
In general, an AUC of 0.5 suggests no discrimination, a value between 0.5 and 0.7 is acceptable, and anything above 0.7 indicates a good model; for medical diagnosis models, however, an AUC of 0.95 or more is usually expected.
ROC curves are widely used to compare and evaluate different classification algorithms.
The ROC curve is widely used when the dataset is imbalanced.
ROC curves are also used in the verification of forecasts in meteorology.
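A small sketch computing log loss and ROC AUC with scikit-learn; the labels and predicted probabilities are hypothetical.

from sklearn.metrics import log_loss, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

print("Log loss:", round(log_loss(y_true, y_prob), 4))
print("ROC AUC :", round(roc_auc_score(y_true, y_prob), 4))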
Class Probability Estimation
For classification problems in machine learning, we often want to know how likely it is that an instance belongs to a class, rather than simply which class it will belong to. So in many cases we would like to use the estimated class probability for decision making.
For a variety of applications, machine learning algorithms are required to construct models that
minimize the total loss associated with the decisions, rather than the number of errors. One of the
most efficient approaches to building models that are sensitive to non-uniform costs of errors is to
first estimate the class probabilities of the unseen instances and then to make the decision based on
both the computed probabilities and the loss function.
Example: Consider a scenario where we have to detect credit fraud. The manager of the fraud control department wants to know not only which accounts are likely to be fraudulent but also the cases where the credit risk is highest, i.e., accounts where the company's expected monetary loss is greatest. Here, we must know the class probability of fraud for each particular case.
Roughly, we would like:
(i) The probability estimates to be well calibrated, meaning that if you take 100 cases whose class membership probability is estimated to be 0.2, then about 20 of them will actually belong to the class.
(ii) The probability estimates to be discriminative, meaning that they should give different probability estimates for different examples. If a class probability of 0.5 merely reflects that 50% of the population is fraudulent (the base rate), we need discrimination to obtain higher or lower class-probability estimates for individual cases.
Regression
Regression analysis consists of a set of machine learning methods that allow us to predict a
continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of regression model is to build a mathematical equation that defines y as a function
of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new
values of the predictor variables (x).
Assessing performance of Regression
Model evaluation is very important in data science. It helps you to understand the performance of your
model and makes it easy to present your model to other people. There are many different evaluation
metrics out there but only some of them are suitable to be used for regression.
There are 3 main metrics for model evaluation in regression:
1. R Square/Adjusted R Square
2. Mean Square Error (MSE)/Root Mean Square Error (RMSE)
3. Mean Absolute Error (MAE)
1. R Square/Adjusted R Square
R Square is calculated as one minus the ratio of the sum of squared prediction errors to the total sum of squares, where the total sum of squares uses the mean in place of the model's predictions. R Square lies between 0 and 1, and a bigger value indicates a better fit between the predictions and the actual values.
R Square is a good measure of how well the model fits the dependent variable. However, it does not take the problem of overfitting into consideration. If your regression model has many independent variables, the model may be too complicated: it can fit the training data very well but perform badly on the testing data. That is why Adjusted R Square is introduced: it penalizes additional independent variables added to the model and adjusts the metric to prevent overfitting issues.
2. Mean Square Error (MSE)/Root Mean square Error (RMSE)
While R Square is a relative measure of how well the model fits the dependent variable, Mean Square Error is an absolute measure of the goodness of fit.
MSE is calculated as the sum of the squared prediction errors (actual output minus predicted output) divided by the number of data points. It gives an absolute number indicating how much the predicted results deviate from the actual values. You cannot interpret many insights from a single result, but it gives you a real number to compare against other model results and helps you select the best regression model.
Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than MSE because, firstly, the MSE value can be too large to compare easily, and secondly, MSE is calculated from squared errors, so taking the square root brings the metric back to the same scale as the prediction error and makes it easier to interpret.
3. Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is similar to Mean Square Error (MSE); however, instead of summing the squares of the errors as in MSE, MAE sums the absolute values of the errors.
Compared to MSE or RMSE, MAE is a more direct representation of the error terms: MSE penalizes large prediction errors more heavily by squaring them, while MAE treats all errors equally. R Square/Adjusted R Square is better for explaining the model to other people, because the number can be read as the percentage of output variability explained. MSE, RMSE, or MAE are better used to compare performance between different regression models.
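A brief sketch of the three regression metrics using scikit-learn; the actual and predicted values are hypothetical.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical actual and predicted values from a regression model.
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.5])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 12.0])

mse = mean_squared_error(y_true, y_pred)
print("R Square:", round(r2_score(y_true, y_pred), 4))
print("MSE     :", round(mse, 4))
print("RMSE    :", round(float(np.sqrt(mse)), 4))
print("MAE     :", round(mean_absolute_error(y_true, y_pred), 4))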
Overfitting
A hypothesis h is said to overfit the training data if there is another hypothesis h' such that h' has higher error than h on the training data but lower error than h on the test data.
Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and
is subject to overfitting training data. This problem can be addressed by pruning a tree after it has
learned in order to remove some of the detail it has picked up
A solution to avoid overfitting is using a linear algorithm if we have linear data or using the
parameters like the maximal depth if we are using decision trees.
In a nutshell, Overfitting – High variance and low bias
Techniques to reduce the overfitting
1. Increase training data.
2. Reduce model complexity
3. Early stopping during the training phase (keep an eye on the loss during training; as soon as the loss begins to increase, stop training).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
Case study of Polynomial Regression
Polynomial regression is a special case of linear regression; the main idea lies in how you construct your features. Looking at multivariate regression with two variables x1 and x2, linear regression will look like this:
y = a1 * x1 + a2 * x2
Now suppose you want a polynomial regression (let's make it a degree-2 polynomial). We create a few additional features, x1*x2, x1^2 and x2^2, and obtain the 'linear regression':
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
A polynomial term: a quadratic (squared) or cubic (cubed) term turns a linear regression model into a
curve. But because it is the data X that is squared or cubed, not the Beta coefficient, it still qualifies
as a linear model. This makes it a nice, straightforward way to model curves without having to model
complicated nonlinear models. One common pattern within machine learning is to use linear models
trained on nonlinear functions of the data. This approach maintains the generally fast performance of
linear methods, while allowing them to fit a much wider range of data.
For example, a simple linear regression can be extended by constructing polynomial features from the
coefficients. In the standard linear regression case, you might have a model that looks like this for
two-dimensional data:
ŷ(w, x) = w0 + w1*x1 + w2*x2
If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-
order polynomials, so that the model looks like this:
ŷ(w, x) = w0 + w1*x1 + w2*x2 + w3*x1*x2 + w4*x1^2 + w5*x2^2
The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating
a new variable
z = [x1, x2, x1*x2, x1^2, x2^2]
With this re-labeling of the data, our problem can be written
ŷ(w, z) = w0 + w1*z1 + w2*z2 + w3*z3 + w4*z4 + w5*z5
We see that the resulting polynomial regression is in the same class of linear models we considered above (i.e. the model is linear in w) and can be solved by the same techniques.
By considering, linear fits within a higher-dimensional space built with these basis functions, the
model has the flexibility to fit a much broader range of data.
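A hedged scikit-learn sketch of this idea: PolynomialFeatures expands the inputs into the basis described above and an ordinary linear regression is fitted on the expanded features. The synthetic paraboloid data and the degree are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 2))
# Hypothetical curved target: a paraboloid plus a little noise.
y = 1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 0]*X[:, 1] + X[:, 0]**2 + rng.normal(0, 0.1, 100)

# degree=2 expands [x1, x2] into [x1, x2, x1^2, x1*x2, x2^2];
# the regression itself remains linear in the weights w.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)
print(model.named_steps["linearregression"].coef_.round(2))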
Theory of Generalization
Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model. When we train a machine learning model, we don't just want it to learn to model the training data; we want it to generalize to data it hasn't seen before. Fortunately, there's a very convenient way to measure an algorithm's generalization performance: we measure its performance on a held-out test set, consisting of examples it hasn't seen before. If an algorithm works well on the training set but fails to generalize, we say it is overfitting.
There's an easy way to measure a network's generalization performance. We simply partition our data into three subsets:
A training set, a set of training examples the network is trained on.
A validation set, which is used to tune hyperparameters such as the number of hidden units, or the
learning rate
A test set, which is used to measure the generalization performance
The losses on these subsets are called the training, validation, and test loss, respectively. Hopefully it's clear why we need separate training and test sets: if we train on the test data, we have no idea whether the network is correctly generalizing or whether it is simply memorizing the training data.
Effective number of Hypothesis
A hypothesis is an explanation for something.
It is a provisional idea, an educated guess that requires some evaluation.
A good hypothesis is testable; it can be either true or false.
In science, a hypothesis must be falsifiable, meaning that there exists a test whose outcome could
mean that the hypothesis is not true. The hypothesis must also be framed before the outcome of the
test is known.
A good hypothesis fits the evidence and can be used to make predictions about new observations or
new situations. The hypothesis that best fits the evidence and can be used to make predictions is
called a theory, or is part of a theory.
Hypothesis in Machine Learning
An example of a model that approximates the target function and performs mappings of inputs to
outputs is called a hypothesis in machine learning. The choice of algorithm (e.g. neural network) and
the configuration of the algorithm (e.g. network topology and hyperparameters) define the space of
possible hypotheses that the model may represent.
A common notation is used where lowercase h represents a given specific hypothesis and uppercase H represents the hypothesis space that is being searched.
h (hypothesis): A single hypothesis, e.g. an instance or specific candidate model that maps inputs to
outputs and can be evaluated and used to make predictions.
H (hypothesis set): A space of possible hypotheses for mapping inputs to outputs that can be searched,
often constrained by the choice of the framing of the problem, the choice of model and the choice of
model configuration
The choice of algorithm and algorithm configuration involves choosing a hypothesis space that is
believed to contain a hypothesis that is a good or best approximation for the target function.
A hypothesis in machine learning:
1. Covers the available evidence: the training dataset.
2. Is falsifiable (kind of): a test harness is devised beforehand and used to estimate performance and compare it to a baseline model to see if it is skillful or not.
3. Can be used in new situations: make predictions on new data.
Bounding the Growth Function
Growth Function
Rademacher complexity can be bounded in terms of the growth function. For any hypothesis h ∈ H and a sample S = {x1, ..., xm} ⊆ X, we denote hS = (h(x1), ..., h(xm)) ∈ Y^m.
Dichotomy: Given a hypothesis set H, a dichotomy of a set S is one of the possible ways of labeling the points of S using a hypothesis in H.
Growth Function: For a hypothesis set H, the growth function ΠH : N → N is defined as ΠH(m) = max over {x1, ..., xm} ⊆ X of |{(h(x1), ..., h(xm)) : h ∈ H}|, i.e., the maximum number of distinct dichotomies that H can generate on any m points.
Let α be the count of rows in the S1 group. We also divide the group S2 into S2+, where xN is a '+', and S2-, where xN is a '-', and each of them has β rows. This means that:
B(N, k) = α + 2β. (0)
Our purpose in the following steps is to find a recursive bound on B(N, k) (a bound defined by B on different values of N and k).
For this purpose, we start by estimating α + β, which is the number of rows in the table without the point xN and without the group S2-. The result is a sub-table in which all rows are different, since the rows in S1 are inherently different without xN, and the rows in S2+ are different from the ones in S1: if that were not the case, the duplicate version of that row in S1 would get its 'uniqueness' from xN, forcing it to leave S1 and join S2 (just as in the simple case example).
Furthermore, since in the bigger table (N points) there are no k points that have all possible
combinations, it is impossible to find all possible combinations in the smaller table (N-1 points). This
implies that k is a break point for the smaller table too.
This gives us: α + β ≤ B(N-1, k). (2)
The next step is to estimate β by studying the group S2 alone, without the point xN: because the rows in S2+ differ from the ones in S2- only thanks to xN, when we remove xN, S2+ becomes identical to S2-.
Regularization Theory
It should be noted that the given polynomial function is a non-linear function of x but a linear function of w. We train our data on this function to determine the values of w that minimize the error in predicting the target values.
The error function used in this case is the mean squared error, E(w) = (1/2) * Σ_n (y(x_n, w) - t_n)^2, where t_n is the target value for input x_n.
In order to minimize the error, calculus is used: the derivative of E(w) is equated with 0 to get the value of w that results in the minimum value of the error function. E(w) is a quadratic function of w, so its derivative is linear in w and hence yields a single solution, denoted by w*.
So we can obtain the optimal value of w, but the question remains: what degree of polynomial should we choose? Polynomials of any degree can be fitted to the training data, but how do we decide the best choice with minimum complexity?
Moreover, as the Taylor expansion of the sine function shows, a sufficiently high-order polynomial can approximate the function arbitrarily well.
A very high-degree polynomial, however, trains itself to reproduce the target values of all the noise-contaminated data points and thus fails to capture the correct underlying pattern. Such a function may give zero error on the training set but will give huge errors when predicting the target values of the test dataset.
To avoid this condition, regularization is used. Regularization is a technique for tuning the function by adding an additional penalty term to the error function. The additional term controls the excessively fluctuating function so that the coefficients don't take extreme values. This technique of keeping a check on, or reducing, the values of the coefficients is called shrinkage; in the case of neural networks it is known as weight decay.
Overfitting can also be controlled by increasing the size of training dataset.
Unit No 04 Neural Network
Introduction to Neural Network
The neuron is the basic unit of a neural network.
A neuron takes inputs, does some math with them, and produces one output. In a 2-input neuron, each input is first multiplied by a weight; next, all the weighted inputs are added together with a bias b, and finally the sum is passed through an activation function.
The activation function is used to turn an unbounded input into an output that has a nice, predictable
form. A commonly used activation function is the sigmoid function.
The neuron outputs 0.999 given the inputs x = [2, 3] (see the worked sketch below). That's it! This process of passing inputs forward to get an output is known as feed forward.
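A minimal sketch of such a 2-input neuron in Python; the weights w = [0, 1] and bias b = 4 are assumed values, chosen so that the output matches the 0.999 quoted above.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weighted sum of the inputs plus the bias, passed through the activation.
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

# Assumed parameters: w = [0, 1], b = 4.
neuron = Neuron(np.array([0, 1]), 4)
print(neuron.feedforward(np.array([2, 3])))   # 0*2 + 1*3 + 4 = 7, sigmoid(7) ≈ 0.999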
Combining Neurons into a Neural Network
A neural network is nothing more than a bunch of neurons connected together, with the outputs of some neurons feeding into the inputs of others.
5. Optimizing weights
When a Neural Network is initialized, its weights are randomly assigned. The power of the neural
network comes from its access to a huge amount of control over the data, through the adjusting of these
weights. The network iteratively adjusts weights and measures performance, continuing this procedure
until the predictions are sufficiently accurate or another stopping criterion is reached.
The accuracy of our predictions is determined by a loss function. Also known as a cost function, this
function will compare the model output with the actual outputs and determine how bad our model is in
estimating our dataset. Essentially we provide the model a function that it aims to minimize and it does
this through the incremental tweaking of weights.
A common metric for a loss function is Mean Absolute Error, MAE. This measures the sum of the
absolute vertical differences between the estimates and their actual values.
The job of finding the best set of weights is conducted by the optimiser. In neural networks, the
optimization method used is stochastic gradient descent.
Every time period, or epoch, the stochastic gradient descent algorithm will repeat a certain set of steps
in order to find the best weights.
1. Start with some initial value for the weights
2. Keep updating weights that we know will reduce the cost function
3. Stop when we have reached the minimum error on our dataset
6. Overfitting and underfitting
Overfitting and Underfitting are two of the most important concepts of machine learning, because they
can help give you an idea of whether your ML algorithm is capable of its true purpose, being unleashed
to the world and encountering new unseen data.
Mathematically, overfitting is defined as the situation where the accuracy on your training data is
greater than the accuracy on your testing data. Underfitting is generally defined as poor performance on
both the training and testing side.
Basic Perceptron
A perceptron computes its output in simple steps: multiply each input by its weight, add all the multiplied values together to form the weighted sum, and apply the activation function to the weighted sum to produce the output.
Figure 4.7 Adding with Summation
Unit No 05 Machine Learning Models
Using Least Squares Regression for Classification
For classification accuracy, we use the Minimum Correct Classification Rate (MCCR). MCCR is defined as the minimum of CCR1 and CCR2, where CCRn is the ratio of correctly classified test points in class n to the total number of test points in class n. The MCCR for the linear data set is zero using a polynomial of order 3. In the accompanying figure, both images show the classification decision boundary obtained from least squares regression (in purple). The decision boundary is good until some outlier data points are added to the blue class, as in the right-hand image: the resulting classifier penalizes these outliers even though they are 'too correct' data points. The green curve shows the decision boundary obtained by logistic regression; its advantage is that it is robust to outliers and does not penalize the 'too correct' data points.
1. Distance Based Models
Distance-based models are the second class of Geometric models. Like Linear models, distance-based
models are based on the geometry of data. As the name implies, distance-based models work on the
concept of distance. In the context of Machine learning, the concept of distance is not based on
merely the physical distance between two points. Instead, we could think of the distance between two
points considering the mode of transport between them. Travelling between two cities by plane covers less physical distance than by train because the plane's route is unrestricted. Similarly, in chess,
the concept of distance depends on the piece used – for example, a Bishop can move
diagonally. Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently. The distance metrics commonly used are Euclidean, Minkowski, Manhattan,
and Mahalanobis.
Figure5.2 Distance Metrics
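A small sketch of the first three distance metrics in plain NumPy (Mahalanobis is omitted because it additionally needs a covariance matrix); the two sample points are arbitrary.

import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=3))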
Distance is applied through the concept of neighbors and exemplars. Neighbors are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids that find
the most centrally located data point. The most commonly used centroid is the arithmetic mean, which
minimizes squared Euclidean distance to all other points.
Notes:
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of
all the points in the figure from the centroid point. This definition extends to any object in n-
dimensional space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids. Medoids are most commonly used on data
when a mean or centroid cannot be defined. They are used in contexts where the centroid is not
representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbor models, which use the training data
as exemplars – for example, in classification. The K-means clustering algorithm also uses exemplars
to create clusters of similar data points.
2. Nearest Neighbors Classification
The principle behind nearest neighbor methods is to find a predefined number of training samples
closest in distance to the new point, and predict the label from these. The number of samples can be a
user-defined constant (k-nearest neighbor learning), or vary based on the local density of points
(radius-based neighbor learning). The distance can, in general, be any metric measure: standard
Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data.
Despite its simplicity, nearest neighbors has been successful in a large number of classification and
regression problems, including handwritten digits and satellite image scenes. Being a non-parametric
method, it is often successful in classification situations where the decision boundary is very irregular.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it
does not attempt to construct a general internal model, but simply stores instances of the training data.
Classification is computed from a simple majority vote of the nearest neighbors of each point: a query
point is assigned the data class which has the most representatives within the nearest neighbors of the
point. Two different nearest neighbors classifiers are available: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user, while RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called "curse of dimensionality".
The basic nearest neighbors classification uses uniform weights: the value assigned to a query point is computed from a simple majority vote of its nearest neighbors. Under some circumstances, it is better to weight the neighbors so that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword: the default value weights='uniform' assigns equal weight to each neighbor, while weights='distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.
Figure 5.3 Left: 3-class classification with weights='uniform'; right: the same problem with weights='distance'.
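A minimal scikit-learn sketch of k-nearest-neighbour classification; the Iris dataset, k = 5, and the uniform weighting are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbours with uniform voting; weights="distance" would let closer
# neighbours count for more.
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform")
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))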
Association Rule Mining
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Before we start defining a rule, let us first look at the basic definitions of support and confidence, sketched in the example below.
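A hedged sketch of the two basic measures, support and confidence, computed on a hypothetical set of market-basket transactions.

# Hypothetical market-basket transactions, purely for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print("support({bread, butter}) =", support({"bread", "butter"}))
print("confidence(bread -> butter) =", confidence({"bread"}, {"butter"}))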
Tree Based Models: Decision Trees
Advantages:
Straightforward interpretation
Good at handling complex, non-linear relationships
Disadvantages:
Predictions tend to be weak, as singular decision tree models are prone to overfitting
Unstable, as a slight change in the input dataset can greatly impact the final results
Applications of Decision Tree Machine Learning Algorithm
1. Decision trees are among the popular machine learning algorithms that find great use in
finance for option pricing.
2. Remote sensing is an application area for pattern recognition based on decision trees.
3. Decision tree algorithms are used by banks to classify loan applicants by their probability of
defaulting payments.
4. Gerber Products, a popular baby product company, used decision tree machine learning
algorithm to decide whether they should continue using the plastic PVC (Poly Vinyl
Chloride) in their products.
5. Rush University Medical Centre has developed a tool named Guardian that uses a decision
tree machine learning algorithm to identify at-risk patients and disease trends.
Probabilistic Models
A probabilistic method or model is based on the theory of probability or the fact that randomness
plays a role in predicting future events.
Probabilistic models incorporate random variables and probability distributions into the model of an
event or phenomenon. While a deterministic model gives a single possible outcome for an event, a
probabilistic model gives a probability distribution as a solution. These models take into account the
fact that we can rarely know everything about a situation. There's nearly always an element of randomness to take into account. For example, life insurance is based on the fact that we know with certainty that we will die, but we don't know when. These models can be part deterministic and part random, or wholly random.
Normal Distribution and Its Geometric Interpretations
Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A
Data Scientist needs to know about Normal Distribution when they work with Linear Models (perform
well if the data is normally distributed).
As discovered by Carl Friedrich Gauss, Normal Distribution/Gaussian Distribution is a continuous
probability distribution. It has a bell-shaped curve that is symmetrical from the mean point to both
halves of the curve.
The following formula gives the PDF (Probability Density Function) of the normal distribution:
f(x) = (1 / (σ√(2π))) * exp(−(x − μ)² / (2σ²))
where μ (mu) is the mean of the normal distribution and σ (sigma) is its standard deviation.
The PDF gives the "relative likelihood of a continuous random variable taking that value". For the normal
distribution, the PDF is the familiar bell-shaped curve.
The CDF (Cumulative Distribution Function) is the integral of the PDF. For the normal
distribution it is written as Φ(z), which is the probability that a normally distributed random
variable takes a value less than z.
Figure 5.4 a) Normal Probability Density function b) Normal Cumulative Distribution
function
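A small sketch, assuming SciPy, that evaluates the PDF f(x) and the CDF Φ(z) of a normal distribution for a chosen μ and σ:

from scipy.stats import norm

mu, sigma = 0.0, 1.0                        # mean and standard deviation of the distribution

print(norm.pdf(0.0, loc=mu, scale=sigma))   # relative likelihood at x = 0, the peak of the bell curve (~0.3989)
print(norm.cdf(1.96, loc=mu, scale=sigma))  # P(X < 1.96) ~ 0.975
print(norm.cdf(mu, loc=mu, scale=sigma))    # P(X < mu) = 0.5 by symmetry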
Using Bayes' theorem, we can find the probability of A happening given that B has occurred:
P(A|B) = [P(B|A) * P(A)] / P(B)
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent, that is, the presence of one particular feature does not affect the
others. Hence it is called naive.
P(A|B) - Posterior probability of class A (target) given predictor B (attributes)
P(B|A) - Likelihood, the probability of the predictor given the class
P(A) - Class prior probability
P(B) - Predictor prior probability
Algorithm
1. Convert the data set into a frequency table
2. Create Likelihood table by finding the probabilities
3. Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class
with the highest posterior probability is the outcome of prediction.
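A minimal pure-Python sketch of these three steps on a tiny hypothetical weather/play dataset; the counters stand in for the frequency and likelihood tables:

from collections import Counter, defaultdict

# hypothetical training data: (weather, play?)
data = [('Sunny', 'No'), ('Sunny', 'No'), ('Overcast', 'Yes'), ('Rainy', 'Yes'),
        ('Rainy', 'Yes'), ('Rainy', 'No'), ('Overcast', 'Yes'), ('Sunny', 'Yes'),
        ('Sunny', 'Yes'), ('Rainy', 'Yes'), ('Sunny', 'Yes'), ('Overcast', 'Yes'),
        ('Overcast', 'Yes'), ('Rainy', 'No')]

# Step 1: frequency table of (class, feature value)
class_counts = Counter(c for _, c in data)
freq = defaultdict(Counter)
for value, cls in data:
    freq[cls][value] += 1

# Steps 2 and 3: likelihood P(value|class), prior P(class), posterior up to the constant P(value)
def posterior(value, cls):
    likelihood = freq[cls][value] / class_counts[cls]
    prior = class_counts[cls] / len(data)
    return likelihood * prior            # P(value) cancels when comparing classes

print(max(class_counts, key=lambda c: posterior('Sunny', c)))  # predicted class for 'Sunny'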
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems,
etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the
predictors be independent. In most real-life cases the predictors are dependent, and this hinders
the performance of the classifier.
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to
other models like logistic regression, and you need less training data.
It performs well in the case of categorical input variables compared to numerical variable(s). For
numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).
Cons:
If a categorical variable has a category (in the test data set) which was not observed in the training
data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction.
This is often known as "Zero Frequency". To solve this, we can use a smoothing technique.
One of the simplest smoothing techniques is called Laplace estimation.
On the other hand, naive Bayes is also known as a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible that we get a set of predictors which are completely independent.
When to use the Naive Bayes Classifier algorithm?
1. If you have a moderate or large training data set.
2. If the instances have several attributes.
3. Given the classification parameter, attributes that describe the instances should be
conditionally independent.
Unit No 06 Applications of Machine Learning
Machine learning can be used for many different applications. Some of these applications are explained
here.
1) Email Spam and Malware Filtering
Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge
task of filtering out the spam and making sure their users receive the messages that matter.
Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria
change over time. From various efforts to automate spam detection, machine learning has so far
proven to be the most effective and the favored approach by email providers. Although we still see
spammy emails, a quick look at the junk folder will show how much spam gets weeded out of our
inboxes every day thanks to machine learning algorithms.
Machine learning algorithms use statistical models to classify data. In the case of spam detection, a
trained machine learning model must be able to determine whether the sequence of words found in an
email is closer to those found in spam emails or in safe ones.
Different machine learning algorithms can detect spam, but one that has gained appeal is the "naïve
Bayes" algorithm. As the name implies, naïve Bayes is based on "Bayes' theorem," which describes
the probability of an event based on prior knowledge.
In the case of spam detection, things get a bit more complicated. Our target variable is whether a
given email is "spam" or "not spam" (also called "ham"). The features are the words or word
combinations found in the email's body. In a nutshell, we want to calculate the probability
that an email message is spam based on its text.
The catch here is that our features are not necessarily independent. For instance, consider the terms
"grilled," "cheese," and "sandwich." They can have separate meanings depending on whether they
appear successively or in different parts of the message. Another example is the words "not" and
"interesting." In this case, the meaning can be completely different depending on where they appear in
the message. But even though feature independence is complicated in text data, the naïve Bayes
classifier has proven to be efficient in natural language processing tasks if you configure it properly.
Spam detection is a supervised machine learning problem. This means you must provide your
machine learning model with a set of examples of spam and ham messages and let it find the relevant
patterns that separate the two different categories.
Most email providers have their own vast data sets of labeled emails. For instance, every time you
flag an email as spam in your Gmail account, you're providing Google with training data for its
machine learning algorithms.
Therefore, one of the key steps in developing a spam-detector machine learning model is preparing
the data for statistical processing. Before training your naïve Bayes classifier, the corpus of spam and
ham emails must go through certain steps.
We can remove words that appear in both spam and ham emails and don't help in telling the
difference between the two classes. These are called "stop words" and include terms such
as the, for, is, to, and some. We can also use other techniques such as "stemming" and
"lemmatization," which transform words to their base forms. Stemming and lemmatization can help
further simplify our machine learning model.
When you train your machine learning model on the training data set, each term is assigned a weight
based on how many times it appears in spam and ham emails. For instance, if "win big money prize"
is one of your features and only appears in spam emails, then it will be given a larger probability of
being spam. If "important meeting" is only mentioned in ham emails, then its inclusion in an email
will increase the probability of that email being classified as not spam.
Once you have processed the data and assigned the weights to the features, your machine learning
model is ready to filter spam. When a new email comes in, the text is tokenized and run against the
Bayes formula. Each term in the message body is multiplied by its weight, and the sum of the weights
determines the probability that the email is spam. (In reality, the calculation is a bit more complicated,
but to keep things simple, we'll stick to the sum of weights.)
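A minimal sketch of this pipeline, assuming scikit-learn; the tiny in-line corpus only stands in for a real labelled email dataset. CountVectorizer handles tokenization and stop-word removal, and MultinomialNB learns the per-term weights:

from sklearn.feature_extraction.text import CountVectorectorizer if False else None  # (placeholder comment removed below)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny illustrative corpus: 1 = spam, 0 = ham
emails = ["win big money prize now", "claim your free prize today",
          "important meeting tomorrow at noon", "please review the attached report"]
labels = [1, 1, 0, 0]

# tokenize, drop English stop words, then learn term weights with naive Bayes
model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free money prize"]))             # expected: spam (1)
print(model.predict_proba(["meeting report review"]))  # class probabilities for a ham-like message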
Simple as it sounds, the naïve Bayes machine learning algorithm has proven to be effective for many
text classification tasks, including spam detection. Like other machine learning algorithms, naïve
Bayes does not understand the context of language and relies on statistical relations between words to
determine whether a piece of text belongs to a certain class. This means that, for instance, a naïve
Bayes spam detector can be fooled into overlooking a spam email if the sender just adds some non-
spam words at the end of the message or replaces spammy terms with other closely related words.
Naïve Bayes is not the only machine learning algorithm that can detect spam. Other popular
algorithms include recurrent neural networks (RNN) and transformers, which are efficient at
processing sequential data like email and text messages.
2) Image Recognition by Machine Learning
Image recognition refers to technologies that identify places, logos, people, objects, buildings, and
several other variables in images. Users are sharing vast amounts of data through apps, social
networks, and websites. Additionally, mobile phones equipped with cameras are leading to the
creation of limitless digital images and videos. The large volume of digital data is being used by
companies to deliver better and smarter services to the people accessing it.
Image recognition is a part of computer vision and a process to identify and detect an object or
attribute in a digital video or image. Computer vision is a broader term which includes methods of
gathering, processing and analyzing data from the real world. The data is high-dimensional and
produces numerical or symbolic information in the form of decisions.
The major steps in the image recognition process are gathering and organizing data, building a predictive
model, and using the model to recognize images. The human eye perceives an image as a set of signals which are
processed by the visual cortex in the brain. This results in a vivid experience of a scene, associated
with concepts and objects recorded in one's memory. Image recognition tries to mimic this process.
Computer perceives an image as either a raster or a vector image. Raster images are a sequence of
pixels with discrete numerical values for colors while vector images are a set of color-annotated
polygons. To analyze images, the geometric encoding is transformed into constructs depicting physical
features and objects. These constructs can then be logically analyzed by the computer. Organizing
data involves classification and feature extraction. The first step in image classification is to simplify
the image by extracting important information and leaving out the rest. For example, in the below
image, if you want to extract the cat from the background you will notice a significant variation in RGB
pixel values.
o By calculating the Euclidean distance we got the nearest neighbors: three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
How to select the value of K in KNN algorithm?
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K are generally good, but they may cause some difficulties.
o Small value of K
1. Captures fine structure of the problem space better
2. May be necessary for small training set
o Large value of K
1. Less sensitive to noise (particularly class noise)
2. Better probability estimates for discrete class
3. Larger training set allows you to use large value of K
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the KNN algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
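A hedged sketch of these steps with scikit-learn; since the notes do not name a specific dataset, the built-in Iris data stands in for the "Reading the dataset" and "Dropping unwanted columns" steps:

# 1. Importing necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# 2-3. Reading the dataset (Iris used as a stand-in; no columns need dropping here)
X, y = load_iris(return_X_y=True)

# 4. Preprocessing: train/test split and feature scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5. Fitting the KNN algorithm to the training set (K = 5, the commonly preferred value)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# 6-7. Predicting the test results and inspecting them
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))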
Advantages of KNN algorithm
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN algorithm
o It always needs a value of K to be determined, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data points for
all the training samples.
4. Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes
are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into
subtrees.
Below diagram explains the general structure of a decision tree:
1. Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
3. Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
4. Branch/Sub Tree: A tree formed by splitting the tree.
5. Pruning: Pruning is the process of removing the unwanted branches from the tree.
6. Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the decision tree algorithm work?
In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node
of the tree. It compares the value of the root attribute with the corresponding attribute of the record
(real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step -
3. Continue this process until a stage is reached where you cannot further classify the nodes;
the final nodes are then called leaf nodes.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the Decision Tree algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
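A hedged sketch of the same steps for a decision tree, again using the built-in Iris data as a stand-in dataset and printing the learned rules instead of plotting them:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# CART classifier with the Gini impurity as the attribute selection measure
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0).fit(X_train, y_train)

print(accuracy_score(y_test, tree.predict(X_test)))
print(export_text(tree))   # textual view of the root node, splits and leaf nodes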
5. Random Forest
To better understand the Random Forest algorithm, you should first have knowledge of the Decision
Tree algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest algorithm?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by combining N decision trees,
and the second is to make predictions using each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts the final decision.
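A hedged sketch of the two phases with scikit-learn: N trees are set via n_estimators, each tree is trained on a random bootstrap subset, and the forest returns the majority vote. The built-in Iris data stands in for the fruit-image dataset in the example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Phase 1: build N = 100 decision trees on random bootstrap subsets of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Phase 2: each tree votes and the majority class is returned for new data points
print(accuracy_score(y_test, forest.predict(X_test)))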
6. Support Vector Machine (SVM)
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means if a
dataset cannot be classified by using a straight line, then such data is termed non-linear
data, and the classifier used is called a Non-linear SVM classifier.
As it is a 2-D space, by just using a straight line we can easily separate these two classes. But there
can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the hyperplane
is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the
maximum margin is called the optimal hyperplane.
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are now in a 3-D space (obtained by adding a third dimension, for example z = x² + y²), the
decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with
z = 1, it becomes a circle of radius 1 that separates the non-linear data.
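A hedged sketch contrasting the two types with scikit-learn: a linear kernel for linearly separable data, and an RBF kernel for data like the circular example above (make_circles generates a comparable non-linear dataset):

from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# linearly separable data -> Linear SVM
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
linear_svm = SVC(kernel='linear').fit(X_lin, y_lin)
print(len(linear_svm.support_vectors_))   # the points that define the maximum-margin hyperplane

# non-linearly separable data (inner/outer circle) -> Non-linear SVM with an RBF kernel
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel='rbf').fit(X_circ, y_circ)
print(rbf_svm.score(X_circ, y_circ))      # training accuracy on the circular data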
7. Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
supervised learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead
of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for solving
classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification.
Logistic Regression uses the concept of predictive modelling as regression, which is why it is called logistic
regression; but because it is used to classify samples, it falls under the classification algorithms.
Logistic Function: Sigmoid
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the boundary
between the classes 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
Assumption of Logistic Regression
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it
becomes:
log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn
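A small sketch, assuming NumPy and scikit-learn, of the sigmoid mapping and of fitting a logistic regression model to toy binary data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # maps any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]

# toy binary classification data; the model outputs probabilities between 0 and 1
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))   # probabilistic outputs
print(clf.predict(X[:3]))         # class labels after applying the 0.5 threshold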
8. Naive Bayes
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and
used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in
building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
Why it is called Naive Bayes?
The Naïve Bayes algorithm consists of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple without depending on the
others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' theorem
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
For Linear Regression, the cost function is the Mean Squared Error (MSE), the average of the squared
differences between the actual and predicted values:
MSE = (1/N) * Σ (yi − (a1xi + a0))²
Where,
N = Total number of observations
yi = Actual value
(a1xi + a0) = Predicted value.
Residuals: The distance between the actual values and the predicted values is called the residual. If the
observed points are far from the regression line, the residuals will be high and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small and hence
the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively updating
the values to reach the minimum of the cost function.
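A minimal NumPy sketch of gradient descent for the simple line y = a1*x + a0, minimising the MSE cost function defined above; the learning rate and iteration count are arbitrary choices:

import numpy as np

# synthetic data scattered around the line y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

a1, a0 = 0.0, 0.0          # initial coefficient values
lr = 0.01                  # learning rate
for _ in range(2000):
    y_pred = a1 * x + a0
    # gradients of MSE = mean((y - y_pred)^2) with respect to a1 and a0
    grad_a1 = -2 * np.mean(x * (y - y_pred))
    grad_a0 = -2 * np.mean(y - y_pred)
    a1 -= lr * grad_a1
    a0 -= lr * grad_a0

print(a1, a0)              # should be close to 3 and 2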
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by the method
below:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the
actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination
for multiple regression.
o It can be calculated from the below formula:
R-squared = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)
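A small sketch computing R-squared both directly from the formula and with scikit-learn's r2_score; the toy values are only illustrative:

import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_actual - y_pred) ** 2)            # sum of squared residuals
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                           # R-squared from the formula
print(r2_score(y_actual, y_pred))                    # same value from scikit-learn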
Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are formal checks to perform while
building a Linear Regression model, which ensure that we get the best possible result from the given
dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable; in other words, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too wide
or too narrow, which may cause difficulties in finding coefficients.
It can be checked using a q-q plot: if the plot shows a straight line without any deviation,
the errors are normally distributed (see the sketch after this list).
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any
correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
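A hedged sketch of checking the normality of the residuals with a q-q plot, assuming SciPy's probplot and matplotlib for display; the toy regression data is only illustrative:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# toy linear data with normally distributed noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 1, 100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# q-q plot: residuals lying close to the straight line indicate normally distributed errors
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of regression residuals")
plt.show()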