Unit
1957 — Frank Rosenblatt designed the first neural network for computers (the perceptron),
which simulated the thought processes of the human brain.
1967 — The “nearest neighbor” algorithm was written, allowing computers to begin using
very basic pattern recognition. This could be used to map a route for traveling salesmen,
starting at a random city but ensuring they visit all cities during a short tour.
1997 — IBM’s Deep Blue beats the world champion at chess.
Fig. Life cycle of Machine Learning
In 1959, the term "machine learning" was first coined by Arthur Samuel, a computer scientist
at IBM and a pioneer in AI and computer gaming.
Samuel designed a computer program for playing checkers. The more the program played,
the more it learned from experience, using algorithms to make predictions.
PROBABILISTIC MODELING
Machine learning algorithms today rely heavily on probabilistic models, which take into
consideration the uncertainty inherent in real-world data. These models make predictions
based on probability distributions, rather than absolute values, allowing for a more nuanced
and accurate understanding of complex systems. One common approach is Bayesian
inference, where prior knowledge is combined with observed data to make predictions.
Another approach is maximum likelihood estimation, which seeks to find the model
parameters that best fit the observed data.
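For instance, the two approaches can be contrasted on a simple coin-flip example. The short Python sketch below (the flip data and the Beta(2, 2) prior are invented for illustration, and NumPy is assumed to be available) estimates the probability of heads once by maximum likelihood and once by Bayesian inference with a conjugate Beta prior.

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (invented for illustration)
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Maximum likelihood estimation: the probability of heads that best fits
# the observed data is simply the sample mean.
p_mle = flips.mean()

# Bayesian inference with a Beta(2, 2) prior (an assumed mild belief that
# the coin is roughly fair). The posterior is again a Beta distribution
# because the Beta prior is conjugate to the Bernoulli likelihood.
alpha_prior, beta_prior = 2.0, 2.0
alpha_post = alpha_prior + flips.sum()
beta_post = beta_prior + len(flips) - flips.sum()
p_posterior_mean = alpha_post / (alpha_post + beta_post)

print(f"MLE estimate of P(heads):            {p_mle:.3f}")
print(f"Posterior mean estimate of P(heads): {p_posterior_mean:.3f}")
```
With 7 heads out of 10 flips, the MLE is 0.7, while the prior pulls the Bayesian estimate slightly towards 0.5, illustrating how prior knowledge is combined with observed data.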
What are Probabilistic Models?
Probabilistic models are an essential component of machine learning, which aims to learn
patterns from data and make predictions on new, unseen data. They are statistical models
that capture the inherent uncertainty in data and incorporate it into their predictions.
Probabilistic models are used in various applications such as image and speech
recognition, natural language processing, and recommendation systems. In recent years,
significant progress has been made in developing probabilistic models that can handle large
datasets efficiently.
Categories Of Probabilistic Models
These models can be classified into the following categories:
o Generative models
o Discriminative models
o Graphical models
Generative models:
Generative models aim to model the joint distribution of the input and output variables.
These models generate new data based on the probability distribution of the original
dataset. Generative models are powerful because they can generate new data that resembles
the training data. They can be used for tasks such as image and speech synthesis, language
translation, and text generation.
Discriminative models
Discriminative models aim to model the conditional distribution of the output variable
given the input variable. They learn a decision boundary that separates the different classes
of the output variable. Discriminative models are useful when the focus is on making
accurate predictions rather than generating new data. They can be used for tasks such
as image recognition, speech recognition, and sentiment analysis.
Graphical models
These models use graphical representations to show the conditional dependence between
variables. They are commonly used for tasks such as image recognition, natural language
processing, and causal inference.
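As a rough sketch of the generative/discriminative distinction (assuming scikit-learn is available; the two-blob toy data is invented), Gaussian Naive Bayes below models the distribution of the inputs within each class, while logistic regression directly models the conditional probability of the class given the input.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB          # generative: models P(x | y) and P(y)
from sklearn.linear_model import LogisticRegression  # discriminative: models P(y | x) directly

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs (invented for illustration).
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

generative = GaussianNB().fit(X, y)
discriminative = LogisticRegression().fit(X, y)

x_new = np.array([[1.5, 1.5]])
print("Generative P(y | x):    ", generative.predict_proba(x_new))
print("Discriminative P(y | x):", discriminative.predict_proba(x_new))
```
Both models output class probabilities for the new point, but only the generative model could also be used to sample new data resembling the training set.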
Disadvantages Of Probabilistic Models
There are also some disadvantages to using probabilistic models.
o One of the disadvantages is the potential for overfitting, where the model is too specific to the training data and doesn't perform well on new data.
o Not all data fits well into a probabilistic framework, which can limit the usefulness of these models in certain applications.
o Another challenge is that probabilistic models can be computationally intensive and require significant resources to develop and implement.
EARLY NEURAL NETWORKS:
The neural nets described by McCulloch and Pitts in 1943 had thresholds and weights, but they
weren't arranged into layers, and the researchers didn't specify any training mechanism. What
McCulloch and Pitts showed was that a neural net could, in principle, compute any function that a
digital computer could. The result was more neuroscience than computer science: the point was to
suggest that the human brain could be thought of as a computing device.
The first trainable neural network, the Perceptron, was demonstrated by the Cornell University
psychologist Frank Rosenblatt in 1957. The Perceptron’s design was much like that of the modern
neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched
between input and output layers.
Perceptrons were an active area of research in both psychology and the fledgling discipline of
computer science until 1969, when Minsky and Papert published a book titled “Perceptrons,” which
demonstrated that executing certain fairly common computations on Perceptrons would be
impractically time consuming.
By the 1980s, however, researchers had developed algorithms for modifying neural nets’ weights
and thresholds that were efficient enough for networks with more than one layer, removing many of
the limitations identified by Minsky and Papert. The field enjoyed a renaissance.
KERNEL METHODS:
The kernel method is a mathematical technique used in machine learning for analyzing data. This method
uses a kernel function that maps data from one space to another space.
It is generally used in Support Vector Machines (SVMs) where the algorithms classify data by finding the
hyperplane that separates the data points of different classes.
The most important benefit of Kernel Method is that it can work with non-linearly separable data, and it works
with multiple Kernel functions - depending on the type of data.
Because the linear classifier can solve a very limited class of problems, the kernel trick is employed to empower
the linear classifier, enabling the SVM to solve a larger class of problems.
What are the types of Kernel methods in SVM models?
Support vector machines use various kinds of kernel methods. Here are a few of them:
1. Linear Kernel
It is used when the data is linearly separable.
K(x1, x2) = x1 . x2
2. Polynomial Kernel
It is used when the data is not linearly separable.
K(x1, x2) = (x1 . x2 + 1)^d
3. Gaussian Kernel
The Gaussian kernel is an example of a radial basis function kernel. It can be represented with this equation:
k(xi, xj) = exp(-γ ||xi - xj||²)
4. Exponential Kernel
Similar to RBF kernel, but it decays much more quickly.
k(x, y) = exp(-||x - y|| / (2σ²))
5. Laplacian Kernel
Similar to RBF Kernel, it has a sharper peak and faster decay.
k(x, y) = exp(-||x - y|| / σ)
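The kernels listed above can be written directly as small NumPy functions. The sketch below follows the formulas given here; the particular values of gamma, the degree d, and sigma are arbitrary choices for illustration.

```python
import numpy as np

def linear_kernel(x1, x2):
    # K(x1, x2) = x1 . x2
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, d=3):
    # K(x1, x2) = (x1 . x2 + 1)^d
    return (np.dot(x1, x2) + 1) ** d

def gaussian_kernel(x1, x2, gamma=0.5):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def exponential_kernel(x1, x2, sigma=1.0):
    # k(x, y) = exp(-||x - y|| / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x1 - x2) / (2 * sigma ** 2))

def laplacian_kernel(x1, x2, sigma=1.0):
    # k(x, y) = exp(-||x - y|| / sigma)
    return np.exp(-np.linalg.norm(x1 - x2) / sigma)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for name, k in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                ("gaussian", gaussian_kernel), ("exponential", exponential_kernel),
                ("laplacian", laplacian_kernel)]:
    print(f"{name:12s} kernel: {k(a, b):.4f}")
```
In an SVM, such a kernel replaces the plain dot product between data points, which is what allows a linear classifier to separate non-linearly separable data.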
DECISION TREES:
A decision tree is a predictive model that uses a flowchart-like structure to make decisions based on
input data. It divides data into branches and assigns outcomes to leaf nodes. Decision trees are used for
classification and regression tasks, providing easy-to-understand models.
A decision tree is a hierarchical model used in decision support that depicts decisions and their potential
outcomes, incorporating chance events, resource expenses, and utility. This algorithmic model utilizes
conditional control statements and is non-parametric, supervised learning, useful for both classification
and regression tasks. The tree structure is comprised of a root node, branches, internal nodes, and leaf
nodes, forming a hierarchical, tree-like structure.
It is a tool that has applications spanning several different areas. Decision trees can be used for
classification as well as regression problems. The name itself suggests that it uses a flowchart-like tree
structure to show the predictions that result from a series of feature-based splits. It starts with a root node
and ends with a decision made by the leaves.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split further
after a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub-Tree: A subtree formed by splitting the main tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
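As a small illustration of these terms (assuming scikit-learn; the built-in iris dataset is used only as an example), the snippet below fits a shallow decision tree and prints its splits: the first condition is the root node, nested conditions are internal nodes, and the "class: ..." lines are leaf nodes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree keeps the structure easy to read; max_depth=2 is an
# arbitrary choice for illustration.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned splits as text: root node first, then internal nodes,
# with leaves shown as predicted classes.
print(export_text(tree, feature_names=load_iris().feature_names))
```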
RANDOM FOREST:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can
be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority
vote of predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.
Steps for Random Forest algorithm work:
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the steps below:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random Forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result; when a
new data point arrives, the Random Forest classifier predicts the final decision based on the
majority of these results.
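A minimal sketch of the same workflow (assuming scikit-learn; the iris dataset and the choice of 100 trees are illustrative assumptions) is shown below.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number N of decision trees; each tree is trained on a
# random bootstrap subset of the data (Steps 1-4 above).
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Step 5: each tree votes, and the majority class is the final prediction.
print("Test accuracy:", forest.score(X_test, y_test))
```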
Applications of Random Forest
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
GRADIENT BOOSTING MACHINE (GBM):
Gradient Boosting Machine (GBM) is one of the most popular forward learning ensemble
methods in machine learning. It is a powerful technique for building predictive models for
regression and classification tasks.
GBM helps us to get a predictive model in the form of an ensemble of weak prediction models
such as decision trees. When decision trees are used as the weak learners, the resulting
algorithm is called gradient-boosted trees.
It enables us to combine the predictions from various learner models and build a final
predictive model that makes the correct prediction.
A different subset of features is taken by the nodes of each decision tree to select the best
split; this means that each tree behaves differently and hence captures different signals from
the same data.
Gradient boosting relies on three main components:
o Loss function
o Weak learners
o Additive model
1. Loss function:
There is a large family of loss functions in machine learning that can be used depending on the type of task
being solved. The choice of loss function is guided by the desired characteristics of the conditional distribution,
such as robustness. When using a loss function in our task, we must specify the loss function and the function to
calculate the corresponding negative gradient. Once we have these two functions, they can be implemented into
gradient boosting machines easily. Several loss functions have already been proposed for GBM algorithms.
2. Weak Learner:
Weak learners are the base learner models that learn from past errors and help in building a strong predictive
model design for boosting algorithms in machine learning. Generally, decision trees work as weak learners in
boosting algorithms.
3. Additive Model:
The additive model refers to adding trees to the model. We should not add multiple trees at a time; only a
single tree is added at each step so that the existing trees in the model are not changed. A gradient descent
procedure is used to minimize the loss as trees are added.
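To make these three components concrete, here is a minimal sketch of gradient boosting for regression with a squared-error loss (the toy data, learning rate, and number of trees are invented for illustration): shallow decision trees act as the weak learners, each new tree is fit to the negative gradient of the loss (the residuals), and the additive model grows one tree at a time.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # toy regression data

n_trees, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())               # start from a constant model
trees = []

for _ in range(n_trees):
    # Loss function: for squared error, the negative gradient is the residual.
    residuals = y - prediction
    # Weak learner: a shallow regression tree fitted to the residuals.
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    # Additive model: add one tree at a time, scaled by the learning rate,
    # leaving the previously fitted trees unchanged.
    prediction += learning_rate * tree.predict(X)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```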
MACHINE LEARNING:
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine learning
contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train
them, and on the basis of training, they build the model and perform a specific task.
Based on the methods and way of learning, machine learning is divided into mainly four
types: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. It means that in the
supervised learning technique, we train the machines using a "labelled" dataset, and based
on the training, the machine predicts the output. Here, the labelled data specifies that some of
the inputs are already mapped to the output. More precisely, we can say that first we train the
machine with the input and corresponding output, and then we ask the machine to predict the
output using the test dataset.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is
continuous and there is a relationship between the input and output variables. These are used to
predict continuous output variables, such as market trends, weather prediction, etc.
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name
suggests, there is no need for supervision. It means that in unsupervised machine learning, the
machine is trained using an unlabelled dataset, and the machine predicts the output without
any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is
a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing behaviour.
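A brief clustering sketch (assuming scikit-learn; the "purchasing behaviour" numbers are invented) groups such customers with K-means:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Invented customer features: annual spend and visits per month.
customers = np.vstack([rng.normal([200, 2], [30, 0.5], size=(30, 2)),
                       rng.normal([800, 8], [50, 1.0], size=(30, 2))])

# K-means groups the customers into 2 clusters of similar behaviour,
# without any labels being provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centres:\n", kmeans.cluster_centers_)
```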
2) Association
Association is a rule-based unsupervised learning technique that finds interesting relationships
among variables in a large dataset. A typical example is market basket analysis, which discovers
items that are frequently purchased together.
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning; agents learn
from their experiences only.
MODEL EVALUATION METRICS:
Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions.
This is the most fundamental metric used to evaluate the model.
PRECISION:
Precision is the ratio of true positives to the summation of true positives and false positives. It basically
analyses the positive predictions
RECALL:
Recall is the ratio of true positives to the summation of true positives and false negatives. It basically
analyses the number of correct positive samples
F1 score
The F1 score is the harmonic mean of precision and recall. In the precision-recall trade-off,
if we increase the precision, recall decreases, and vice versa. The goal of the F1 score is to
combine precision and recall.
Confusion Matrix
A confusion matrix is an N x N matrix where N is the number of target classes. It represents the
number of actual outputs and the predicted outputs
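These metrics can be computed directly from the four counts of a binary confusion matrix. The counts below are invented for illustration.

```python
# Hypothetical counts for a binary classifier (invented for illustration).
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)           # correct predictions / all predictions
precision = tp / (tp + fp)                            # quality of positive predictions
recall    = tp / (tp + fn)                            # coverage of actual positive samples
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```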
Holdout Method
Holdout is the simplest approach. It is used in neural networks as well as in many classifiers. In
this technique, the dataset is divided into training and test datasets.
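A minimal holdout split might look like the following sketch (assuming scikit-learn; the 70/30 split ratio and the iris dataset are arbitrary choices).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training samples:", len(X_train))
print("Test samples:    ", len(X_test))
```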
UNDERFITTING AND OVERFITTING:
Bias and Variance in Machine Learning
Bias: The assumptions made by a model to make the target function easier to learn. In practice,
it shows up as the error rate on the training data: when this error rate is high, we call it high
bias, and when it is low, we call it low bias.
Variance: The difference between the error rate on the training data and the error rate on the
testing data. If the difference is high, it is called high variance; when the difference in errors
is low, it is called low variance. Usually, we want low variance so that our model generalizes
well.
Reasons for Underfitting:
1. High bias and low variance.
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.
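To see underfitting and overfitting numerically, the sketch below (assuming scikit-learn and NumPy; the noisy sine data and the polynomial degrees are invented) compares training and test error for models of increasing complexity: the too-simple model has high error on both sets (high bias), while the too-complex model fits the training set almost perfectly but does poorly on the test set (high variance).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=60)     # noisy toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # Compare training error with test error to diagnose under/overfitting.
    train_err = np.mean((model.predict(X_train) - y_train) ** 2)
    test_err = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```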