Unit - I: Siddharth Institute of Engineering & Technology:: Puttur
Subject with Code: Machine Learning(20CS0535) Course & Branch: B.Tech - CSE
Regulation: R20    Year & Sem: III B.Tech & II Sem
UNIT –I
INTRODUCTION
Following are some key points which show the importance of Machine Learning:
b. List out applications and some popular algorithms used in Machine Learning. Explain it. [L2][CO1] [10M]
Applications of Machine Learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is the automatic friend tagging suggestion.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition and is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day.
Everyone who uses Google Maps is helping this app to become better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products as per the customer's interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them over the server on the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; hence the system detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumours and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; machine learning helps us here by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation model that translates the text into our familiar language, and this is known as automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates text from one language to another.
Popular Machine Learning Algorithms:
There are numerous machine learning algorithms available, each with its strengths and weaknesses.
The choice of algorithm depends on the nature of the problem, the type and size of the data, and the
desired outcome. Here are some popular machine learning algorithms:
1. Linear Regression: A supervised learning algorithm used for regression tasks. It models the
relationship between the dependent variable and one or more independent variables by fitting a linear
equation to the data.
2. Logistic Regression: A supervised learning algorithm used for classification tasks. It models the
relationship between the independent variables and the probability of a binary outcome using the
logistic function.
3. Decision Trees: Supervised learning algorithms that build a tree-like model of decisions and their
possible consequences. They split the data based on feature values to make predictions.
4. Random Forests: An ensemble learning method that combines multiple decision trees to make
predictions. It improves generalization and reduces overfitting compared to individual decision
trees.
5. Support Vector Machines (SVM): A supervised learning algorithm used for both classification
and regression tasks. SVM finds the best hyperplane that separates data points of different classes
or predicts a continuous target variable.
6. Naive Bayes: A probabilistic supervised learning algorithm based on Bayes' theorem. It assumes
independence among features and is particularly efficient for text classification and spam filtering
tasks.
7. k-Nearest Neighbors (k-NN): A lazy learning algorithm that classifies new instances based on
their similarity to existing labeled instances. It assigns the most frequent class label among the k
nearest neighbors in the feature space.
8. Neural Networks: Deep learning algorithms that consist of interconnected layers of artificial
neurons. They can learn complex patterns and relationships in data and are widely used for image
recognition, natural language processing, and other tasks.
9. Gradient Boosting Methods: Ensemble learning techniques that combine weak learners, such as
decision trees, in a sequential manner to create a strong predictive model. Examples include
AdaBoost, Gradient Boosting Machines (GBM), and XGBoost.
10. Clustering Algorithms: Unsupervised learning algorithms used to identify groups or clusters
within data. Examples include k-means clustering, hierarchical clustering, and DBSCAN.
11. Dimensionality Reduction Algorithms: Techniques used to reduce the number of features in a
dataset while preserving essential information. Principal Component Analysis (PCA) and t-SNE (t-
Distributed Stochastic Neighbor Embedding) are commonly used for dimensionality reduction.
12. Reinforcement Learning Algorithms: Algorithms that learn through interaction with an
environment and receive rewards or penalties based on their actions. Reinforcement learning is often
used in robotics, game playing, and control systems.
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning
model. The aim of a supervised learning algorithm is to find a mapping function to map the input
variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
How Supervised Learning Works?
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (data held out from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
Types of Supervised Machine Learning Algorithms:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable.
It is used for the prediction of continuous variables, such as Weather forecasting, Market Trends, etc. Below
are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two classes, such as Yes-No, Male-Female, True-False, etc. Example: Spam Filtering. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Disadvantages of Supervised Learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Machine Learning
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike
supervised learning, we have the input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of
different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it does
not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to
identify the image features on their own. Unsupervised learning algorithm will perform this task by clustering
the image dataset into the groups according to similarities between images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is very similar to how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output so to solve such cases,
we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are also not given. Now, this unlabeled input data is fed to the machine learning model in order to train it. Firstly, it will interpret the raw data to find the hidden patterns in the data and then will apply suitable algorithms such as k-means clustering, decision tree, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
Advantages of Unsupervised Learning:
o Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easier to get unlabeled data in comparison to labeled data.
Reinforcement learning
It is an area of Machine Learning. It is about taking suitable action to maximize reward in a
particular situation. It is employed by various software and machines to find the best possible
behaviour or path it should take in a specific situation. Reinforcement learning differs from
supervised learning in a way that in supervised learning the training data has the answer key with
it so the model is trained with the correct answer itself whereas in reinforcement learning, there is
no answer but the reinforcement agent decides what to do to perform the given task. In the absence
of a training dataset, it is bound to learn from its experience.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will start
• Output: There are many possible outputs as there are a variety of solutions to a
particular problem
• Training: The training is based upon the input. The model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
Supervised learning algorithms have a wide range of applications across various domains. Here are some
common applications of supervised learning:
1. Image and Object Recognition: Supervised learning algorithms, such as convolutional neural networks
(CNNs), are widely used for image classification, object detection, and recognition tasks. They can
accurately identify and classify objects within images, enabling applications like self-driving cars, facial
recognition, medical imaging analysis, and quality control in manufacturing.
2. Natural Language Processing (NLP): Supervised learning algorithms play a crucial role in NLP tasks,
including sentiment analysis, text classification, named entity recognition, machine translation, and
question-answering systems. They can understand and process human language, enabling applications like
chatbots, virtual assistants, and automated language translation.
3. Fraud Detection: Supervised learning algorithms can identify fraudulent activities in financial
transactions by learning patterns from labeled data. They help detect anomalies, classify transactions as
legitimate or fraudulent, and provide real-time fraud alerts, benefiting industries like banking, insurance,
and e-commerce.
4. Credit Scoring: Supervised learning algorithms are utilized in credit scoring to assess the
creditworthiness of individuals or businesses. By learning from historical data, these algorithms can predict
the likelihood of default or delinquency, helping banks and lending institutions make informed decisions on
granting loans or credit.
5. Medical Diagnosis: Supervised learning algorithms assist in medical diagnosis by learning from labeled
patient data. They can analyze symptoms, patient history, and medical test results to predict diseases,
recommend treatment options, and aid doctors in making accurate diagnoses.
6. Customer Churn Prediction: Supervised learning algorithms can predict customer churn, which is the
likelihood of customers discontinuing their relationship with a business. By analyzing customer behavior,
demographics, and transactional data, these algorithms help identify at-risk customers, allowing businesses
to take proactive measures to retain them.
7. Recommendation Systems: Supervised learning algorithms, such as collaborative filtering and matrix
factorization, power recommendation systems. By learning from user behavior and preferences, these
algorithms can provide personalized recommendations for products, movies, music, and more, enhancing
user experience and driving sales.
8. Speech Recognition: Supervised learning algorithms, like recurrent neural networks (RNNs) and hidden
Markov models (HMMs), enable accurate speech recognition and transcription. They are used in
applications like voice assistants, transcription services, voice-controlled devices, and speech-to-text
conversion.
9. Predictive Maintenance: Supervised learning algorithms can predict equipment failures and
maintenance needs by learning from sensor data, historical maintenance records, and environmental factors.
They help optimize maintenance schedules, reduce downtime, and improve operational efficiency in
industries like manufacturing, energy, and transportation.
10. Stock Market Prediction: Supervised learning algorithms are utilized in analyzing historical stock
market data to predict future trends, price movements, and investment opportunities. They assist traders,
investors, and financial institutions in making informed decisions.
These are just a few examples of how supervised learning algorithms are applied in various fields. The
flexibility and effectiveness of supervised learning make it a valuable tool in numerous industries and
domains, driving advancements and improving decision-making processes.
Artificial Intelligence vs Machine Learning (selected points):
4. AI: The aim is to increase the chance of success and not accuracy. | ML: The aim is to increase accuracy, but it does not care about the success.
6. AI: It works as a computer program that does smart work. | ML: Here, the machine takes data and learns from the data.
7. AI: The goal is to simulate natural intelligence to solve complex problems. | ML: The goal is to learn from data on a certain task to maximize the performance on that task.
15. AI: AI can work with structured, semi-structured, and unstructured data. | ML: ML can work with only structured and semi-structured data.
A classifier is a type of machine learning algorithm that assigns a label to a data input. Classifier algorithms
use labeled data and statistical methods to produce predictions about data input classifications.
1. Logistic Regression
2. K-Nearest Neighbor
3. Support Vector Machine
• Kernel SVM
4. Naïve Bayes
5. Decision Tree Classification
1. LOGISTIC REGRESSION:
Logistic regression is similar to linear regression, but it is used when the dependent variable is not a number but something else (e.g., a "yes/no" response). It is called regression but performs classification: based on the regression, it classifies the dependent variable into one of the classes.
Logistic regression is used for the prediction of output which is binary, as stated above. For example, if a credit card company builds a model to decide whether or not to issue a credit card to a customer, it will model whether the customer is going to default or not default on the card.
Linear Regression
Firstly, linear regression is performed on the relationship between the variables to get the model, and a threshold (typically 0.5) is set for classification. The logistic function is applied to the regression output to get the probabilities of it belonging to either class.
It models the log of the odds, i.e., the ratio of the probability of the event occurring to the probability of it not occurring. In the end, it classifies the variable into the class with the higher probability.
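A minimal sketch of this idea, assuming scikit-learn and a small hypothetical dataset (two features standing in for, say, income and age, with 1 = default and 0 = no default):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [income_in_thousands, age] -> 1 = default, 0 = no default
X = np.array([[20, 25], [35, 40], [60, 30], [80, 50], [25, 22], [90, 45]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X, y)                                  # fits the logistic (sigmoid) model

# predict_proba returns the probabilities of class 0 and class 1;
# the class with the higher probability becomes the predicted label.
print(clf.predict_proba([[40, 35]]))
print(clf.predict([[40, 35]]))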
2. K-NEAREST NEIGHBORS (K-NN):
The K-NN algorithm is one of the simplest classification algorithms, and it is used to identify the data points that are separated into several classes in order to predict the classification of a new sample point. K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a similarity measure (i.e., distance functions).
K-NN works well with a small number of input variables (p), but struggles when the number of inputs is very
large.
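A minimal k-NN sketch, assuming scikit-learn and its bundled Iris dataset; k = 5 means each new point takes the majority class of its five nearest neighbours:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)      # distance-based, lazy learner
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))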
3. SUPPORT VECTOR MACHINE (SVM):
The support vector machine is used for both regression and classification. It is based on the concept of decision planes that define decision boundaries. A decision plane (hyperplane) is one that separates a set of objects having different class memberships.
It performs classification by finding the hyperplane that maximizes the margin between the two classes with the help of support vectors.
The learning of the hyperplane in SVM is done by transforming the problem using some linear algebra (i.e., the example above is a linear kernel, which assumes linear separability between the variables).
For higher-dimensional data, other kernels are used, as the points cannot be classified easily with a linear boundary.
Kernel SVM
Kernel SVM takes in a kernel function in the SVM algorithm and transforms the data into the required form, mapping it to a higher dimension in which it becomes separable.
The RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel SVM actually does is create non-linear combinations of features to lift the samples onto a higher-dimensional feature space where a linear decision boundary can be used to separate classes.
So, the rule of thumb is: use linear SVMs for linear problems, and nonlinear kernels such as the RBF kernel for non-linear problems.
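A minimal sketch contrasting the linear and RBF kernels, assuming scikit-learn; the two-moons toy data used here is a standard non-linearly-separable example, so the RBF kernel should fit it noticeably better:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)   # the RBF kernel implicitly maps points to a higher dimension
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))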
4. NAIVE BAYES
The naive Bayes classifier is based on Bayes' theorem with independence assumptions between predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other feature). Even if these features depend on each other, or upon the existence of the other features, all of these properties are treated as contributing independently to the probability, which is why the model is called "naive".
Based on naive Bayes, Gaussian naive Bayes is used for classification based on the Gaussian (normal) distribution of the data.
Bayes' theorem: P(class | data) = P(data | class) × P(class) / P(data), where:
• P(class|data) is the posterior probability of the class (target) given the predictor (attribute): the probability of a data point belonging to a class, given the data point. This is the value that we are looking to calculate.
• P(class) is the prior probability of the class.
• P(data|class) is the likelihood, which is the probability of the predictor given the class.
• P(data) is the prior probability of the predictor, or the marginal likelihood.
P(yellow) = 10/17
P(green) = 7/17
P(?) = 4/17
This value is the same when computing the posterior for both classes.
3. Calculate Likelihood
P(data/class) = Number of similar observations to the class/Total no. of points in the class.
P(?/yellow) = 1/7
P(?/green) = 3/10
6. Classification:
The class with the higher posterior probability is chosen; from the above calculation, with about 75% probability, the point belongs to the class green.
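A minimal Gaussian naive Bayes sketch, assuming scikit-learn and the Iris dataset; internally the classifier computes the same kind of prior-times-likelihood products described above and picks the class with the highest posterior:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)          # estimates per-class priors and Gaussian likelihoods
print("Posterior probabilities for one test point:", gnb.predict_proba(X_test[:1]))
print("Predicted class:", gnb.predict(X_test[:1]))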
Multinomial and Bernoulli naive Bayes are other models used in calculating the probabilities. Thus, a naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
5. DECISION TREE CLASSIFICATION:
A decision tree builds a classification model in the form of a tree structure. It breaks the dataset down into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. It follows the Iterative Dichotomiser 3 (ID3) algorithm structure for determining the split.
Entropy
Entropy is the degree or amount of uncertainty in the randomness of elements. In other words, it is a measure
of impurity.
Intuitively, it tells us about the predictability of a certain event. Entropy measures the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided the entropy is one.
Information Gain
Information gain measures the relative change in entropy with respect to the independent attribute. It tries to
estimate the information contained by each attribute. Constructing a decision tree is all about finding the
attribute that returns the highest information gain (i.e., the most homogeneous branches).
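The formula being described can be stated in the standard ID3 form (a reference statement, with p_i the proportion of class i in set T, and T_v the subset of T for which attribute X takes value v):

Entropy(T) = - Σ_i p_i * log2(p_i)
Gain(T, X) = Entropy(T) - Σ_v ( |T_v| / |T| ) * Entropy(T_v)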
Where Gain(T, X) is the information gain by applying feature X. Entropy(T) is the entropy of the entire set,
while the second term calculates the entropy after applying the feature X.
Information gain ranks attributes for filtering at a given node in the tree. The ranking is based on the highest information gain at each split.
The disadvantage of a decision tree model is overfitting, as it tries to fit the model by going deeper and deeper into the training set, thereby reducing test accuracy.
An ensemble model is a team of models. Technically, ensemble models comprise several supervised learning
models that are individually trained and the results merged in various ways to achieve the final prediction.
This result has higher predictive power than the results of any of its constituting learning algorithms
independently.
Ensemble methods combine more than one algorithm of the same or different kinds for classifying objects (e.g., an ensemble of SVM, naive Bayes, and decision trees). The general idea is that a combination of learning models improves the overall result.
Deep decision trees may suffer from overfitting, but random forests prevent overfitting by creating trees on
random subsets. The main reason is that it takes the average of all the predictions, which cancels out the
biases.
Random forest adds additional randomness to the model while growing the trees. Instead of searching for the
most important feature while splitting a node, it searches for the best feature among a random subset of
features. This results in a wide diversity that generally results in a better model.
Gradient boosting classifier is a boosting ensemble method. Boosting is a way to combine (ensemble) weak
learners, primarily to reduce prediction bias. Instead of creating a pool of predictors, as in bagging, boosting
produces a cascade of them, where each output is the input for the following learner. Typically, in a bagging
algorithm trees are grown in parallel to get the average prediction across all trees, where each tree is built on
a sample of original data. Gradient boosting, on the other hand, takes a sequential approach to obtaining
predictions instead of parallelizing the tree building process. In gradient boosting, each decision tree predicts
the error of the previous decision tree — thereby boosting (improving) the error (gradient).
5. Repeat steps two through four for a certain number of iterations (the number of iterations will be the number
of trees).
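A minimal gradient boosting sketch, assuming scikit-learn; the n_estimators trees are fitted sequentially, each one to the residual error of the ensemble built so far:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbc.fit(X_train, y_train)          # trees are added one after another, not in parallel
print("Test accuracy:", gbc.score(X_test, y_test))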
4a. List out various Unsupervised learning techniques used in Machine Learning. [L1][CO5] [5M]
Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset and
are allowed to act on that data without any supervision.
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
Clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (data points can belong to more than one group), but various other approaches to clustering also exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined
groups. The cluster center is created in such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying
different clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data
space are divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high
dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of how a
dataset belongs to a particular distribution. The grouping is done by assuming some distributions
commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian
Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no requirement
of pre-specifying the number of clusters to be created. In this technique, the dataset is divided into clusters to
create a tree-like structure, which is also called a dendrogram. The observations or any number of clusters
can be selected by cutting the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster.
Each dataset has a set of membership coefficients, which depend on the degree of membership to be in a
cluster. Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as
the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using; for example, some algorithms need to guess the number of clusters in the given dataset, whereas others need to find the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The number
of clusters must be specified in this algorithm. It is fast with fewer computations required, with the
linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of
data points. It is an example of a centroid-based model, that works on updating the candidates for
centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise.
It is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means can fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset
and then successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require the number of clusters to be specified. In this, each data point sends messages between pairs of data points until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
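A minimal k-means sketch, assuming scikit-learn; make_blobs is used only to generate hypothetical unlabeled data with three natural groups, and k must be specified up front:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled toy data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)      # k = 3 clusters
labels = kmeans.fit_predict(X)        # assigns each point to its nearest centroid

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten cluster labels:", labels[:10])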
5. Summarize the Guidelines for Machine Learning Experiments. [L2][CO1] [12M]
Guidelines for Machine Learning Experiments
Before we start experimentation, we need to have a good idea about what it is we are studying, how the data is to be collected, and how we are planning to analyze it.
We need to start by stating the problem clearly, defining what the objectives are. In machine learning, there
may be several possibilities. As we discussed before, we may be interested in assessing the expected error
(or some other response measure) of a learning algorithm on a particular problem and check that, for example,
the error is lower than a certain acceptable level.
Given two learning algorithms and a particular problem as defined by a dataset, we may want to determine
which one has less generalization error. These can be two different algorithms, or one can be a proposed
improvement of the other, for example, by using a better feature extractor.
In the general case, we may have more than two learning algorithms, and we may want to choose the one
with the least error, or order them in terms of error, for a given dataset. In an even more general setting,
instead of on a single dataset, we may want to compare two or more algorithms on two or more datasets.
We need to decide on what we should use as the quality measure. Most frequently, error is used that is the
misclassification error for classification and mean square error for regression. We may also use some variant;
for example, generalizing from 0/1 to an arbitrary loss, we may use a risk measure. In information retrieval,
we use measures such as precision and recall;
In a cost-sensitive setting, not only the output but also system parameters, for example, its complexity, are taken into account.
What the factors are depends on the aim of the study. If we fix an algorithm and want to find the best hyper
parameters, then those are the factors. If we are comparing algorithms, the learning algorithm is a factor. If
we have different datasets, they also become a factor. The levels of a factor should be carefully chosen so as
not to miss a good configuration and avoid doing unnecessary experimentation. It is always good to try to
normalize factor levels.
For example, in optimizing k of k-nearest neighbor, one can try values such as 1, 3, 5, and so on, but in optimizing the spread h of Parzen windows, we should not try absolute values such as 1.0, 2.0, and so on, because that depends on the scale of the input; it is better to find some statistic that is an indicator of scale (for example, the average distance between an instance and its nearest neighbor) and try h as different multiples of that statistic. Though previous expertise is a plus in general, it is also important to investigate all factors and factor levels that may be of importance and not be overly influenced by past experience.
It is always better to do a factorial design unless we are sure that the factors do not interact, because mostly
they do. Replication number depends on the dataset size; it can be kept small when the dataset is large; we
will discuss this in the next section when we talk about resampling. However, too few replicates generate too little data, and this will make comparing distributions difficult; in the particular case of parametric tests, the assumption of Gaussianity may not be tenable. Generally, given some dataset, we leave some part as the
test set and use the rest for training and validation, probably many times by resampling. How this division is
done is important.
In practice, using small datasets leads to responses with high variance, and the differences will not be
significant and results will not be conclusive. It is also important to avoid as much as possible toy, synthetic
data and use datasets that are collected from real-world under real-life circumstances.
Before running a large factorial experiment with many factors and levels, it is best if one does a few trial runs
for some random settings to check that all is as expected. In a large experiment, it is always a good idea to
save intermediate results (or seeds of the random number generator), so that a part of the whole experiment
can be rerun when desired.
All the results should be reproducible. In running a large experiment with many factors and factor levels,
one should be aware of the possible negative effects of software aging. It is important that an experimenter
be unbiased during experimentation. In comparing one’s favorite algorithm with a competitor, both should
be investigated equally diligently.
In large-scale studies, it may even be envisaged that testers be different from developers. One should avoid
the temptation to write one’s own “library” and instead, as much as possible, use code from reliable sources;
such code would have been better tested and optimized.
As in any software development study, the advantages of good documentation cannot be underestimated,
especially when working in groups. All the methods developed for high-quality software engineering should
also be used in machine learning experiments.
This corresponds to analyzing data in a way so that whatever conclusion we get is not subjective or due to
chance. We cast the questions that we want to answer in a hypothesis testing framework and check whether
the sample supports the hypothesis.
For example, the question "Is A a more accurate algorithm than B?" becomes the hypothesis "Can we say
that the average error of learners trained by A is significantly lower than the average error of learners trained
by B?" As always, visual analysis is helpful, and we can use histograms of error distributions, whisker-and-
box plots, range plots, and so on.
Once all data is collected and analyzed, we can draw objective conclusions. One frequently encountered
conclusion is the need for further experimentation. Most statistical, and hence machine learning or data
mining, studies are iterative. It is for this reason that we never start with all the experimentation. It is suggested
that no more than 25 percent of the available resources should be invested in the first experiment
(Montgomery 2005). The first runs are for investigation only. That is also why it is a good idea not to start
with high expectations, or promises to one’s boss or thesis advisor. We should always remember that
statistical testing never tells us if the hypothesis is correct or false, but how much the sample seems to concur
with the hypothesis. There is always a risk that we do not have a conclusive result or that our conclusions may be wrong, especially if the data is small and noisy. When our expectations are not met, it is most helpful to investigate why they are not. For example, in checking why our favorite algorithm A has performed badly on some cases, we can get a splendid idea for some improved version of A.
All improvements are due to the deficiencies of the previous version; finding a deficiency is but a helpful
hint that there is an improvement we can make! But we should not go to the next step of testing the improved
version before we are sure that we have completely analyzed the current data and learned all we could learn
from it. Ideas are cheap, and useless unless tested, which is costly.
6a. Explain Model Selection in Machine Learning. [L2][CO1] [6M]
Model Selection: Model selection refers to the process of choosing the best model from a set of
candidate models for a specific task or problem. In machine learning, a model is a mathematical
representation of the relationships between input variables (features) and the target variable (output).
Model selection is crucial because different models have different complexities, assumptions, and
performance characteristics, and choosing an appropriate model can greatly impact the accuracy and
efficiency of the learning system.
Model Complexity: Complex models can potentially capture intricate patterns in the data but may be prone to overfitting. The trade-off between complexity and generalization is often a key factor in model selection.
Domain Knowledge: Understanding the problem domain and having prior knowledge about the data can
guide the selection of an appropriate model.
Training Data Availability: The amount of available training data influences the choice of model.
Model Performance Metrics: The choice of performance metrics depends on the nature of the problem.
For example, in classification tasks, metrics like accuracy, precision, recall, and F1-score are commonly
used.
Computational Resources: When selecting a model, it's important to consider the available computational
resources, such as processing power, memory, and time constraints.
b. Discriminate Generalization in machine learning with examples. [L5][CO1] [6M]
Generalization: Generalization refers to the ability of a trained model to perform well on unseen or new
data that it hasn't encountered during the training phase. The ultimate goal of machine learning is to develop
models that generalize well, as they can make accurate predictions or decisions on real-world, unseen
instances.
To achieve good generalization, it's important to balance model complexity and simplicity. If a model is too
simple, it may underfit the data, failing to capture important patterns. On the other hand, if a model is too
complex, it may overfit the training data, memorizing noise or irrelevant details and performing poorly on
new data.
Regularization techniques, such as L1 and L2 regularization, dropout, or early stopping, can help control
the complexity of models and prevent overfitting.
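As a small illustration of L2 regularization, here is a hedged sketch assuming scikit-learn: a Ridge model penalizes large coefficients, which limits complexity and usually helps generalization when only a few features are truly informative (the synthetic data here is a hypothetical example).

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(60, 10)                                   # 10 features, mostly uninformative
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=60)     # only the first feature matters

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("plain linear", LinearRegression()), ("ridge (L2)", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))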
Supervised Learning vs Unsupervised Learning:
o Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check if it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision to train the model.
o Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
o A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result as compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as in this we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns in a similar way to how a child learns daily routine things from experience.
Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximize
reward in a particular situation. It is employed by various software and machines to find the best possible
behaviour or path it should take in a specific situation. Reinforcement learning differs from supervised
learning in a way that in supervised learning the training data has the answer key with it so the model is
trained with the correct answer itself whereas in reinforcement learning, there is no answer but the
reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it
is bound to learn from its experience.
Example: The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example explains the problem more easily.
The above image shows the robot, diamond, and fire. The goal of the robot is to get the reward, that is, the diamond, and avoid the hurdles, that is, the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that is, the diamond.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will start
• Output: There are many possible outputs as there are a variety of solutions to a particular
problem
• Training: The training is based upon the input. The model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
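A minimal, hypothetical Q-learning sketch of the robot-and-diamond idea: a one-dimensional corridor where reaching the diamond at the right end earns +10, every step costs 1, and the learned Q-table ends up encoding the best path (always move right). The state layout, rewards, and hyperparameters are illustrative assumptions only.

import random

# States 0..4 in a corridor; state 4 holds the diamond (+10), each move costs 1.
N_STATES, ACTIONS = 5, [-1, +1]                 # actions: move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.1           # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 10 if s_next == N_STATES - 1 else -1
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

print("Best action per state:", [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])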
Types of Reinforcement: There are two types of reinforcement:
1. Positive –
Positive reinforcement is defined as an event that, occurring as a result of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages:
• Maximizes performance
• Sustains change for a long period of time
Disadvantage: Too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative –
Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage: It only provides enough to meet the minimum behavior.
Various Practical applications of Reinforcement Learning –
Regression Techniques in Machine Learning:
1. Linear Regression
It is one of the most widely known modeling techniques and the most famous regression technique in
Machine Learning. Linear regression is usually among the first few topics which people pick while learning
predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s)
can be continuous or discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between the dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as Regression line).
It is represented by an equation Y=a+b*X + e, where a is the intercept, b is the slope of the line and e is
error term. This equation can be used to predict the value of the target variable based on the given predictor
variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one independent variable. Now, the question is "How do we obtain the best-fit line?"
How to obtain the best fit line (Value of a and b)?
This task can be easily accomplished by Least Square Method. It is the most common method used for
fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the
squares of the vertical deviations from each data point to the line. Because the deviations are first squared,
when added, there is no canceling out between positive and negative values.
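A minimal least-squares sketch, assuming NumPy; np.polyfit minimizes the sum of squared vertical deviations and returns the slope b and intercept a of Y = a + b*X + e (the data values here are hypothetical):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])        # roughly Y = 2*X with small noise

b, a = np.polyfit(X, Y, deg=1)                  # degree-1 fit returns [slope, intercept]
print("intercept a =", round(a, 3), " slope b =", round(b, 3))
print("prediction at X = 6:", round(a + b * 6, 3))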
2. Logistic Regression
Logistic regression in Machine Learning is used to find the probability of event=Success and event=Failure.
We should use logistic regression when the dependent variable is binary (0/ 1, True/ False, Yes/ No) in
nature. Here the value of Y ranges from 0 to 1 and it can be represented by the following equation.
odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability of the presence of the characteristic of interest. A question that you should ask here is "why have we used log in the equation?".
Since we are working here with a binomial distribution (dependent variable), we need to choose a link
function which is best suited for this distribution. And, it is a logit function. In the equation above, the
parameters are chosen to maximize the likelihood of observing the sample values rather than minimizing
the sum of squared errors (like in ordinary regression).
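A small numeric illustration of the logit link, using hypothetical coefficients b0 = -4 and b1 = 0.08 for a single predictor X1; the linear combination gives the log-odds, and the inverse of the logit (the sigmoid) maps it back to a probability between 0 and 1:

import math

b0, b1 = -4.0, 0.08                      # hypothetical fitted coefficients
x1 = 60.0                                # a hypothetical predictor value

log_odds = b0 + b1 * x1                  # logit(p) = ln(p / (1 - p)) = b0 + b1*X1
p = 1.0 / (1.0 + math.exp(-log_odds))    # invert the logit to recover p

print("log-odds:", log_odds)                       # 0.8
print("probability of the event:", round(p, 3))    # about 0.69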
3. Polynomial Regression
A regression equation in Machine Learning is a polynomial regression equation if the power of the
independent variable is more than 1. The equation below represents a polynomial equation:
y=a+b*x^2
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into the data
points.
Important Points:
• While there might be a temptation to fit a higher degree polynomial to get a lower error, this can
result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the
curve fits the nature of the problem. Here is an example of how plotting can help:
• Especially look out for the curve towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation.
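A minimal polynomial regression sketch, assuming scikit-learn; PolynomialFeatures expands x into powers of x, and an ordinary linear regression is then fitted on the expanded features, which produces the curved fit described above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() ** 2 + np.random.RandomState(0).normal(scale=1.0, size=30)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)                                   # fits y = a + b*x + c*x^2

print("R^2 on the training data:", round(model.score(x, y), 3))
print("prediction at x = 2:", model.predict([[2.0]]))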
4. Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In this regression
technique in Machine Learning, the selection of independent variables is done with the help of an automatic
process, which involves no human intervention.
This feat is achieved by observing statistical values like R-square, t-stats and AIC metric to discern
significant variables. Stepwise regression fits the regression model by adding/dropping covariates one at a
time based on a specified criterion. Some of the most commonly used Stepwise regression methods are
listed below:
• Standard stepwise regression does two things. It adds and removes predictors as needed for each
step.
• Forward selection starts with the most significant predictor in the model and adds variable for each
step.
• Backward elimination starts with all predictors in the model and removes the least significant
variable for each step.
This modeling technique aims to maximize the prediction power with a minimum number of predictor
variables. It is one of the methods to handle higher dimensionality of data set.
10a. Establish the Association rules in unsupervised learning. [L3][CO2] [6M]
ASSOCIATION RULES:
Association rule learning is a kind of unsupervised learning technique that tests for the dependence of one data element on another data element and maps them accordingly so that the process can be more cost-effective. It tries to discover interesting relations or associations between the variables of the dataset. It relies on various rules to find interesting relations between variables in the database.
Association rule learning is an important approach of machine learning, and it is employed in Market Basket Analysis, web usage mining, continuous production, etc. In market basket analysis, it is an approach used by several big retailers to find the relations between items.
Types of Association Rule Learning
There are the following types of Association rule learning which are as follows −
Apriori Algorithm − This algorithm needs frequent datasets to produce association rules. It is designed to
work on databases that include transactions. This algorithm needs a breadth-first search and hash tree to
compute the itemset efficiently.
It is generally used for market basket analysis and support to learn the products that can be purchased
together. It can be used in the healthcare area to discover drug reactions for patients.
Eclat Algorithm − The Eclat algorithm represents Equivalence Class Transformation. This algorithm needs
a depth-first search method to discover frequent itemsets in a transaction database. It implements quicker
execution than Apriori Algorithm.
F-P Growth Algorithm − The F-P growth algorithm stands for Frequent Pattern growth. It is an enhanced version of the Apriori algorithm. It represents the database in the form of a tree structure that is referred to as a frequent pattern tree (FP-tree). This frequent tree aims to extract the most frequent patterns.
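A minimal market-basket sketch of the Apriori idea, assuming the third-party mlxtend library (and pandas) is installed; the four toy transactions are hypothetical, and the derived rules include ones such as bread -> butter:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is one transaction; True means the item was bought.
data = pd.DataFrame([
    {"bread": True,  "butter": True,  "jam": False},
    {"bread": True,  "butter": True,  "jam": True},
    {"bread": True,  "butter": False, "jam": False},
    {"bread": False, "butter": True,  "jam": True},
])

frequent = apriori(data, min_support=0.5, use_colnames=True)       # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])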
There are various applications of Association Rules, which are as follows −
• Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next product that customers are likely to buy.
• Optional services purchased by tele-connection users (call waiting, call forwarding, DSL, speed call, etc.) help decide how to bundle these functions to maximize revenue.
• Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.) help identify customers likely to need other services.
• An unusual group of insurance claims can be an indication of fraud and can trigger further investigation.
• Medical patient histories can indicate likely complications based on a definite set of treatments.
b. Analyze the real-world applications of ML. [L4][CO6] [6M]
Machine learning has found numerous applications across various industries, revolutionizing processes and
enabling the development of innovative solutions. Here are some real-world applications of machine learning:
1. Healthcare: Machine learning is used for medical diagnosis, patient monitoring, and treatment planning. It
can analyze medical records, images, and genomic data to assist in early disease detection, personalized
medicine, and predicting patient outcomes. Machine learning models can also help identify patterns and
anomalies in large healthcare datasets for improved decision-making.
2. Finance: Machine learning is widely applied in financial institutions for fraud detection, credit scoring,
algorithmic trading, and risk assessment. It can analyze vast amounts of financial data to identify fraudulent
transactions, predict market trends, and optimize investment strategies. Machine learning models are also
used for automated trading based on historical and real-time market data.
3. Retail and E-commerce: Machine learning is used for personalized recommendations, demand forecasting,
inventory management, and pricing optimization. By analyzing customer behavior, browsing history, and
purchase patterns, machine learning models can recommend relevant products to users, optimize pricing
strategies, and predict customer preferences to improve sales and customer satisfaction.
4. Transportation and Logistics: Machine learning is utilized for route optimization, demand forecasting, and
predictive maintenance in transportation and logistics. It can analyze historical data, real-time traffic
information, and weather conditions to optimize routes for delivery vehicles, forecast demand for
transportation services, and detect anomalies in equipment performance to prevent breakdowns.
5. Manufacturing: Machine learning is used in manufacturing industries for quality control, predictive
maintenance, and process optimization. It can analyze sensor data from production lines to detect anomalies
and ensure product quality. Machine learning models can also predict equipment failures, enabling proactive
maintenance to minimize downtime and maximize productivity.
6. Natural Language Processing (NLP): Machine learning techniques are applied in NLP applications such
as language translation, sentiment analysis, chatbots, and voice assistants. NLP models can understand and
generate human language, enabling accurate translation between languages, sentiment analysis of customer
feedback, and interactive conversational experiences.
7. Autonomous Vehicles: Machine learning plays a crucial role in autonomous vehicles by enabling object
detection and recognition, scene understanding, and decision-making. Machine learning models process
sensor data from cameras, LiDAR, and radar to detect and classify objects on the road, navigate complex
environments, and make real-time decisions to ensure safe driving.
8. Energy and Utilities: Machine learning is used for energy load forecasting, anomaly detection in power
grids, and optimizing energy consumption. It can analyze historical energy consumption data, weather
conditions, and other factors to predict future energy demand and optimize energy generation and distribution.
These are just a few examples of the vast range of real-world applications of machine learning. The versatility
and potential of machine learning continue to expand, with ongoing research and development pushing the
boundaries of what is possible in various industries and domains.
UNIT-II
SUPERVISED LEARNING
o Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not, whereas an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output, whereas an unsupervised learning model finds the hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
o Supervised learning needs supervision to train the model, whereas unsupervised learning does not need any supervision.
o Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
o A supervised learning model produces an accurate result, whereas an unsupervised learning model may give a less accurate result compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is
a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final node is then called a leaf node.
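As a minimal sketch of these steps in practice, the snippet below grows a tree with scikit-learn's DecisionTreeClassifier (an optimized CART implementation); the iris dataset, the entropy criterion, and max_depth=3 are illustrative choices, assuming scikit-learn is available.

```python
# Minimal sketch: growing a decision tree with scikit-learn on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion can be "gini" or "entropy" -- the Attribute Selection Measures described here.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                  # textual view of the learned splits
print("Test accuracy:", tree.score(X_test, y_test))
```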
Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j (Pj)²
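The following is a minimal sketch that computes both impurity measures for a toy split of 9 "yes" and 5 "no" labels; the label counts are made up for illustration.

```python
# Minimal sketch: computing entropy and Gini index for a set of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ["yes"] * 9 + ["no"] * 5              # a toy split with 9 "yes" and 5 "no"
print("Entropy:", round(entropy(labels), 3))   # ~0.940
print("Gini   :", round(gini(labels), 3))      # ~0.459
```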
1. LOGISTIC REGRESSION:
Logistic regression is similar to linear regression, but it is used when the dependent variable is not a number but something else (e.g., a "yes/no" response). Although it is called regression, it performs classification based on the regression output and classifies the dependent variable into one of the classes.
Logistic regression is used for prediction of output which is binary, as stated above. For example,
if a credit card company builds a model to decide whether or not to issue a credit card to a
customer, it will model for whether the customer is going to “default” or “not default” on
their card.
Linear Regression
First, linear regression is performed on the relationship between the variables to get the model, and the threshold for the classification line is assumed to be at 0.5.
Logistic regression models the log odds: the log of the ratio of the probability of the event occurring to the probability of it not occurring. In the end, it classifies the variable into whichever class has the higher probability.
2. K-NEAREST NEIGHBORS (K-NN)
The K-NN algorithm is one of the simplest classification algorithms. It is used to identify the data points that are separated into several classes in order to predict the classification of a new sample point. K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a similarity measure (i.e., distance functions).
K-NN works well with a small number of input variables (p), but struggles when the number of
inputs is very large.
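A minimal K-NN sketch with scikit-learn is shown below; the iris dataset and k = 5 are illustrative choices, assuming scikit-learn is available.

```python
# Minimal sketch: K-NN classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# K-NN stores the training data ("lazy learning") and classifies a new point
# by a majority vote among its k nearest neighbours (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```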
3. SUPPORT VECTOR MACHINE (SVM)
Support vector machines are used for both regression and classification. They are based on the concept of decision planes that define decision boundaries. A decision plane (hyperplane) is one that separates a set of objects having different class memberships.
SVM performs classification by finding the hyperplane that maximizes the margin between the two classes with the help of support vectors.
The learning of the hyperplane in SVM is done by transforming the problem using some linear algebra (the example above uses a linear kernel, which assumes linear separability between the classes).
For higher-dimensional data, where the points cannot be separated easily, other kernels are used; they are described in the next section.
Kernel SVM
Kernel SVM takes a kernel function in the SVM algorithm and transforms the data into the required form, mapping it to a higher dimension in which it is separable. Common types of kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels.
Kernel trick uses the kernel function to transform data into a higher dimensional feature space and
makes it possible to perform the linear separation for classification.
The RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel
SVM actually does is create non-linear combinations of features to uplift the samples onto a
higher-dimensional feature space where a linear decision boundary can be used to separate
classes.
So, the rule of thumb is: use linear SVMs for linear problems, and nonlinear kernels such as the
RBF kernel for non-linear problems.
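The following sketch contrasts a linear and an RBF kernel on a non-linear toy problem; the make_moons parameters and C/gamma values are illustrative, assuming scikit-learn is available.

```python
# Minimal sketch: linear vs. RBF-kernel SVMs on a non-linear toy problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    # The RBF kernel usually scores higher here because the classes are not linearly separable.
    print(kernel, "kernel accuracy:", round(clf.score(X_test, y_test), 3))
```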
4. NAIVE BAYES
The naive Bayes classifier is based on Bayes' theorem with independence assumptions between predictors (i.e., it assumes that the presence of a feature in a class is unrelated to any other feature). Even if these features depend on each other or on the existence of the other features, all of these properties are treated as contributing independently, hence the name naive Bayes.
Based on naive Bayes, Gaussian naive Bayes is used for classification when the data is assumed to follow a Gaussian (normal) distribution.
P(yellow) = 10/17
P(green) = 7/17
P(?) = 4/17
3. Calculate Likelihood
P(data/class) = Number of similar observations in the class / Total number of points in the class.
P(?/yellow) = 1/10
P(?/green) = 3/7
4. Calculate Posterior Probability and Classify:
P(green/?) = P(?/green) × P(green) / P(?) = (3/7 × 7/17) / (4/17) = 3/4 = 75%
P(yellow/?) = P(?/yellow) × P(yellow) / P(?) = (1/10 × 10/17) / (4/17) = 1/4 = 25%
The class with the higher posterior probability is chosen; since the posterior for green is about 75%, the new point is classified as green.
Multinomial, Bernoulli naive Bayes are the other models used in calculating probabilities. Thus, a
naive Bayes model is easy to build, with no complicated iterative parameter estimation, which
makes it particularly useful for very large datasets.
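As a minimal sketch, Gaussian naive Bayes can be fitted in a few lines with scikit-learn; the iris dataset and the 70/30 split are illustrative choices, assuming scikit-learn is available.

```python
# Minimal sketch: Gaussian naive Bayes with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = GaussianNB()                    # assumes each feature is normally distributed per class
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Class posteriors for one sample:", nb.predict_proba(X_test[:1]).round(3))
```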
Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while at the same time an associated decision tree
is incrementally developed. The final result is a tree with decision nodes and leaf nodes. It follows
Iterative Dichotomiser 3 (ID3) algorithm structure for determining the split.
Entropy
Entropy is the degree or amount of uncertainty in the randomness of elements. In other words, it is
a measure of impurity.
Intuitively, it tells us about the predictability of a certain event. Entropy calculates the
homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the
sample is equally divided it has an entropy of one.
Information Gain
Information gain measures the relative change in entropy with respect to the independent attribute.
It tries to estimate the information contained by each attribute. Constructing a decision tree is all
about finding the attribute that returns the highest information gain (i.e., the most homogeneous
branches).
Gain(T, X) = Entropy(T) − Entropy(T, X)
where Gain(T, X) is the information gain obtained by applying feature X, Entropy(T) is the entropy of the entire set, and the second term, Entropy(T, X), is the weighted entropy after splitting on feature X.
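Below is a minimal sketch that evaluates this formula for a categorical feature; the "outlook/play" values mimic the classic play-tennis example and are made up for illustration.

```python
# Minimal sketch: information gain of a categorical feature.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(T, X) = Entropy(T) - sum_v |T_v|/|T| * Entropy(T_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [label for f, label in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print("Gain(play, outlook) =", round(information_gain(outlook, play), 3))   # ~0.247
```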
Information gain ranks attributes for splitting at a given node in the tree. The ranking is based on the highest information gain in each split.
The disadvantage of a decision tree model is overfitting: the tree keeps growing deeper to fit the training set, thereby reducing test accuracy.
The random forest classifier is an ensemble algorithm based on bagging, i.e., bootstrap aggregation. Ensemble methods combine more than one algorithm of the same or different kind for classifying objects (e.g., an ensemble of SVMs, naive Bayes classifiers, or decision trees).
The general idea is that a combination of learning models improves the overall result. Deep decision trees may suffer from overfitting, but random forests prevent overfitting by building trees on random subsets. The main reason is that the forest takes the average of all the predictions, which cancels out the biases.
Random forest adds additional randomness to the model while growing the trees. Instead of
searching for the most important feature while splitting a node, it searches for the best feature
among a random subset of features. This results in a wide diversity that generally results in a better
model.
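The following is a minimal random forest sketch; the breast-cancer dataset, 100 trees, and max_features="sqrt" are illustrative choices, assuming scikit-learn is available.

```python
# Minimal sketch: a bagging-based random forest with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample and, at every split,
# considers only a random subset of the features (max_features).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", round(forest.score(X_test, y_test), 3))
```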
Regression techniques:
Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Logistic regression
o Polynomial Regression
o Stepwise Regression
1. Linear Regression
It is one of the most widely known modeling techniques and the most famous regression technique
in Machine Learning. Linear regression is usually among the first few topics which people pick
while learning predictive modeling. In this technique, the dependent variable is continuous, the
independent variable(s) can be continuous or discrete, and the nature of the regression line is
linear.
Linear Regression establishes a relationship between the dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as Regression line).
It is represented by an equation Y=a+b*X + e, where a is the intercept, b is the slope of the line
and e is error term. This equation can be used to predict the value of the target variable based on
the given predictor variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one (>1) independent variable, whereas simple linear regression has only one independent variable. Now, the question is: "How do we obtain the best-fit line?"
How to obtain the best fit line (Value of a and b)?
This task can be easily accomplished by Least Square Method. It is the most common method used
for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the
sum of the squares of the vertical deviations from each data point to the line. Because the
deviations are first squared, when added, there is no canceling out between positive and negative
values.
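A minimal sketch of the least-squares fit is shown below; the data points are made up so that the true line is Y = 1 + 2X.

```python
# Minimal sketch: fitting Y = a + b*X by the least-squares formulas.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3, 5, 7, 9, 11], dtype=float)

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
a = Y.mean() - b * X.mean()                                                # intercept
print(f"a = {a:.2f}, b = {b:.2f}")          # here a = 1.00, b = 2.00
print("Prediction for X = 6:", a + b * 6)
```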
2. Logistic Regression
Logistic regression in Machine Learning is used to find the probability of event=Success and
event=Failure. We should use logistic regression when the dependent variable is binary (0/ 1,
True/ False, Yes/ No) in nature. Here the value of Y ranges from 0 to 1 and it can be represented
by the following equation.
odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability of the presence of the characteristic of interest. A question that you should ask here is: "why have we used log in the equation?"
Since we are working here with a binomial distribution (dependent variable), we need to choose a
link function which is best suited for this distribution. And, it is a logit function. In the equation
above, the parameters are chosen to maximize the likelihood of observing the sample values rather
than minimizing the sum of squared errors (like in ordinary regression).
3. Polynomial Regression
A regression equation in Machine Learning is a polynomial regression equation if the power of the
independent variable is more than 1. The equation below represents a polynomial equation:
y=a+b*x^2
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into
the data points.
Important Points:
• While there might be a temptation to fit a higher degree polynomial to get a lower error,
this can result in over-fitting. Always plot the relationships to see the fit and focus on
making sure that the curve fits the nature of the problem. Here is an example of how
plotting can help:
• Especially look out for curves towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation.
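A minimal polynomial-regression sketch follows; the noisy quadratic data and the degree-2 choice are made up for illustration, assuming numpy is available.

```python
# Minimal sketch: fitting a degree-2 polynomial with numpy.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=1.0, size=x.size)

coeffs = np.polyfit(x, y, deg=2)      # returns [c2, c1, c0] for c2*x^2 + c1*x + c0
y_hat = np.polyval(coeffs, x)
print("Fitted coefficients (highest degree first):", coeffs.round(2))
print("Training MSE:", round(float(np.mean((y - y_hat) ** 2)), 3))
```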
4. Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In this
regression technique in Machine Learning, the selection of independent variables is done with the
help of an automatic process, which involves no human intervention.
This feat is achieved by observing statistical values like R-square, t-stats and AIC metric to discern
significant variables. Stepwise regression fits the regression model by adding/dropping covariates
one at a time based on a specified criterion. Some of the most commonly used Stepwise regression
methods are listed below:
• Standard stepwise regression does two things: it adds and removes predictors as needed at each step.
• Forward selection starts with the most significant predictor in the model and adds a variable at each step.
• Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
This modeling technique aims to maximize the prediction power with a minimum number of predictor
variables. It is one of the methods to handle higher dimensionality of data set.
a Compare Univariate and Multivariate Decision Trees. [L5][CO1] [6M]
3
Univariate Tree: A univariate tree, also known as a decision tree, is a predictive model that uses a
tree-like structure to make predictions or decisions based on a single input variable (feature). It is a
supervised learning algorithm commonly used for both classification and regression tasks. In a
univariate tree, the tree structure is built by recursively partitioning the data based on the values of
a single feature at each internal node.
The decision tree starts with the entire dataset at the root node and selects the best feature to split
the data based on certain criteria (e.g., information gain or Gini index). The data is then divided into
subsets based on the feature value, and the process is repeated recursively for each subset until a
stopping condition is met, such as reaching a maximum tree depth or having a minimum number of
samples at a node. The leaf nodes of the tree contain the predicted outcomes or values.
Multivariate Tree: A multivariate tree, also known as a random forest or ensemble tree, is an
extension of the univariate tree that uses multiple input variables (features) to make predictions. It
combines the predictions of multiple univariate trees to improve the overall accuracy and robustness
of the model. A multivariate tree is typically used for classification and regression tasks.
Instead of using a single feature at each internal node, a multivariate tree randomly selects a subset
of features and builds univariate trees using these selected features. The number of features sampled
at each node and the number of trees in the forest are hyperparameters that can be adjusted. During
prediction, each tree in the forest independently makes a prediction, and the final prediction is
determined by aggregating the individual tree predictions, such as taking a majority vote for
classification tasks or averaging for regression tasks.
The use of multiple features and trees in a multivariate tree helps to capture more complex relationships and
reduces the risk of overfitting. It can handle high-dimensional datasets and provide better generalization
performance compared to a single univariate tree.
Pruning involves the process of removing branches or nodes from a decision tree to simplify its
structure and make it more general. This is typically done by setting certain conditions or criteria
that determine when and how to prune the tree. There are two main types of pruning techniques:
1. Pre-Pruning (Early Stopping): Pre-pruning involves stopping the growth of the tree before it
becomes fully expanded. This is usually done by setting stopping criteria based on various
measures such as maximum tree depth, minimum number of samples required at a node, minimum
improvement in impurity measures (e.g., information gain or Gini index), or other statistical
significance tests. If a node does not meet these criteria, it is considered a leaf node and no further
splitting is performed.
2. Post-Pruning (Cost Complexity Pruning): Post-pruning involves growing the tree to its full size
and then selectively removing branches or nodes based on their estimated predictive ability. This is
done by assigning a cost or penalty to each node based on measures like impurity or error rate. A
complexity parameter, such as the cost complexity parameter or pruning parameter, is used to
control the trade-off between simplicity and accuracy. By iteratively removing nodes with the
highest cost, the tree is pruned to a more optimal size that balances complexity and performance.
The goal of pruning is to find the right balance between complexity and generalization. By
reducing the complexity of the decision tree, pruning helps to avoid overfitting and improves the
model's ability to generalize well to unseen data. Pruning is an essential step in decision tree
construction, especially when dealing with complex datasets or when the decision tree grows too
large.
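As a minimal sketch of post-pruning, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter; the dataset and the particular alpha picked from the pruning path below are illustrative, assuming scikit-learn is available.

```python
# Minimal sketch of post-pruning via cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# cost_complexity_pruning_path returns the effective alphas at which nodes would be pruned.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # an arbitrary mid-range alpha for illustration

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print("Unpruned leaves:", full_tree.get_n_leaves(), "accuracy:", round(full_tree.score(X_test, y_test), 3))
print("Pruned leaves  :", pruned.get_n_leaves(), "accuracy:", round(pruned.score(X_test, y_test), 3))
```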
o Parametric methods use a fixed number of parameters to build the model, whereas non-parametric methods use a flexible number of parameters.
o Parametric methods require less data than non-parametric methods, which require much more data.
o Parametric methods handle interval or ratio data, while non-parametric methods handle the original data.
o Results generated by parametric methods can be easily affected by outliers, whereas results generated by non-parametric methods are not seriously affected by outliers.
o Parametric methods have more statistical power than non-parametric methods.
Probabilistic Modeling:
Bayesian decision theory starts with the assumption that the underlying data distribution and the
relationships between inputs and outputs are probabilistic in nature.
It involves modeling the joint probability distribution of the input features (X) and the corresponding output
labels (Y) using techniques such as Bayesian networks, Gaussian processes, or probabilistic graphical
models.
Prior Knowledge:
Bayesian decision theory incorporates prior knowledge or beliefs about the data before observing any new
instances. This prior information is usually expressed through prior probability distributions or prior
assumptions about the parameters of the model.
Likelihood Estimation:
Given the observed data, Bayesian decision theory aims to estimate the likelihood of different classes or
labels given the input features.
The likelihood is computed based on the probabilistic model and the observed data, using techniques such
as maximum likelihood estimation or Bayesian inference.
Bayesian Inference:
Bayesian decision theory leverages Bayesian inference to update the prior knowledge based on the
observed data and compute posterior probabilities.
The posterior probabilities represent the updated belief about the class labels given the observed data and
are computed using Bayes' theorem.
Decision Rule:
Once the posterior probabilities are obtained, a decision rule is applied to make predictions or decisions.
The decision rule can be based on maximizing the posterior probability (maximum a posteriori estimation),
or it can consider various loss functions or utility functions to minimize the expected loss or maximize
expected utility.
Decision Boundary:
Bayesian decision theory provides a framework for defining decision boundaries that separate different
classes based on the posterior probabilities.
The decision boundaries can be determined by setting thresholds on the posterior probabilities or by
considering the costs associated with different misclassifications.
Optimal Decision-Making:
Bayesian decision theory aims to make decisions that minimize the expected loss or maximize the expected
utility, considering the posterior probabilities and the decision rule.
This allows for optimal decision-making under uncertainty, taking into account both the prior knowledge
and the observed data.
Bayesian decision theory provides a coherent and principled approach to supervised learning by
incorporating probabilistic modeling, Bayesian inference, and decision theory. It allows for the integration
of prior knowledge and uncertainty, leading to robust and optimal decision-making in various domains.
Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification. It is considered the ideal pattern classifier and is often used as the benchmark for other algorithms because its decision rule automatically minimizes its loss function. It makes the assumption that the decision problem is posed in probabilistic terms, and that all the relevant probability values are known.
* P(ω|x) ≡ called the posterior; it is the probability of the predicted class being ω for a given feature vector x. It is analogous to P(O|θ), because the class is the desired outcome to be predicted according to the data distribution (model). Capital 'P' is used because ω is a discrete random variable.
* p(x|ω) ≡ the class-conditional probability density function for the feature. We call it the likelihood of ω with respect to x, a term chosen to indicate that, other things being equal, the category (or class) for which it is large is more "likely" to be the true category. It is a function of the parameters within the parametric space that describes the probability of obtaining the observed data x. Small 'p' is used because x is a continuous random variable. We usually assume it to follow a Gaussian distribution.
* P(ω) ≡ the a priori probability (or simply prior) of class ω. It is usually pre-determined and depends on external factors. It expresses how probable the occurrence of class ω is out of all the classes.
* p(x) ≡ called the evidence; it is merely a scaling factor that guarantees that the posterior probabilities sum to one: p(x) = Σ p(x|ω)·P(ω) over all the classes.
So finally we get the following equation to frame our decision rule
Bayes’ Formula for Classification:
P(ω|x) = p(x|ω) · P(ω) / p(x)
Decision Rule
The above equation is the governing formula for our decision theory. The rule is as follows: for each sample input x, calculate its posterior for every class and assign it to the class corresponding to the maximum posterior value.
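The sketch below applies this rule numerically for a two-class problem with Gaussian class-conditional densities; all priors, means, and variances are made-up numbers for illustration, assuming scipy is available.

```python
# Minimal sketch: applying Bayes' formula P(w|x) = p(x|w) P(w) / p(x)
# for two classes with Gaussian class-conditional densities.
from scipy.stats import norm

priors = {"w1": 0.6, "w2": 0.4}
likelihood = {                      # p(x | w) assumed Gaussian for each class
    "w1": norm(loc=0.0, scale=1.0),
    "w2": norm(loc=2.0, scale=1.0),
}

x = 1.2                             # a single observed feature value
evidence = sum(likelihood[w].pdf(x) * priors[w] for w in priors)            # p(x)
posteriors = {w: likelihood[w].pdf(x) * priors[w] / evidence for w in priors}

print({w: round(p, 3) for w, p in posteriors.items()})
print("Decide:", max(posteriors, key=posteriors.get))   # class with the maximum posterior
```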
Linear Regression:
o Linear Regression is one of the most simple Machine learning algorithm that comes under
Supervised Learning technique and used for solving regression problems.
o It is used for predicting the continuous dependent variable with the help of independent
variables.
o The goal of the Linear regression is to find the best fit line that can accurately predict the
output for the continuous dependent variable.
o If a single independent variable is used for prediction, then it is called Simple Linear Regression, and if there is more than one independent variable, then such regression is called Multiple Linear Regression.
o By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variable(s), and the relationship should be linear in nature.
o The output for Linear regression should only be the continuous values such as price, age,
salary, etc. The relationship between the dependent variable and independent variable can be
shown in below image:
In above image the dependent variable is on Y-axis (salary) and independent variable is on x-
axis(experience). The regression line can be written as:
y= a0+a1x+ ε
Where, a0 and a1 are the coefficients and ε is the error term.
Logistic Regression:
o Logistic regression is one of the most popular Machine learning algorithm that comes under
Supervised Learning techniques.
o It can be used for Classification as well as for Regression problems, but mainly used for
Classification problems.
o Logistic regression is used to predict the categorical dependent variable with the help of
independent variables.
o The output of a Logistic Regression problem can only be between 0 and 1.
o Logistic regression can be used where the probability between two classes is required, such as whether it will rain today or not; the output is either 0 or 1, true or false, etc.
o Logistic regression is based on the concept of Maximum Likelihood estimation. According
to this estimation, the observed data should be most probable.
o In logistic regression, we pass the weighted sum of inputs through an activation function that
can map values in between 0 and 1. Such activation function is known as sigmoid
function and the curve obtained is called as sigmoid curve or S-curve. Consider the below
image:
Sigmoid Function
Now we use the sigmoid function, where the input is z, and we find the probability between 0 and 1, i.e., the predicted y:
sigmoid(z) = 1 / (1 + e^(−z))
As shown above, the sigmoid function converts the continuous input into a probability, i.e., a value between 0 and 1.
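A minimal sketch of the sigmoid function and a logistic regression fit is shown below; the toy "hours studied vs. pass/fail" data is invented for illustration, assuming scikit-learn is available.

```python
# Minimal sketch: the sigmoid function and a scikit-learn logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # maps any real z into (0, 1)

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])  # hours studied
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                  # fail/pass labels

model = LogisticRegression().fit(X, y)
print("P(pass | 2.2 hours):", round(model.predict_proba([[2.2]])[0, 1], 3))
```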
Like a single-layer perceptron model, a multi-layer perceptron model has the same basic model structure but a greater number of hidden layers. A single-layer perceptron is a neural network with four main parameters, i.e., input values, weights and bias, net sum, and an activation function.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes in two
stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate
on the output layer.
o Backward Stage: In the backward stage, the weight and bias values are modified as per the model's requirement. In this stage, the error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layer perceptron model is considered as multiple artificial neural network layers in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead of a linear function, activation functions such as sigmoid, TanH, ReLU, etc. can be used for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
In the multi-layer perceptron diagram above, we can see that there are three inputs and thus three input
nodes and the hidden layer has three nodes. The output layer gives two outputs, therefore there are two
output nodes. The nodes in the input layer take the input and forward it for further processing; in the diagram above, the nodes in the input layer forward their output to each of the three nodes in the hidden layer, and in the same way, the hidden layer processes the information and passes it to the output layer.
Every node in the multi-layer perception uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.
Backpropagation is a supervised learning algorithm for training Multi-layer Perceptrons (Artificial Neural Networks).
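The following is a minimal MLP sketch trained with backpropagation on the XOR problem; the hidden-layer size, solver, and random seed are illustrative choices (a different seed may be needed for convergence), assuming scikit-learn is available.

```python
# Minimal sketch: a multi-layer perceptron learning the XOR function.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                      # XOR is not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print("Predictions:", mlp.predict(X))   # ideally [0, 1, 1, 0]
```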
Whatever you have studied in soft computing, you can write the same answer here also.
https://drive.google.com/file/d/1iWpQCYLJisBe8IXeWUmBVS2ehOKEUejW/view?usp=sharing
UNIT –III
UNSUPERVISED LEARNING
a Explain the various Clustering algorithms. [L2][CO2] [6M]
2
Clustering algorithms are a type of unsupervised machine learning technique used to group similar data
points together based on their inherent characteristics or similarities.
Types of Clustering
Partitioning Clustering: It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of one cluster is minimum as compared to another cluster centroid.
Density-based Clustering: The density-based clustering method connects the highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. This algorithm identifies different clusters in the dataset by connecting the areas of high density into clusters; the dense areas in the data space are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensionality.
Distribution-based Clustering: In the distribution model-based clustering method, the data is divided
based on the probability of how a dataset belongs to a particular distribution. The grouping is done by
assuming some distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian
Mixture Models (GMM).
• Hierarchical Clustering: Hierarchical clustering can be used as an alternative for the partitioned
clustering as there is no requirement of pre-specifying the number of clusters to be created.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or
cluster. Each dataset has a set of membership coefficients, which depend on the degree of membership
to be in a cluster. Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes
also known as the Fuzzy k-means algorithm.
These algorithms analyze the patterns and structures within the data to identify groups or clusters that
share similar properties. Here are some popular clustering algorithms:
K-means: K-means is one of the most widely used clustering algorithms. It aims to partition data into
K distinct clusters based on the mean value of the data points. The algorithm iteratively assigns data
points to the nearest cluster centroid and updates the centroids until convergence.
Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters, either bottom-up
(agglomerative) or top-down (divisive). The algorithm starts with each data point as a separate cluster
and then merges or splits clusters based on their similarities until a desired number of clusters is
obtained.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN groups data
points based on their density. It defines clusters as areas of high-density separated by areas of low-
density. It can discover clusters of arbitrary shape and is robust to noise and outliers.
Mean Shift: Mean Shift iteratively shifts the centroids of clusters towards the densest regions of data
points. It starts with an initial set of centroids and updates them based on the mean shift of data points
within a certain radius until convergence. It is effective in identifying clusters with irregular shapes and
varying densities.
Gaussian Mixture Models (GMM): GMM assumes that the data points are generated from a mixture
of Gaussian distributions. It models the data as a collection of Gaussian components, each representing
a cluster. The algorithm estimates the parameters of the Gaussian distributions to identify the clusters.
Spectral Clustering: Spectral clustering uses the eigenvalues and eigenvectors of a similarity matrix
to perform dimensionality reduction and then applies a clustering algorithm (e.g., K-means) on the
reduced representation. It is particularly effective in identifying non-linear and complex structures.
Agglomerative Clustering: Agglomerative clustering is a bottom-up approach where each data point
starts as a separate cluster, and then clusters are successively merged based on their similarity until a
stopping criterion is met. It forms a hierarchical cluster tree or dendrogram.
These are just a few examples of clustering algorithms, and there are many other variations and
specialized algorithms available depending on the specific requirements and characteristics of the data.
It's important to choose the appropriate clustering algorithm based on the nature of the data and the
desired outcome.
b List out the various applications of clustering. [L1][CO6] [6M]
Clustering algorithms have various applications across different domains. Here are some common applications of clustering:
• Customer segmentation: grouping customers with similar purchasing behaviour for targeted marketing.
• Recommendation systems: grouping users or items with similar preferences to suggest relevant content.
• Image segmentation: grouping pixels with similar characteristics into regions of an image.
• Anomaly and fraud detection: points that do not belong to any dense cluster are flagged as outliers.
• Document and topic grouping: organizing news articles or search results into related groups.
• Biology and medicine: grouping genes with similar expression patterns or patients with similar symptoms.
These are just a few examples of the wide range of applications where clustering algorithms can be employed. The suitability of clustering depends on the specific problem and the nature of the data being analyzed.
In machine learning, mixture models are a class of latent variable models that are used to represent
complex distributions by combining simpler component distributions. Latent variable models involve
unobserved variables (latent variables) that are used to capture hidden patterns or structure in the data.
Let's consider an example of a mixture of Gaussian distributions, which is one of the most commonly
used types of mixture models. In this case, the observed data is assumed to come from a combination
of several Gaussian distributions.
Model Representation:
Latent Variables: We introduce a set of latent variables, often called "mixture indicators" or "cluster
assignments," denoted as z. Each latent variable z corresponds to a specific component of the mixture.
Parameters: We have a set of parameters for the mixture model, including the mixing proportions π
and the parameters (mean and covariance) of each Gaussian component.
Data Generation:
Sample Cluster: For each data point, we first sample a latent variable z from a categorical
distribution according to the mixing proportions π. This determines the component from which the
data point will be generated.
Generate Data: Given the selected component, we sample the data point x from the corresponding
Gaussian distribution.
Model Inference:
Given observed data points x, the goal is to infer the latent variables z and the model parameters.
Inference can be done using various techniques such as Expectation-Maximization (EM) algorithm,
variational inference, or Markov chain Monte Carlo (MCMC) methods.
Model Learning:
The model parameters, including the mixing proportions π and the Gaussian parameters, are learned
from the observed data using the chosen inference algorithm.
The learning process involves iteratively updating the model parameters until convergence,
maximizing the likelihood or posterior probability of the observed data.
Model Utilization:
Once the model is learned, it can be used for various tasks such as clustering, density estimation,
anomaly detection, or generating new data points from the learned distribution.
Mixture models are powerful tools in machine learning as they can capture complex data distributions
by combining simpler components. They are widely used in various domains, including image
analysis, natural language processing, recommendation systems, and many more.
b How mixture density is calculated in unsupervised learning? [L1][CO2] [6M]
In unsupervised learning, the calculation of the mixture density involves
estimating the parameters of a mixture model from the observed data. The
mixture density represents the probability density function (PDF) of the observed
data, which is a combination of multiple component densities.
It's important to note that the specific algorithms and techniques used for the
estimation and calculation of the mixture density may vary depending on the
chosen mixture model and the inference method employed (e.g., EM algorithm,
variational inference, etc.).
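As a minimal sketch of what a mixture density looks like once the parameters are known, the snippet below evaluates a two-component Gaussian mixture p(x) = π1·N(x|μ1,σ1) + π2·N(x|μ2,σ2); the mixing proportions, means, and standard deviations are made-up numbers, assuming scipy is available.

```python
# Minimal sketch: evaluating a two-component Gaussian mixture density.
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])          # mixing proportions, sum to 1
means = np.array([-2.0, 1.5])
stds = np.array([0.8, 1.2])

def mixture_pdf(x):
    # p(x) = sum_k  pi_k * N(x | mu_k, sigma_k)
    return sum(w * norm(loc=m, scale=s).pdf(x) for w, m, s in zip(weights, means, stds))

for x in (-2.0, 0.0, 1.5):
    print(f"p({x}) = {mixture_pdf(x):.4f}")
```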
a Analyze the working principle of K-means Clustering. [L4][CO2] [7M]
4
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the categories of
groups in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and
repeats the process until it does not find the best clusters. The value of k should be predetermined in
this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular
k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster, and continue until the assignments no longer change.
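A minimal K-means sketch with scikit-learn follows; the synthetic three-blob dataset and K = 3 are illustrative choices, assuming scikit-learn is available.

```python
# Minimal sketch: K-means on synthetic blobs with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # iterates assign-to-nearest-centroid / recompute-centroid

print("Cluster centroids:\n", kmeans.cluster_centers_.round(2))
print("Inertia (sum of squared distances to centroids):", round(kmeans.inertia_, 2))
```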
b Give the different types of Partitional algorithms used in clustering. [L2][CO2] [5M]
Partitional clustering algorithms are a class of clustering algorithms that partition the dataset into non-
overlapping clusters. Here are some commonly used types of partitional clustering algorithms:
1. K-means: K-means is a widely used partitional clustering algorithm. It aims to partition the data into
K clusters, where K is pre-specified by the user. The algorithm iteratively assigns data points to the
nearest cluster centroid and updates the centroids until convergence.
2. K-medoids: K-medoids is a variation of K-means that uses actual data points, known as medoids, as
cluster centers. It is robust to outliers compared to K-means, as medoids can be any data point in the
cluster rather than the mean of the cluster.
3. Fuzzy C-means: Fuzzy C-means extends K-means by allowing data points to belong to multiple
clusters with different degrees of membership. It assigns membership weights to data points indicating
their degree of belongingness to each cluster. This algorithm is useful when data points exhibit partial
membership to different clusters.
4. Partitioning Around Medoids (PAM): PAM is a partitional clustering algorithm that, similar to K-
medoids, uses medoids as cluster centers. It differs from K-medoids in the way it selects initial
medoids and updates them during the iterative process. PAM aims to minimize the total dissimilarity
between data points and their closest medoid.
5. CLARA (Clustering Large Applications): CLARA is an algorithm that extends PAM to handle
large datasets. It samples subsets of the data and applies PAM to each subset, providing an
approximate clustering solution. The final clustering is obtained by merging the results of multiple
runs.
6. CLARANS (Clustering Large Applications based on RANdomized Search): CLARANS is
another partitional clustering algorithm suitable for large datasets. It randomly explores the search
space to find the best medoids and avoid exhaustive search. It offers a trade-off between efficiency
and accuracy.
7. X-means: X-means is an extension of K-means that automatically determines the optimal number of
clusters. It starts with a single cluster and recursively splits clusters based on a statistical criterion
until the optimal number of clusters is found.
8. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): BIRCH is a partitional
clustering algorithm that constructs a tree-like structure called the Clustering Feature Tree (CFT) to
perform clustering. It performs hierarchical clustering on the CFT, resulting in a set of subclusters that
represent the final clustering solution.
a List out the various types of Cluster methods in unsupervised learning. [L1][CO6] [6M]
5
Types of Clustering
Several approaches to clustering exist. For an exhaustive list, see A Comprehensive Survey of
Clustering Algorithms Xu, D. & Tian, Y. Ann. Data. Sci. (2015) 2: 165. Each approach is best suited
to a particular data distribution. Below is a short discussion of four common approaches, focusing on
centroid-based clustering using k-means.
Centroid-based Clustering
Centroid-based clustering organizes the data into non-hierarchical clusters. K-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for
arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have
difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do
not assign outliers to clusters.
Distribution-based Clustering
This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. A distribution-based algorithm clusters the data into a small number of Gaussian distributions; as the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm.
Hierarchical Clustering
Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well
suited to hierarchical data, such as taxonomies. See Comparison of 61 Sequenced Escherichia coli
Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. In addition,
another advantage is that any number of clusters can be chosen by cutting the tree at the right level.
Similarities:
1. Unsupervised Learning: Both average-link clustering and k-means are unsupervised learning
algorithms, meaning they do not require labeled data for training. They discover patterns and
groupings in the data without prior knowledge of the class labels.
2. Iterative Process: Both algorithms use an iterative process to refine their cluster assignments. They
repeatedly update the cluster centroids or merge clusters until convergence or a stopping criterion is
met.
Differences:
1. Approach: Average-link clustering is a hierarchical (agglomerative) method that successively merges the two clusters with the smallest average inter-cluster distance, whereas k-means is a partitional method that directly divides the data into K flat clusters around centroids.
2. Number of clusters: K-means requires the number of clusters K to be specified in advance, while average-link clustering builds a full dendrogram that can be cut at any level to obtain any number of clusters.
3. Output and cost: Average-link clustering produces a tree of nested clusters and is computationally more expensive, whereas k-means produces a single flat partition and scales better to large datasets but depends on the initial centroids.
In traditional k-means clustering, each data point is assigned to the cluster with the nearest centroid,
where the centroid is the mean vector of the data points in that cluster. The algorithm aims to
minimize the sum of squared distances between the data points and their assigned centroids. However,
this approach assumes that the clusters are spherical and that the data features are continuous and
normally distributed.
Generalized k-means clustering relaxes these assumptions and offers more flexibility. Here are a few
key elements that can be customized in generalized k-means clustering:
1. Distance metrics: Instead of relying solely on the Euclidean distance, generalized k-means allows for
the use of other distance metrics that are more suitable for specific data types. For example, for
categorical data, Hamming distance or Jaccard distance can be used.
2. Cluster shape: Traditional k-means assumes that clusters are spherical and have equal variance.
Generalized k-means allows for different cluster shapes, such as elliptical or arbitrary-shaped clusters.
This is achieved by using a covariance matrix for each cluster and considering the Mahalanobis
distance to measure the dissimilarity between data points and cluster centroids.
3. Weighting: Generalized k-means allows for assigning different weights to different dimensions or
features of the data. By assigning appropriate weights, certain dimensions can be emphasized or de-
emphasized in the clustering process.
4. Constraints: Generalized k-means can incorporate additional constraints into the clustering process.
For example, constraints can be applied to enforce that certain data points must belong to specific
clusters or that clusters must have a minimum number of data points.
Overall, generalized k-means clustering offers more flexibility and adaptability to different data types
and clustering scenarios. By customizing the distance metric, cluster shape, weighting, and
constraints, it becomes possible to better model and analyze complex data sets in a way that suits the
specific requirements of the problem at hand.
b Estimate the problems associated with clustering large data. [L5][CO6] [6M]
Clustering large data sets can pose several challenges and problems. Here are some common issues
associated with clustering large data:
1. Scalability: As the data size increases, clustering algorithms may struggle to handle the computational
and memory requirements. The time complexity of clustering algorithms can be quite high, and the
computational cost grows exponentially with the number of data points. Efficient algorithms and
distributed computing techniques are required to tackle scalability issues.
2. High Dimensionality: Large data sets often have a high number of dimensions or features, which can
lead to the curse of dimensionality. In high-dimensional spaces, the distance between points becomes
less meaningful, and the clustering algorithms may struggle to find meaningful clusters.
Dimensionality reduction techniques or feature selection methods can be employed to mitigate this
problem.
3. Computational Complexity: Many clustering algorithms have computational complexities that are
quadratic or higher, such as hierarchical clustering algorithms or k-means clustering. With large data
sets, these algorithms can become prohibitively slow or impractical to execute. Approximation
techniques, parallelization, or sampling methods may be used to address this challenge.
4. Noise and Outliers: Large data sets often contain noise, outliers, or irrelevant data points. These
outliers can have a significant impact on clustering results, as they may form their own clusters or
disrupt the clustering of other data points. Preprocessing steps, such as outlier detection and data
cleaning, are important to handle noisy data effectively.
5. Cluster Interpretability: Interpreting and understanding clusters in large data sets can be
challenging. Visualizing high-dimensional data becomes more difficult, and it may be hard to discern
meaningful patterns or extract insights from the clustering results. Advanced visualization techniques
and dimensionality reduction methods can help in improving interpretability.
6. Cluster Validity and Evaluation: Assessing the quality and validity of clustering results becomes
more complex with large data sets. Traditional clustering evaluation metrics may not be suitable, and
it may be difficult to define ground truth or expert-labeled clusters for comparison. Developing
appropriate evaluation measures for large-scale clustering is an ongoing research area.
7. Storage and Memory Constraints: Large data sets require significant storage space and memory to
process and store intermediate results during clustering. Managing storage and memory constraints
can be challenging, particularly when dealing with distributed computing or limited resources.
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the
unlabeled datasets into a cluster and also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure
is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: in hierarchical clustering there is no requirement to predetermine the number of clusters as we did in the K-means algorithm.
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the datasets
into clusters, it follows the bottom-up approach. It means, this algorithm considers each dataset as a
single cluster at the beginning, and then start combining the closest pair of clusters together. It does this
until all the clusters are merged into a single cluster that contains all the datasets.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number
of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there
will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one cluster. There
will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
The dendrogram is a tree-like structure that is mainly used to record each step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points, and the X-axis shows all the data points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in agglomerative clustering, and
the right part is showing the corresponding dendrogram.
o As we have discussed above, first the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another dendrogram.
o At last, the final dendrogram is created that combines all the data points together.
The closest distance between the two clusters is crucial for the hierarchical clustering. There are various
ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage methods are given below:
1. Single Linkage: Also known as nearest-neighbour linkage, it measures the distance between two clusters as the shortest distance between any two points, one from each cluster.
Let's say we have the following data points and their pairwise distances:
• A: (1, 1)
• B: (2, 2)
• C: (4, 4)
• D: (6, 6)
Initially, each data point is considered as a separate cluster.
The dendrogram representation of the clustering process would show the steps of merging clusters
based on single linkage.
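As a brief sketch (assuming SciPy is available), the following code computes the pairwise Euclidean distances for the points A, B, C, D above and performs single-linkage clustering; the merge order described in the comments follows from these distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# The four points from the example above
points = np.array([[1, 1], [2, 2], [4, 4], [6, 6]])   # A, B, C, D

# Pairwise Euclidean distances: AB ~ 1.41, BC = CD ~ 2.83, AC ~ 4.24, ...
print(squareform(pdist(points)).round(2))

# Single linkage: at each step merge the two clusters whose closest members
# are nearest.  A and B merge first (distance ~ 1.41); the next merges
# happen at distance ~ 2.83.
Z = linkage(points, method='single')
print(Z.round(2))   # each row: [cluster i, cluster j, merge distance, size]
```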
2. Complete Linkage: Also known as farthest-neighbour linkage, it measures the distance between two clusters as the maximum distance between any two points, one from each cluster. It is one of the popular linkage methods as it forms tighter clusters than single linkage.
The dendrogram representation of the clustering process using complete linkage would show the steps
of merging clusters based on the maximum distance.
3. Average Linkage: The distance between each pair of points, one from each cluster, is added up and divided by the total number of pairs to obtain the average distance between the two clusters. It is also one of the most popular linkage methods.
The dendrogram representation of the clustering process using average linkage would show the steps
of merging clusters based on the average distance.
4. Centroid Linkage: The distance between the centroids of the two clusters is calculated.
From the above-given approaches, we can apply any of them according to the type of problem or
business requirement.
These are just examples to demonstrate the basic concepts of single linkage, complete linkage, and
average linkage in hierarchical clustering. In practice, various other linkage methods and distance
metrics can be used based on the specific requirements of the data and the clustering task.
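As a rough sketch of how the choice of linkage changes the result, scikit-learn's AgglomerativeClustering accepts the linkage method as a parameter. The dataset and the number of clusters below are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two loose groups plus one outlying point (illustrative data)
X = np.array([[0.0, 0.0], [0.3, 0.2], [0.2, 0.4],
              [5.0, 5.0], [5.2, 5.1], [8.0, 8.0]])

# Compare how different linkage methods assign the same points to 2 clusters.
for method in ['single', 'complete', 'average']:
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    print(method, model.fit_predict(X))

# Centroid linkage is not offered here; in SciPy it is available via
# scipy.cluster.hierarchy.linkage(X, method='centroid').
```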
There are several ways to measure the distance between two clusters in cluster analysis. A few commonly used distance measures are:
Euclidean Distance: It is the straight-line distance between two points in the Euclidean space. In the
context of clustering, the distance between two clusters is computed as the Euclidean distance
between their centroid points. The centroid of a cluster is the mean of the feature values of all the
points in that cluster.
Manhattan Distance: Also known as the city block distance or L1 norm, it is the sum of the absolute
differences between the coordinates of two points. In clustering, the Manhattan distance between two
clusters can be calculated as the average Manhattan distance between all pairs of points from the two
clusters.
Minkowski Distance: It is a generalization of the Euclidean and Manhattan distances. The Minkowski distance of order p between two points is the p-th root of the sum of the absolute differences of their coordinates raised to the power p. When p = 1, it reduces to the Manhattan distance, and when p = 2, it reduces to the Euclidean distance.
Mahalanobis Distance: It takes into account the covariance structure of the data and is used when the
data has correlated features. The Mahalanobis distance between two clusters is calculated based on the
Mahalanobis distance between their centroid points, which considers the covariance matrix of the
data.
Linkage-based Distances: In hierarchical clustering, distances between clusters can be measured
using different linkage methods, such as single linkage, complete linkage, and average linkage. These
methods define the distance between two clusters based on the distances between their individual
points.
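A brief Python sketch of these distance measures using SciPy follows; the two example points and the small data sample used to obtain a covariance matrix are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski, mahalanobis

u = np.array([1.0, 2.0])
v = np.array([4.0, 6.0])

print(euclidean(u, v))          # straight-line distance: 5.0
print(cityblock(u, v))          # Manhattan / L1 distance: 7.0
print(minkowski(u, v, p=3))     # Minkowski distance of order 3
print(minkowski(u, v, p=1))     # p = 1 reduces to the Manhattan distance
print(minkowski(u, v, p=2))     # p = 2 reduces to the Euclidean distance

# Mahalanobis distance needs the inverse covariance matrix of the data;
# a small data sample is assumed here just to obtain one.
data = np.array([[1, 2], [2, 3], [3, 5], [4, 6], [5, 8]], dtype=float)
VI = np.linalg.inv(np.cov(data.T))
print(mahalanobis(u, v, VI))
```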
The choice of distance measure depends on the nature of the data and the specific requirements of the
clustering problem. Different distance measures may yield different cluster structures and
interpretations, so it is important to consider the characteristics of your data and the goals of your
analysis when selecting a distance measure.
UNIT-IV
NON PARAMETRIC METHODS
&
DIMENSIONALITY REDUCTION
Algorithms that do not make strong assumptions about the form of the
mapping function are called nonparametric machine learning
algorithms. By not making assumptions, they are free to learn any
functional form from the training data.
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features.
Nonparametric methods seek to best fit the training data in
constructing the mapping function, whilst maintaining some ability to
generalize to unseen data. As such, they are able to fit a large number
of functional forms.
Some more examples of popular nonparametric machine learning
algorithms are:
• k-Nearest Neighbors
• Decision Trees like CART and C4.5
• Support Vector Machines
Benefits of Nonparametric Machine Learning Algorithms:
Flexibility: Capable of fitting a large number of functional forms.
Power: No assumptions (or weak assumptions) about the underlying
function.
Performance: Can result in higher performance models for
prediction.
Limitations of Nonparametric Machine Learning Algorithms:
More data: Require a lot more training data to estimate the mapping
function.
Slower: A lot slower to train as they often have far more parameters
to train.
Overfitting: More of a risk to overfit the training data and it
is harder to explain why specific predictions are made.
b List out advantages and limitations of non-parametric methods in ML. [L2][CO3] [8M]
Histogram estimator
The density at a sample depends on the number of training samples that fall in the same bin. Given an origin and a bin width h, the histogram estimate is
p̂(x) = #{x^t in the same bin as x} / (N h)
In constructing the histogram of densities we choose the origin and the bin width; the position of the origin affects the estimation near the bin boundaries.
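A minimal Python sketch of a histogram density estimate with NumPy is given below; the sample, the origin, and the bin width are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)   # assumed 1-D training sample

origin, h = -4.0, 0.5                          # chosen origin and bin width
bins = np.arange(origin, 4.0 + h, h)

counts, edges = np.histogram(x, bins=bins)
p_hat = counts / (len(x) * h)                  # p_hat = (#x^t in the bin) / (N h)
print(p_hat.round(3))
```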
Naive Estimator
Unlike the histogram estimator, the naive estimator does not use the concept of an origin; there is no need to choose one. The density at a sample depends on the neighbouring training samples.
Given the training set X = {x^t}, t = 1, ..., N, and the bin width h, the naive density estimator is
p̂(x) = #{x^t : x - h/2 < x^t ≤ x + h/2} / (N h)
The values within h/2 to the left and right of the sample contribute to the density.
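A small Python sketch of the naive estimator formula above follows; the sample and the bin width h are assumptions made for illustration.

```python
import numpy as np

def naive_estimate(x, sample, h):
    # p_hat(x) = #{x^t : x - h/2 < x^t <= x + h/2} / (N h)
    count = np.sum((sample > x - h / 2) & (sample <= x + h / 2))
    return count / (len(sample) * h)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)          # assumed 1-D training sample
print(naive_estimate(0.0, sample, h=0.5))
```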
Kernel Density Estimator (KDE)
The kernel estimator is used to smoothen the probability density function (pdf) and cumulative distribution function (CDF) estimates. The kernel is nothing but a weight function; the Gaussian kernel is the most popular kernel:
K(u) = (1/√(2π)) exp(-u²/2)
The kernel estimator is also called the Parzen window estimator:
p̂(x) = (1/(N h)) Σ_{t=1..N} K((x - x^t)/h)
• As |x - x^t| increases, i.e., as the training sample gets farther away from the given sample, the kernel value decreases. Hence the contribution of a farther sample is smaller than that of the nearest training samples. There are many more kernels: Gaussian, rectangular, triangular, biweight, uniform, cosine, etc.
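A rough Python sketch of the Gaussian kernel (Parzen window) estimator defined above is given below; SciPy's gaussian_kde gives a comparable result with an automatically chosen bandwidth. The sample and the value of h are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def parzen_estimate(x, sample, h):
    # p_hat(x) = (1 / (N h)) * sum_t K((x - x^t) / h), with a Gaussian kernel K
    u = (x - sample) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum() / (len(sample) * h)

rng = np.random.default_rng(2)
sample = rng.normal(size=300)                # assumed 1-D training sample

print(parzen_estimate(0.0, sample, h=0.3))   # manual Parzen window estimate
print(gaussian_kde(sample)(0.0))             # SciPy KDE, for comparison
```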
K – Nearest Neighbor Estimator (KNN Estimator)
Unlike the previous methods, which fix the bin width h, in this estimation we fix the number of nearest neighbours k. The density at a sample depends on the value of k and the distance of the k-th nearest neighbour from the sample, which makes it closely related to the kernel estimation method. The k-NN density estimate is
p̂(x) = k / (2 N d_k(x))
where d_k(x) is the Euclidean distance from the sample to its k-th nearest neighbour.
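A minimal Python sketch of the k-NN density estimate p̂(x) = k / (2 N d_k(x)) follows; the sample and the value of k are assumptions made for illustration.

```python
import numpy as np

def knn_density(x, sample, k):
    # d_k(x): distance from x to its k-th nearest training sample
    d_k = np.sort(np.abs(sample - x))[k - 1]
    return k / (2 * len(sample) * d_k)

rng = np.random.default_rng(3)
sample = rng.normal(size=300)        # assumed 1-D training sample
print(knn_density(0.0, sample, k=10))
```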
4 b Differentiate Exploratory and Confirmatory factor analysis. [L5][CO4] [6M]
Parametric methods: the data is assumed to follow a distribution (usually normal). Examples: Logistic Regression, Naïve Bayes, etc.
Non-parametric methods: there is no assumed distribution for the data. Examples: KNN, Decision Tree, etc.
Multiple Factor Analysis: This type of factor analysis is used when your variables are structured in changeable groups. For example, you may have a teenager's health questionnaire with several sections such as sleeping patterns, addictions, psychological health, mobile phone addiction, or learning disabilities.
Multiple Factor Analysis is performed in two steps:
• First, Principal Component Analysis is performed on each section of the data. This gives a useful eigenvalue, which is used to normalize the data sets for further use.
• The normalized data sets are then merged into a single matrix, and a global PCA is performed.
Generalized Procrustes Analysis (GPA):
Procrustes analysis is a way to compare two approximate sets of configurations and shapes; it was originally developed to match two solutions from factor analysis. The technique was later extended, as Generalized Procrustes Analysis, so that more than two shapes can be compared. The shapes are properly aligned to achieve a target shape. GPA mainly uses geometric transformations, namely:
• Isotropic rescaling,
• Reflection,
• Rotation,
• Translation of matrices to compare the sets of data.
Eigenvalues
When factor analysis generates the factors, each factor has an associated eigenvalue, which gives the total variance explained by that factor.
Usually, the factors having eigenvalues greater than 1 are useful:
Percentage of variation explained by F1 = Eigenvalue of Factor 1 / No. of Variables
Here F1, F2, and F3 denote Factor 1, Factor 2, and Factor 3.
The factors that affect a question the most (and therefore have the highest factor loadings) are bolded. Factor loadings are similar to correlation coefficients in that they can vary from -1 to 1. The closer a loading is to -1 or 1, the more the factor affects the variable.
7 List out and explain the various dimensionality reduction techniques. [L2][CO3] [12M]
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.
There are two ways to apply the dimension reduction technique, which
are given below:
Feature Selection
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. Some features are fed to the ML model and its performance is evaluated; the performance decides whether to add or remove features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with (a brief code sketch is given after the list below). Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
o Ridge Regression, etc.
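Below is a brief Python sketch of one filter technique (chi-square test) and one wrapper technique (forward selection). The use of scikit-learn, its Iris dataset, and the number of selected features are assumptions made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: rank features with a chi-square test, keep the best 2.
X_filtered = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Wrapper method: forward selection driven by a model's cross-validated score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, direction='forward')
X_wrapped = selector.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)
```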
Feature Extraction:
Feature extraction is the process of transforming the space containing
many dimensions into space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer
resources while processing the information.
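As a sketch of feature extraction, Principal Component Analysis projects the data onto a smaller number of new dimensions; the dataset and the number of components below are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Transform the 4 original features into 2 new, uncorrelated components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance kept by each component
```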
Auto-encoders
Applications of LDA
Some of the common real-world applications of Linear discriminant
Analysis are given below:
Face Recognition
Face recognition is a popular application of computer vision, where each face is represented as a combination of a number of pixel values. In this case, LDA is used to reduce the number of features to a manageable number before the classification process. It generates a new template in which each dimension is a linear combination of pixel values. If the linear combination is generated using Fisher's linear discriminant, it is called a Fisher face.
Medical
In the medical field, LDA is used to classify a patient's disease as mild, moderate, or severe on the basis of various health parameters and the ongoing medical treatment. This classification helps the doctors in either increasing or decreasing the pace of the treatment.
Customer Identification
With the help of LDA, we can identify and select the features that characterize the group of customers who are likely to purchase a specific product, for example in a shopping mall.
For Predictions
LDA can also be used for making predictions and hence in decision making. For example, "Will you buy this product?" will give a predicted result belonging to one of two possible classes: buying or not buying.
In Learning
Nowadays, robots are being trained to learn and talk in order to simulate human work, which can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters such as pitch, frequency, sound, tune, etc.
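A small Python sketch of LDA used both as a supervised dimensionality-reduction step and as a classifier follows; the use of scikit-learn and the Iris dataset are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # project onto 2 discriminant axes

print(X_lda.shape)                # (150, 2)
print(lda.predict(X[:5]))         # LDA can also classify directly
```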
9 a Compare Multidimensional scaling and Metric multidimensional scaling. [L5][CO5] [6M]
Multidimensional scaling (MDS) and Metric multidimensional scaling
(MMDS) are both techniques used in data analysis to visualize and analyze
the relationships between objects or entities based on their similarities or
dissimilarities. However, there are some key differences between these two
methods.
Conceptual Difference:
MDS: Multidimensional scaling is a general term that refers to a
family of methods aimed at representing the structure of similarity
or dissimilarity data in a lower-dimensional space. MDS attempts to
preserve the original distances or dissimilarities between objects in
the data.
MMDS: Metric multidimensional scaling is a specific form of MDS
that assumes the underlying distances or dissimilarities between
objects are metric (i.e., satisfy the triangle inequality). It aims to find
a low-dimensional representation that not only preserves the ordinal
relationships between objects but also satisfies the triangle
inequality.
Mathematical Difference:
MDS: MDS techniques, such as classical MDS or non-metric MDS,
focus on finding a configuration of points in a lower-dimensional
space that best approximates the pairwise dissimilarities between
objects. It uses optimization algorithms to minimize the discrepancy
between observed dissimilarities and distances in the reduced space.
MMDS: MMDS, on the other hand, specifically deals with metric
dissimilarities. It constructs a Euclidean distance matrix based on the
dissimilarities and then applies classical MDS to obtain a low-
dimensional representation that respects the metric properties of the
data.
Data Requirements:
MDS: MDS can handle various types of dissimilarity measures,
including ordinal, interval, or even non-metric dissimilarities. It is
more flexible in terms of data requirements and can be applied to
both metric and non-metric data.
MMDS: MMDS assumes that the dissimilarity measures are metric,
meaning they obey the triangle inequality. This assumption restricts
its applicability to situations where the data can be represented by a
metric space.
Preserved Relationships:
MDS: In MDS, the goal is to preserve the original pairwise
dissimilarities or similarities as closely as possible in the lower-
dimensional space. The emphasis is on preserving the ordinal
relationships between objects.
MMDS: MMDS aims to preserve the metric relationships between
objects, in addition to the ordinal relationships. It ensures that the
distances between objects in the reduced space conform to the triangle
inequality.
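A brief Python sketch showing metric and non-metric MDS side by side with scikit-learn is given below; the dataset and parameters are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)

# Metric MDS: tries to preserve the actual pairwise distances.
metric_mds = MDS(n_components=2, metric=True, random_state=0)
X_metric = metric_mds.fit_transform(X)

# Non-metric MDS: tries to preserve only the ordering of the dissimilarities.
nonmetric_mds = MDS(n_components=2, metric=False, random_state=0)
X_nonmetric = nonmetric_mds.fit_transform(X)

print(X_metric.shape, X_nonmetric.shape)
print(metric_mds.stress_, nonmetric_mds.stress_)   # lower stress = better fit
```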
b List out the applications of MDS. [L1][CO6] [6M]
Multidimensional scaling (MDS) has various applications across different
fields. Some of the common applications of MDS include:
Feature Selection: selects a subset of relevant features from the original set of features. It can be categorized into filter, wrapper, and embedded methods.
Feature Extraction: extracts a new set of features that are more informative and compact. It can be categorized into linear and nonlinear methods.
Wrapper Methods
● Forward selection
● Backward Selection
● Exhaustive selection
● Recursive Selection
Filter Methods
● Missing value
● Information gain
● Chi-square Test
● Fisher’s Score
Embedded Methods
● Regularization
● Random Forest Importance
UNIT –V
REINFORCEMENT LEARNING
The example shows a robot, a diamond, and fires. The goal of the robot is to get the reward, that is, the diamond, and to avoid the hurdles, that is, the fires. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the least hurdles. Each right step gives the robot a reward, and each wrong step subtracts from its reward. The total reward is calculated when it reaches the final reward, the diamond.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model
will start
• Output: There are many possible outputs as there are a variety of
solutions to a particular problem
• Training: The training is based upon the input; the model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model keeps on learning continuously.
• The best solution is decided based on the maximum reward.
Definition: Unsupervised learning is trained using unlabelled data without any guidance, while reinforcement learning works by interacting with the environment.
Type of data: Unsupervised learning uses unlabelled data, while reinforcement learning has no predefined data.
Algorithms: Unsupervised learning uses K-Means, C-Means, Apriori; reinforcement learning uses Q-Learning, SARSA.
3 a List the applications of Reinforcement Learning and explain it. [L2][CO6] [6M]
RL has numerous applications across various domains. Here are
some notable applications of reinforcement learning:
1. Game Playing: RL has been highly successful in game-playing scenarios. For instance, AlphaGo, developed by DeepMind, used RL to defeat world champions in the board game Go. RL has also been applied to games like chess, poker, and video games, achieving remarkable results.
2. Robotics: RL enables robots to learn tasks and behaviours
autonomously. Robots can learn to grasp objects, walk,
navigate through environments, and perform complex tasks
using reinforcement learning algorithms.
3. Autonomous Vehicles: Reinforcement learning can be
employed to train autonomous vehicles to make decisions in
dynamic and uncertain environments. RL helps in tasks like
lane following, collision avoidance, and efficient route
planning.
4. Resource Management: RL can optimize resource allocation
in various domains, such as energy management, traffic signal
control, and inventory management. It learns to make
decisions that maximize efficiency, minimize costs, or
optimize performance based on feedback and rewards.
5. Recommendation Systems: Reinforcement learning can
enhance recommendation systems by learning user preferences
and making personalized recommendations. By incorporating
user feedback and reinforcement signals, RL algorithms can
adapt and improve the recommendations over time.
6. Healthcare: RL can assist in optimizing treatment plans and
personalized medicine. It can learn from patient data and
clinical trials to suggest appropriate interventions, drug
dosages, and treatment schedules.
7. Finance: RL can be applied to algorithmic trading, portfolio
management, and risk analysis. RL algorithms can learn to
make trading decisions by analysing market data, optimizing
portfolios, and adapting to changing market conditions.
8. Industrial Control Systems: Reinforcement learning can
optimize complex industrial processes by learning control
policies that maximize efficiency, reduce downtime, and
minimize resource consumption. It has applications in areas
like manufacturing, power systems, and chemical processes.
9. Natural Language Processing: RL algorithms have been
used in natural language processing tasks such as dialogue
systems, machine translation, and text generation. RL can
improve the performance of language models by learning to
generate coherent and contextually appropriate responses.
10. Education: Reinforcement learning can be employed in
adaptive learning systems and intelligent tutoring systems. It
can adapt the learning experience based on the student's
progress, providing personalized feedback and optimizing
learning outcomes.
Algorithms: Supervised learning uses Linear Regression, Logistic Regression, SVM, KNN, etc.; reinforcement learning uses Q-Learning, SARSA.
Applications: Supervised learning is used for risk evaluation and sales forecasting; reinforcement learning is used for self-driving cars, gaming, and healthcare.
We can represent the agent state using the Markov state, which contains all the required information from the history. The state St is a Markov state if it satisfies the condition
P[S(t+1) | S(t)] = P[S(t+1) | S(1), ..., S(t)]
Markov Property: It says that if the agent is present in the current state s1, performs an action a1 and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; future actions and states do not depend on past actions, rewards, or states.
5 a Explain in detail about Single State Case: K-Armed Bandit problem. [L2][CO4] [6M]
A bandit is defined as someone who steals your money. A one-armed
bandit is a simple slot machine wherein you insert a coin into the
machine, pull a lever, and get an immediate reward. But why is it
called a bandit? It turns out all casinos configure these slot machines
in such a way that all gamblers end up losing money!
A multi-armed bandit is a more complicated slot machine: instead of one lever, there are several levers the gambler can pull, with each lever giving a different return. The probability distribution of the reward corresponding to each lever is different and is unknown to the gambler.
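Below is a minimal Python sketch of an epsilon-greedy agent for a K-armed bandit; the number of levers, their win probabilities, and the value of epsilon are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = [0.2, 0.5, 0.75]        # unknown win probability of each lever (assumed)
Q = np.zeros(len(true_p))        # estimated value of each lever
N = np.zeros(len(true_p))        # number of times each lever has been pulled
epsilon = 0.1

for t in range(5000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        a = int(rng.integers(len(true_p)))
    else:
        a = int(np.argmax(Q))
    reward = float(rng.random() < true_p[a])   # Bernoulli reward from lever a
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]             # incremental mean update

print(Q.round(2))   # estimates should approach the true probabilities
```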
b Distinguish between model-based learning and temporal difference learning. [L5][CO5] [6M]
Model-based learning and temporal difference (TD) learning are two
approaches to reinforcement learning, which is a branch of machine
learning concerned with learning optimal behavior through
interaction with an environment. Here are the key differences
between these two approaches:
1. Learning Approach:
• Model-Based Learning: In model-based learning, the
agent learns a model of the environment, including its
dynamics and transition probabilities. It then uses this
model to plan and make decisions about its actions.
• Temporal Difference Learning: TD learning is a
model-free learning approach. Instead of explicitly
learning the dynamics of the environment, the agent
directly estimates the value or utility of states or state-
action pairs through trial-and-error experience.
2. Planning vs. Direct Learning:
• Model-Based Learning: With a learned model of the
environment, model-based learning algorithms can
perform planning, which involves simulating different
sequences of actions and estimating their outcomes to
make decisions.
• Temporal Difference Learning: TD learning
algorithms do not perform explicit planning. They
learn from direct experience by updating value
estimates based on the observed rewards and the
estimated values of subsequent states.
3. Exploration vs. Exploitation:
• Model-Based Learning: Model-based learning can
incorporate explicit exploration strategies based on
uncertainty about the model. By actively exploring the
environment, the agent can improve its model and
make more informed decisions.
• Temporal Difference Learning: TD learning
algorithms typically use exploration strategies to
balance exploration and exploitation but do not rely on
a learned model to guide their exploration. Common
approaches include epsilon-greedy or softmax
exploration.
4. Sample Efficiency:
• Model-Based Learning: Model-based learning
algorithms can achieve higher sample efficiency since
they can leverage their learned model to plan and
simulate potential outcomes before executing actions.
• Temporal Difference Learning: TD learning
algorithms might require more samples to converge to
an optimal policy since they rely on direct interaction
with the environment to estimate values.
5. Computational Complexity:
• Model-Based Learning: Model-based learning can be
computationally more demanding because it involves
learning and maintaining a model of the environment,
as well as performing planning and simulation.
• Temporal Difference Learning: TD learning
algorithms are often computationally simpler since
they do not require explicit modeling or planning. They
update value estimates based on observed rewards and
subsequent state values.
In practice, the choice between model-based learning and TD
learning depends on the specific problem, available computational
resources, and the trade-off between sample efficiency and
computational complexity.
Gamma (γ): the discount rate. A value between 0 and 1. The higher
the value the less you are discounting.
Lambda (λ): the credit assignment variable. A value between 0 and 1.
The higher the value the more credit you can assign to further back
states and actions.
Alpha (α): the learning rate, i.e., how much of the error we accept and therefore how far we adjust our estimates. A value between 0 and 1. A higher value adjusts aggressively, accepting more of the error, while a smaller one adjusts conservatively, taking smaller steps towards the actual values.
Delta (δ): a change or difference in value, i.e., the TD error used in the update (see the sketch below).
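Below is a short Python sketch of how α, γ, and δ appear in the TD(0) value update V(s) ← V(s) + α·δ, with δ = r + γ·V(s') - V(s). The tiny chain environment is an assumption made only for illustration, and λ is not used since this is plain TD(0) rather than TD(λ).

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5                 # simple chain: states 0..4, reward 1 on reaching state 4
V = np.zeros(n_states)       # state-value estimates
alpha, gamma = 0.1, 0.9      # learning rate and discount rate

for episode in range(2000):
    s = 0
    while s != n_states - 1:
        # Assumed dynamics: move right with probability 0.8, otherwise left.
        s_next = min(s + 1, n_states - 1) if rng.random() < 0.8 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        delta = r + gamma * V[s_next] - V[s]   # TD error (delta)
        V[s] += alpha * delta                  # move the estimate towards the target
        s = s_next

print(V.round(2))   # values grow as states get closer to the rewarding terminal state
```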
b List out the advantages and disadvantages of Temporal Difference learning. [L2][CO5] [6M]
Advantages:
• It can learn in every step online or offline.
• It can learn from a sequence which is not complete as well.
• It can work in continuous environments.
• It has lower variance compared to MC method and is more
efficient than MC method.
Limitations :
• It is a biased estimation.
• It is more sensitive to initialization.
b Difference between Model based learning and Model free learning [L1][CO4] [6M]
Model-based Algorithm
A model-based algorithm updates the Q-table using the next state S and the greedy action A. Based on the highest reward, it chooses the next action and in this way tries to maximize the total reward over all episodes.
It is also known as an off-policy model, as its primary job is to understand the environment and then create a state-action table. This table is used for predicting the reward in every state. In off-policy methods, the policy used to generate behaviour, called the behaviour policy, may be unrelated to the policy that is evaluated and improved, called the estimation policy. DQN is an example of a model-based algorithm.
Suppose you are learning to swim in a swimming pool. You will learn it by failing and gaining experience from your failures. Your swimming model will be trained based on the conditions of the swimming pool. Now, if you are told to swim in flowing water, it will be a challenging task for your model.
Model-free Algorithm
Model-free algorithms update the Q-table using the next state S and the current policy's action A'. They do not try to understand the whole environment; instead, they follow a policy-based approach. The policy could come from an algorithm like actor-critic. DDPG is an example of a model-free algorithm based on the actor-critic approach.
The difference
Here, the main difference is that the model-based algorithm tries to get familiar with its environment, whereas the model-free algorithm tries to optimize its policy directly (for example, via policy gradients). If the environment changes completely, the model-free algorithm has a higher chance of success than a model-based algorithm.
There are several ways to differentiate between a model-based and a model-free algorithm:
1. If the reward is estimated before the action is taken, it is a model-based algorithm.
2. If the accuracy of the model decreases with a change in the environment, it could be a model-based algorithm.
In the real world, we don’t have a fixed environment in every
situation. So, most of the use cases could be solved using a model-
free algorithm. Self-driving cars, robots, big games like AlphaGo