ML Unit-1
INTRODUCTION
1.a) What is Machine learning? Explain the need of it. [L2][CO1][2M]
Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences on its
own. The term "machine learning" was first introduced by Arthur Samuel in 1959. A machine has the
ability to learn if it can improve its performance by gaining more data. The need for machine learning
is increasing day by day, because it is capable of doing tasks that are too complex for a person to
implement directly.
Following are some key points which show the importance of Machine Learning:
• It can handle vast amounts of data efficiently, making it suitable for big data applications.
• It can solve complex problems which are difficult for a human.
• It can find hidden patterns and extract useful information from data.
1b) List out applications and some popular algorithms used in Machine
Learning. Explain them. [L2][CO1] [10M]
1. Healthcare: Machine learning revolutionizes healthcare by enabling advanced diagnostics,
personalized treatments, and predictive analysis. By analyzing medical data such as patient history, lab
results, and imaging scans, ML algorithms assist doctors in identifying diseases early and recommending
appropriate treatments. For instance, IBM Watson Health uses ML to analyze large datasets, helping
oncologists make informed decisions about cancer treatment plans, potentially improving patient
outcomes.
2. Finance: In the finance sector, machine learning plays a crucial role in fraud detection, risk
assessment, and algorithmic trading. By studying patterns in transactions, ML models can identify
suspicious activities and prevent fraud. PayPal, for example, uses ML-powered systems to monitor
real-time transaction data and detect anomalies that could indicate fraudulent behaviour, safeguarding
customers and businesses alike.
3. Retail and E-commerce: ML enhances the shopping experience by providing personalized
recommendations, optimizing inventory management, and improving marketing strategies. E-
commerce platforms like Amazon employ ML algorithms to analyze customer browsing and
purchasing habits, offering tailored product recommendations. This feature increases customer
satisfaction and drives sales growth for businesses.
4. Automotive Industry: Autonomous vehicles rely heavily on machine learning for navigation,
obstacle detection, and decision-making in real-time. Tesla’s Autopilot system uses ML to process data
from cameras, sensors, and GPS to detect lanes, identify obstacles, and drive safely without human
intervention. This technology is paving the way for self-driving cars to become a common feature on
roads.
5. Natural Language Processing (NLP): NLP-powered applications enable computers to
understand, interpret, and generate human language. Tools like Google Translate utilize ML-based
neural machine translation to deliver accurate translations while preserving context. Similarly, ML
drives virtual assistants and chatbots like Siri and Alexa, allowing them to understand voice commands
and provide helpful responses.
6. Image and Video Analysis: Machine learning is widely used for facial recognition, object
detection, and video analysis. Social media platforms like Facebook (Meta) use ML algorithms for
facial recognition, helping users tag friends in photos easily. ML is also used in entertainment, such as
creating realistic visual effects for movies or enhancing video content.
7. Cybersecurity: Cybersecurity applications of ML include real-time threat detection, vulnerability
assessment, and incident response. Companies like Darktrace use ML algorithms to monitor network
traffic, identify anomalies, and prevent cyberattacks before they cause damage. By constantly learning
from historical data, ML systems can adapt to new threats effectively.
8. Education: ML personalizes education by adapting lessons to students' learning styles and
proficiency levels. Platforms like Duolingo employ ML to customize language lessons based on the
user’s performance and pace, ensuring an engaging and effective learning experience. Additionally,
ML is used to automate grading and provide instant feedback.
9. Agriculture: Precision agriculture uses machine learning to monitor crops, predict weather
conditions, and optimize farming resources. John Deere has integrated ML into its smart tractors,
which analyze field data to determine the best methods for planting and harvesting. This technology
helps increase agricultural productivity and sustainability.
10. Energy Sector: Machine learning optimizes energy consumption, resource management, and
renewable energy integration. Google DeepMind uses ML in its data centers to predict and reduce
energy usage for cooling systems, achieving up to 40% cost savings. ML also aids in forecasting
energy demands and balancing supply in smart grids.
6. Random Forest: Random Forest is an ensemble learning algorithm that builds multiple decision
trees and combines their outputs for better accuracy and robustness. It is widely used for classification
and regression problems and handles missing data well.
2a) Explain the various types of Machine Learning techniques with neat diagrams. [L2][CO1] 8m
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
7. Optimize the Model:
- Fine-tune hyperparameters (e.g., learning rate, depth of the tree) using techniques like Grid
Search or Random Search (a grid-search sketch follows these steps).
- Prevent overfitting by using methods such as cross-validation, regularization, or dropout.
8. Deploy the Model:
- Once the model performs well on the testing data, deploy it to make predictions on real-world
data.
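As a minimal sketch of step 7, the following assumes scikit-learn and its built-in Iris dataset; the grid of parameter values is illustrative, not a recommendation.

```python
# Minimal grid-search sketch (assumes scikit-learn; grid values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 7: tune hyperparameters (here, depth of the tree) with 5-fold cross-validation.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))  # step 8 would deploy this model
```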
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which come
under supervised learning:
o Linear Regression
2. Classification: Classification algorithms are used when the output variable is categorical,
which means there are classes such as Yes-No, Male-Female, True-False, etc.
How It Works:
Unsupervised learning algorithms rely on discovering similarities or groupings in data based on their
inherent features. The model explores the data distribution and organizes it, providing insights that
would be difficult for humans to interpret manually, especially in large datasets.
1.Clustering: - Clustering involves grouping data points into clusters based on their similarity or
distance from one another. Points within the same cluster are more similar to each other compared to
those in other clusters.
2. Dimensionality Reduction:
- This technique reduces the number of features in a dataset while retaining its meaningful
characteristics. It helps in visualizing high-dimensional data and speeding up computations.
- Applications: Data compression, feature extraction, and preprocessing for supervised learning.
1. Data Collection:
- Example: Collect demographic information for users without knowing their preferences.
2. Preprocessing:
- Clean and normalize the data to ensure uniformity. Standardize feature scales to avoid dominance
of certain variables.
3. Algorithm Selection:
- Choose an unsupervised learning technique based on your objective (e.g., clustering for grouping,
PCA for dimensionality reduction).
4. Model Training:
- Feed the dataset to the selected algorithm. The model identifies patterns, clusters, or latent
features without any labelled guidance.
5. Interpret Results:
- Evaluate the clusters or patterns discovered by the algorithm and interpret their meaning in the
real-world context.
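A minimal sketch of this workflow, assuming scikit-learn and synthetic two-blob data (the dataset and the choice of K-means are illustrative):

```python
# Sketch of the unsupervised workflow: collect, preprocess, select, train, interpret.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # step 1: unlabelled data (two synthetic blobs)
               rng.normal(5, 1, (50, 2))])

X_scaled = StandardScaler().fit_transform(X)   # step 2: standardize feature scales
model = KMeans(n_clusters=2, n_init=10, random_state=0)  # step 3: choose an algorithm
labels = model.fit_predict(X_scaled)           # step 4: model finds groupings, no labels given

print("cluster sizes:", np.bincount(labels))   # step 5: interpret the discovered clusters
print("centroids:", model.cluster_centers_)
```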
1. Market Segmentation: Retail companies group customers into clusters based on purchasing
behaviours to design targeted marketing campaigns.
2. Anomaly Detection:
In banking and cybersecurity, unsupervised learning detects unusual behaviours, such as fraudulent
transactions or network breaches.
3. Recommender Systems:
Streaming platforms like Netflix use clustering to recommend shows or movies by grouping users
with similar viewing patterns.
4. Image Compression:
Dimensionality reduction algorithms reduce the size of images while retaining their essential
features.
5. Genomics: Clustering algorithms are used to group genes with similar expressions, aiding in medical research.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another group. Cluster
analysis finds the commonalities between the data objects and categorizes them as per the presence and
absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the set of items that occur together in
the dataset. Association rules make marketing strategy more effective: for example, people who buy item X
(say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is
Market Basket Analysis.
Reinforcement learning
It is an area of Machine Learning concerned with taking suitable actions to maximize reward in a
particular situation. It is employed by various software and machines to find the best possible
behaviour or path to take in a specific situation. Reinforcement learning differs from
supervised learning in that supervised training data comes with the answer key,
so the model is trained with the correct answer itself, whereas in reinforcement learning
there is no answer key: the reinforcement agent decides what to do to perform the given task. In
the absence of a training dataset, it is bound to learn from its own experience.
Main points in Reinforcement learning:
• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a
particular problem.
• Training: The training is based upon the input; the model returns a state, and the
user decides to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
1. Email Spam Detection: The system identifies patterns within emails (like certain
keywords, sender information, or formatting) based on a dataset labelled as "spam" or "not
spam." It then applies these learned patterns to incoming emails to separate unwanted messages.
For example, spam filters in email services like Gmail automatically sort suspicious emails into
the spam folder, saving users from potential phishing or advertising overload.
2. Image Classification: By analyzing labelled images, the model understands visual
features like shapes, colours, and textures to categorize new images. For example, a photo
classification system trained on pictures of dogs and cats can distinguish between the two in
new images by recognizing fur patterns, facial structures, or tail shapes.
3. Fraud Detection: Using labelled transactional data that highlights fraudulent activity, the
model detects irregular spending patterns or unusual account behaviours to flag potential fraud.
For instance, banks monitor credit card transactions and alert users when they detect an anomaly,
such as unexpected purchases from foreign locations.
4. Sentiment Analysis: Based on text samples marked as positive, negative, or neutral, the
model can gauge sentiment in online reviews, social media posts, or customer feedback. For
instance, businesses use this technology to analyze tweets and product reviews, helping them
understand customer satisfaction and adapt their strategies accordingly.
5. Medical Diagnosis: With access to medical records and diagnostic data labelled by
healthcare experts, the model can identify diseases or conditions. For example, an AI-powered
tool can detect pneumonia by analyzing thousands of labelled chest X-ray images, providing
doctors with a reliable second opinion.
6. Stock Price Prediction: Using financial data like historical stock prices, market trends,
and economic indicators, the model estimates future stock values. For example, investment
firms leverage these predictions to guide their trading strategies and portfolio management.
Supervised learning is like a teacher-student model, where the data acts as the teacher guiding
the algorithm to make accurate predictions or classifications.
3a) Compare Machine Learning and Artificial Intelligence. [L6][CO5] [6M]
ARTIFICIAL INTELLIGENCE vs MACHINE LEARNING
AI: The term "Artificial Intelligence" was originally used in 1956 by John McCarthy, who also hosted the first AI conference.
ML: The term "Machine Learning" was first used in 1959 by IBM computer scientist Arthur Samuel, a pioneer in artificial intelligence and computer games.
AI: AI stands for Artificial Intelligence, where intelligence is defined as the ability to acquire and apply knowledge.
ML: ML stands for Machine Learning, which is defined as the acquisition of knowledge or skill.
AI: AI is the broader family consisting of ML and DL as its components.
ML: Machine Learning is a subset of Artificial Intelligence.
AI: The aim is to increase the chance of success, not accuracy.
ML: The aim is to increase accuracy; it does not care about the chance of success.
AI: It works as a computer program that does smart work.
ML: Here the machine takes data and learns from the data.
AI: The goal is to simulate natural intelligence to solve complex problems.
ML: The goal is to learn from data on a certain task to maximize performance on that task.
AI: AI has a very broad variety of applications.
ML: The scope of machine learning is more constrained.
AI: AI can work with structured, semi-structured, and unstructured data.
ML: ML can work only with structured and semi-structured data.
3b) Describe classification techniques in supervised learning with an example. [L2][CO1] [6M]
Classification techniques in supervised learning involve training models to categorize data into
predefined classes. These methods aim to learn patterns or features from labelled training data
and then use that knowledge to classify new data accurately. Common techniques include:
A classifier is a type of machine learning algorithm that assigns a label to a data input. Classifier
algorithms use labelled data and statistical methods to produce predictions about data input
classifications.
1. Logistic Regression
2. K-Nearest Neighbour
3. Support Vector Machine (Kernel SVM)
4. Naïve Bayes
1. Logistic Regression
- It predicts the probability of a data point belonging to a certain class by modelling the
relationship between input features and output labels.
- Example : Predicting whether an email is spam or not based on features like word
frequency, sender, and formatting.
2. Decision Trees
- A tree-like structure where each node represents a decision based on feature values, and leaf
nodes denote the classification outcome.
- Example : Categorizing animals as mammals, reptiles, or birds based on features like body
temperature and mode of reproduction.
3. Random Forest
- It combines multiple decision trees to improve classification accuracy by taking the majority
vote from all trees.
4. Support Vector Machines (SVM)
- Finds a hyperplane that separates data points into classes by maximizing the margin between
them.
- Example : Classifying handwritten digits (like '7' or '9') based on pixel intensity values.
5. K-Nearest Neighbours (KNN)
- Classifies data points based on the class of the nearest neighbours in the feature space.
- Example : Identifying whether a fruit is an apple or an orange based on size, colour, and
texture.
6. Neural Networks
- Mimics the structure of the human brain, consisting of layers of neurons to learn complex
patterns and classifications.
- Example : Recognizing faces in images by analyzing facial features like eyes, nose, and
mouth.
7. Naive Bayes
- Uses probabilities to classify data based on Bayes' theorem, assuming features are
independent.
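As a small illustration, the following sketch applies two of the classifiers above (K-Nearest Neighbour and Naive Bayes) to scikit-learn's built-in Iris dataset; the split and parameters are arbitrary choices, not recommendations.

```python
# Sketch: fitting two of the classifiers described above on labelled data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (KNeighborsClassifier(n_neighbors=5), GaussianNB()):
    model.fit(X_tr, y_tr)                      # learn from labelled examples
    print(type(model).__name__, "accuracy:", round(model.score(X_te, y_te), 3))
```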
4a) List out various Unsupervised learning techniques used in Machine Learning. [L1][CO5]
[5M]
Unsupervised learning is a type of machine learning in which models are trained using an
unlabelled dataset and are allowed to act on that data without any supervision.
o Clustering: Clustering is a method of grouping objects into clusters such that objects with
the most similarities remain in one group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes them as
per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the set of items that
occur together in the dataset. Association rules make marketing strategy more effective: for example,
people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A
typical example of an association rule is Market Basket Analysis.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point belongs to only one
group) and Soft clustering (data points can also belong to other groups). Various other
approaches to clustering also exist. Below are the main clustering methods used in
Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
4b) Illustrate the clustering techniques in unsupervised learning with examples. [L3][CO2] [7M]
1. Partitioning Methods
These methods divide the dataset into distinct non-overlapping clusters. Each data point belongs
to exactly one cluster.
K-Means Clustering : It partitions data into K clusters by minimizing the distance between
data points and their respective cluster centroid. K-means is ideal for spherical and compact
clusters but requires you to predefine the number of clusters.
K-Medoids Clustering : Similar to K-means, but uses actual data points (medoids) as cluster
centers instead of centroids. K-medoids is more robust to noise and outliers.
2. Hierarchical Clustering
- Agglomerative Clustering (Bottom-Up): Each data point starts as its own cluster, and
clusters are merged iteratively based on similarity.
- Divisive Clustering (Top-Down): All data points start in one cluster, which is split
recursively into smaller clusters.
Example : Breaking down customer groups based on broad categories, then into
subcategories.
3. Density-Based Clustering
Clusters are formed based on the density of data points, identifying regions of high data
concentration.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
It groups data points in dense areas and labels sparse regions as noise. DBSCAN is suitable for
clusters of arbitrary shapes.
Example : Identifying star clusters in space based on the density of celestial objects.
4. Model-Based Clustering
This method assumes data is generated from a mixture of probability distributions, such as
Gaussian distributions.
- Gaussian Mixture Models (GMM) : Each cluster is modelled as a Gaussian distribution.
GMM provides probabilities for data points belonging to a cluster, allowing overlapping
clusters.
5. Grid-Based Clustering
The data space is divided into a grid of finite cells, and clusters are formed based on cell density.
- Wave Cluster : Uses wavelet transformation to group data points in grids, useful for
spatial clustering.
6. Fuzzy Clustering
Data points can belong to multiple clusters with varying degrees of membership.
- Fuzzy C-Means (FCM) : Each data point has a membership value to clusters rather than
being assigned to one exclusively.
Example : Classifying images with overlapping features, such as blending two colours.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms.
It classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast, with fewer computations
required, and has linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth
density of data points. It is an example of a centroid-based model, which works by updating
candidates for centroids to be the centre of the points within a given region.
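To make the contrast between clustering families concrete, here is a hedged sketch comparing a density-based method (DBSCAN) with a model-based one (a Gaussian mixture) on synthetic "two moons" data; the dataset and parameter values are illustrative only.

```python
# Sketch: density-based (DBSCAN) vs model-based (GMM) clustering on toy data.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)      # handles arbitrary shapes
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # assumes Gaussian components
gmm_labels = gmm.predict(X)

print("DBSCAN labels found:", sorted(set(db_labels)))   # -1 marks noise points, if any
print("GMM cluster sizes:", [int((gmm_labels == k).sum()) for k in range(2)])
```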
Guidelines for Machine Learning Experiments
Before we start experimentation, we need to have a good idea about what it is we are studying,
how the data is to be collected, and how we are planning to analyze it.
o Aim of the Study
o Selection of the Response Variable
o Choice of Factors and Levels
o Choice of Experimental Design
o Performing the Experiment
o Statistical Analysis of the Data
o Conclusions and Recommendations
A. Aim of the Study:
We need to start by stating the problem clearly, defining what the objectives are. In machine
learning, there may be several possibilities. As we discussed before, we may be interested in
assessing the expected error (or some other response measure) of a learning algorithm on a
particular problem and check that, for example, the error is lower than a certain acceptable
level.
Given two learning algorithms and a particular problem as defined by a dataset, we may want
to determine which one has less generalization error. These can be two different algorithms, or
one can be a proposed improvement of the other, for example, by using a better feature
extractor.
In the general case, we may have more than two learning algorithms, and we may want to
choose the one with the least error, or order them in terms of error, for a given dataset. In an
even more general setting, instead of on a single dataset, we may want to compare two or more
algorithms on two or more datasets.
B. Selection of the Response Variable
We need to decide on what we should use as the quality measure. Most frequently, error is
used that is the misclassification error for classification and mean square error for regression.
We may also use some variant; for example, generalizing from 0/1 to an arbitrary loss, we may
use a risk measure. In information retrieval, we use measures such as precision and recall.
In a cost-sensitive setting, not only the
output but also system parameters, for example, its complexity, are taken into account.
C. Choice of Factors and Levels
What the factors are depends on the aim of the study. If we fix an algorithm and want to find the
best hyperparameters, then those are the factors. If we are comparing algorithms, the learning
algorithm is a factor. If we have different datasets, they also become a factor. The levels of a
factor should be carefully chosen so as not to miss a good configuration and avoid doing
unnecessary experimentation. It is always good to try to normalize factor levels.
For example, in optimizing k of k-nearest neighbour, one can try values such as 1, 3, 5, and so
on, but in optimizing the spread h of Parzen windows, we should not try absolute values such as
1.0, 2.0, and so on, because that depends on the scale of the input; it is better to find some
statistic that is an indicator of scale (for example, the average distance between an instance
and its nearest neighbour) and try h as different multiples of that statistic. Though previous
expertise is a plus in general, it is also important to investigate all factors and factor levels that
may be of importance and not be overly influenced by past experience.
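The scale statistic suggested above can be computed directly. The following is a minimal sketch assuming scikit-learn and random toy data:

```python
# Sketch: average nearest-neighbour distance as a scale indicator for choosing h.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(100, 3))   # toy data, purely illustrative

nn = NearestNeighbors(n_neighbors=2).fit(X)
dists, _ = nn.kneighbors(X)          # column 0 is each point's distance to itself (zero)
scale = dists[:, 1].mean()           # mean distance to the true nearest neighbour

candidate_h = [0.5 * scale, scale, 2.0 * scale]   # try h as multiples of the statistic
print("scale statistic:", round(scale, 3), "candidate h values:", candidate_h)
```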
D. Choice of Experimental Design
It is always better to do a factorial design unless we are sure that the factors do not interact,
because mostly they do. Replication number depends on the dataset size; it can be kept small
when the dataset is large; we will discuss this in the next section when we talk about resampling.
However, too few replicates generate few data, and this will make comparing distributions
difficult; in the particular case of parametric tests, the assumptions of Gaussianity may not be
tenable. Generally, given some dataset, we leave some part as the test set and use the rest for
training and validation, probably many times by resampling. How this division is done is
important.
In practice, using small datasets leads to responses with high variance, and the differences will
not be significant and results will not be conclusive. It is also important to avoid as much as
possible toy, synthetic data and use datasets that are collected from real-world under real-life
circumstances.
E. Performing the Experiment
Before running a large factorial experiment with many factors and levels, it is best if one does a
few trial runs for some random settings to check that all is as expected. In a large experiment, it
is always a good idea to save intermediate results (or seeds of the random number generator),
so that a part of the whole
experiment can be rerun when desired.
All the results should be reproducible. In running a large experiment with many factors and
factor levels, one should be aware of the possible negative effects of software aging. It is
important that an experimenter be unbiased during experimentation. In comparing one’s
favourite algorithm with a competitor, both should be investigated equally diligently.
In large-scale studies, it may even be envisaged that testers be different from developers. One
should avoid the temptation to write one’s own “library” and instead, as much as possible, use
code from reliable sources; such code would have been better tested and optimized.
As in any software development study, the advantages of good documentation cannot be
overstated, especially when working in groups. All the methods developed for high-quality
software engineering should also be used in machine learning experiments.
F. Statistical Analysis of the Data
This corresponds to analyzing data in a way so that whatever conclusion we get is not subjective
or due to chance. We cast the questions that we want to answer in a hypothesis testing
framework and check whether the sample supports the hypothesis.
For example, the question "Is A a more accurate algorithm than B?" becomes the hypothesis
"Can we say that the average error of learners trained by A is significantly lower than the
average error of learners trained by B?" As always, visual analysis is helpful, and we can use
histograms of error distributions, whisker-and-box plots, range plots, and so on.
G. Conclusions and Recommendations
Once all data is collected and analysed, we can draw objective conclusions. One frequently
encountered conclusion is the need for further experimentation. Most statistical, and hence
machine learning or data mining, studies are iterative. It is for this reason that we never start
with all the experimentation. It is suggested that no more than 25 percent of the available
resources should be invested in the first experiment (Montgomery 2005). The first runs are for
investigation only. That is also why it is a good idea not to start with high expectations, or
promises to one’s boss or thesis advisor. We should always remember that statistical testing
never tells us if the hypothesis is correct or false, but how much the sample seems to concur with
the hypothesis. There is always a risk that we do not have a conclusive result or that our
conclusions are wrong, especially if the data is small and noisy. When our expectations are not
met, it is most helpful to investigate why they are not. For example, in checking why our
favourite algorithm A has worked awfully badly on some cases, we can get a splendid idea for
some improved version of A.
All improvements are due to the deficiencies of the previous version; finding a deficiency is but
a helpful hint that there is an improvement we can make! But we should not go to the next step
of testing the improved version before we are sure that we have completely analysed the current
data and learned all we could learn from it. Ideas are cheap, and useless unless tested, which is
costly.
6a) Explain Model Selection in Machine learning. [L2][CO1] [6M]
Model selection in machine learning is a crucial process where the most appropriate algorithm or
model is chosen to solve a specific problem based on factors like data, objectives, and
constraints. Here’s a detailed explanation:
1. Understanding the Problem
- Clearly define the task: Is it classification, regression, clustering, or something else?
- Identify the type of data: Tabular, text, images, or time-series data may require specific
models.
- Assess requirements: Are interpretability, scalability, or computational efficiency critical?
2. Exploring the Dataset
- Size of Data : Some models, like neural networks, require large datasets to perform well,
while simpler algorithms like linear regression can handle smaller datasets effectively.
- Feature Types : Categorical or numerical features may need preprocessing for some
models. For example, decision trees handle categorical data naturally, while Support Vector
Machines require numerical data.
- Quality of Data : Assess missing values, outliers, and noise, which can affect model
performance.
3. Criteria for Model Selection
- Accuracy : Models should perform well on training and test data to ensure reliability.
- Speed : If computational time is a concern, simpler models like Logistic Regression may be
preferable over complex ones like Gradient Boosted Trees or Deep Learning models.
- Scalability : Consider if the model can handle larger datasets or additional features
efficiently.
- Interpretability : If understanding the decision-making process is important, models like
decision trees or linear regression are easier to interpret than black-box models like neural
networks.
4. Evaluating Different Models
Before finalizing, different models can be evaluated using approaches like:
- Cross-Validation : Splitting the data into training and validation sets and measuring
performance across multiple iterations ensures the chosen model generalizes well.
- Hyperparameter Tuning : Adjust model parameters to find the optimal configuration for
performance.
- Performance Metrics :
- For classification: Accuracy, F1-score, precision, recall, etc.
- For regression: Mean squared error (MSE), R-squared, etc.
5. Testing with Baseline Models
Start with simpler models as benchmarks (e.g., Linear Regression, Decision Trees) and
compare them against more complex models like ensemble methods (Random Forest, boosting) or
deep learning (a cross-validation comparison is sketched below).
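A minimal sketch of such a baseline-versus-complex comparison, assuming scikit-learn and its built-in breast-cancer dataset; the two models simply stand in for "baseline" and "complex":

```python
# Sketch: comparing a baseline and a more complex model with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = [("baseline: logistic regression", LogisticRegression(max_iter=5000)),
          ("complex: random forest", RandomForestClassifier(random_state=0))]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on each fold
    print(name, "mean CV accuracy:", round(scores.mean(), 3))
```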
6. Trade-Off Analysis
Every model has strengths and weaknesses. Consider trade-offs based on:
- Performance vs Interpretability: Neural networks may perform better but are harder to
interpret.
- Complexity vs Practicality: Ensemble methods might be accurate but require more resources.
7. Tools for Model Selection
Utilize frameworks like:
- Scikit-Learn : A wide range of algorithms for easy experimentation.
- TensorFlow/Keras/PyTorch : For deep learning tasks.
- AutoML Tools : Automatically select models and tune hyperparameters.
8. Testing on Real-World Data
Once a model is selected and trained, test it on unseen data or simulated environments to
ensure its robustness and applicability.
Discriminate generalization requires the model to:
Recognize the differences between various classes or categories in the training data.
Apply this differentiation to unseen data by generalizing its learning in a way that maintains
accuracy without confusing classes.
For example, in image classification, a model trained to distinguish between cats and dogs
should consistently classify a new, unseen image of a dog correctly, even if it's slightly
different (such as a unique breed or unusual angle).
Factors Affecting Discriminate Generalization
Feature Engineering: Models must focus on meaningful features that help discriminate
between classes. For instance, texture and shape might be important for distinguishing cats
from dogs.
Model Complexity: Overly simple models might fail to capture subtle differences between
classes, while overly complex models might memorize training data, leading to poor
generalization.
Training Data Quality: Data imbalance (e.g., far more images of cats than dogs) can hinder
a model's ability to generalize across all classes equally.
Regularization: Techniques like L2 regularization or dropout prevent the model from
overfitting, allowing better generalization.
Evaluation Metrics: Metrics like precision and recall for each class ensure the model is not
favouring one category at the expense of another.
Examples of Discriminate Generalization in Machine Learning
Example 1: Image Classification
A model trained to classify images into "cars" and "trucks" must learn features that
distinguish between them, such as:
Trucks often have larger, boxy bodies with cargo space. To test generalization, the model
might be evaluated on new images of vehicles with unusual designs (e.g., a sporty truck).
Example 2: Sentiment Analysis
A model trained on labelled reviews learns discriminating words, such as negative words:
"bad," "ugly," "sad." The model should generalize well by accurately
predicting sentiment in reviews containing rare, complex sentences.
Strategies for Better Discriminate Generalization
Robust Features: Extract features that capture the true essence of the difference between
categories (e.g., edges in image classification, or frequency of certain keywords in text
analysis).
Advanced Architectures: Employ models that are naturally good at discrimination, such as
convolutional neural networks (CNNs) for images or transformers for text.
Monitoring Overfitting: Regularly compare training and validation performance to ensure the
model is generalizing correctly.
Real-World Applications
1. Healthcare Diagnostics: Discriminate between benign and malignant tumours from medical
images.
2. Spam Detection: Separate spam emails from legitimate emails using NLP.
7a) Compare Supervised learning and Unsupervised learning. [L6][CO1] [6M]
Supervised: Supervised learning algorithms are trained using labelled data.
Unsupervised: Unsupervised learning algorithms are trained using unlabelled data.
Supervised: The model takes direct feedback to check whether it is predicting the correct output or not.
Unsupervised: The model does not take any feedback.
Supervised: The model predicts the output.
Unsupervised: The model finds the hidden patterns in data.
Supervised: Input data is provided to the model along with the output.
Unsupervised: Only input data is provided to the model.
Supervised: The goal is to train the model so that it can predict the output when given new data.
Unsupervised: The goal is to find the hidden patterns and useful insights from the unknown dataset.
Supervised: Supervised learning needs supervision to train the model.
Unsupervised: Unsupervised learning does not need any supervision to train the model.
Supervised: It can be used for cases where we know the inputs as well as the corresponding outputs.
Unsupervised: It can be used for cases where we have only input data and no corresponding output data.
The key components of reinforcement learning are:
1. Agent: The learner or decision-maker that interacts with the environment.
2. Environment: Everything the agent interacts with and receives feedback from.
3. State (S): The current situation of the agent in the environment.
4. Action (A): The possible moves or choices the agent can make.
5.Reward (R): Feedback the agent receives after performing an action; can be positive or
negative.
Interaction: The agent observes the state, takes an action, and receives a reward from the
environment.
Learning: Based on the action and reward, the agent updates its policy using algorithms like
Q-Learning or Deep Q-Learning.
Optimization: The agent continues this cycle to improve its decisions.
Example: An agent controlling a vehicle at a traffic signal.
State (S): The current state could be "red light," "green light," or "yellow light."
Action (A): The agent can choose to stop or to move.
Reward (R):
o Positive reward (+1) for stopping at a red light or moving at a green light.
o Negative reward (-1) for running a red light or stopping unnecessarily at a green light.
Example: The problem is as follows: We have an agent and a reward, with many hurdles in
between. The agent is supposed to find the best possible path to reach the reward. The following
example explains the problem more clearly.
The above image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, that is,
the diamond, while avoiding the hurdles, that is, the fire. The robot learns by trying all the possible paths and
then choosing the path which gives it the reward with the fewest hurdles. Each right step gives the
robot a reward, and each wrong step subtracts from the robot's reward. The total reward is
calculated when it reaches the final reward, that is, the diamond.
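A toy version of this robot problem can be solved with tabular Q-learning. The sketch below is an invented illustration, not the exact problem above: a one-dimensional corridor where the robot starts in the middle, fire sits at one end, and the diamond at the other; rewards and hyperparameters are arbitrary.

```python
# Sketch: tabular Q-learning on a tiny corridor world (states 0..4).
# State 0 holds fire (-10, episode ends); state 4 holds the diamond (+10, episode ends).
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    if nxt == 4: return nxt, 10.0, True   # reached the diamond
    if nxt == 0: return nxt, -10.0, True  # walked into the fire
    return nxt, -1.0, False               # small cost per step

for _ in range(500):                      # episodes of trial and error
    s, done = 2, False                    # robot starts in the middle
    while not done:
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2

print("greedy action per state (1 = right):", Q.argmax(axis=1))
```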
ASSOCIATION RULES:
Association rule learning is a kind of unsupervised learning technique that tests for the dependence of one
data element on another data element and maps them accordingly so that the process can be more cost-effective. It
tries to discover interesting relations or associations between the variables of the dataset. It
relies on various rules to find interesting relations between variables in the database.
Association rule learning is an important approach in machine learning, and it is employed in
Market Basket Analysis, web usage mining, continuous production, etc. In market basket analysis, it is
an approach used by several big retailers to find the relations between items.
The following are the main types of association rule learning algorithms:
Apriori Algorithm − This algorithm uses frequent itemsets to generate association rules. It is designed
to work on databases that contain transactions. It uses a breadth-first search and a hash tree
to count itemsets efficiently.
It is generally used for market basket analysis and helps to learn which products can be purchased
together. It can also be used in the healthcare domain to discover drug reactions in patients.
Eclat Algorithm − Eclat stands for Equivalence Class Transformation. This algorithm
uses a depth-first search to discover frequent itemsets in a transaction database. It executes
faster than the Apriori algorithm.
F-P Growth Algorithm − F-P growth stands for Frequent Pattern growth. It is an enhanced
version of the Apriori algorithm. It represents the database in the form of a tree structure known
as a frequent pattern tree (FP-tree), whose aim is to extract the most frequent patterns. There are
various applications of association rules, as follows:
• Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the
next product that customers are likely to buy.
• Optional services purchased by telecom users (call waiting, call forwarding, DSL, speed call,
etc.) help decide how to bundle these functions to maximize revenue.
• Banking services used by retail customers (money market accounts, CDs, investment services, car loans,
etc.) identify users likely to need other services.
• An unusual group of insurance claims can be an indication of fraud and can trigger further investigation.
• Medical patient histories can suggest likely complications based on a definite set
of treatments.
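As a hedged sketch of these ideas in practice, the following uses the mlxtend library (assumed to be installed) on a tiny, made-up one-hot transaction table:

```python
# Market-basket sketch with mlxtend's Apriori (data is invented for illustration).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is one basket; True means the item was purchased.
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]],
    columns=["bread", "butter", "jam"],
).astype(bool)

frequent = apriori(baskets, min_support=0.4, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```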
9) Analyze the classification and regression techniques in supervised learning. [L4][CO1] [12M]
Supervised Learning
Supervised learning is a branch of machine learning where models are trained using labelled data. It means
the input data comes with corresponding output labels. The goal is to learn the mapping between inputs and
outputs to make accurate predictions on new, unseen data.
The two primary types of supervised learning problems are classification (for discrete outputs) and
regression (for continuous outputs). Let’s dive deeper into each.
1. Classification Techniques
Classification involves predicting a categorical variable or class label from input data. The model assigns
the input to one of the predefined categories.
Characteristics of Classification
- Output : Discrete classes or categories (e.g., "Yes" or "No," "Spam" or "Not Spam").
- Goal : Minimize misclassification errors and improve accuracy.
- Example Problems : Email spam detection, disease diagnosis (e.g., cancer vs non-cancer), image
recognition.
Classification Process
1. Data Preparation :
- Ensure data is labelled (each input has a corresponding output class).
- Preprocess data (handle missing values, normalize features).
2. Model Selection :
- Simple Models : Logistic Regression for binary classification tasks.
- Complex Models : Neural Networks for tasks like image and text classification.
3. Training the Model :
- Apply a classification algorithm to learn decision boundaries.
- Use labelled data to adjust parameters.
4. Evaluation :
- Test the model on unseen data using metrics like accuracy, precision, recall, and F1-score.
1. Logistic Regression : Predicts probabilities of classes using the sigmoid function. Suitable for binary
classification tasks.
2. Support Vector Machines (SVM) : Finds hyperplanes that separate classes with maximum margin.
- Example : Classifying species of flowers based on petal dimensions.
3. Decision Trees : Uses a tree-like structure to split data based on feature values.
- Example : Classifying customers based on purchasing behaviour.
4. Random Forests : Combines multiple decision trees to improve accuracy and reduce overfitting.
- Example : Fraud detection in credit card transactions.
5. Neural Networks : Suitable for high-dimensional data like images, videos, and text.
- Example : Identifying handwritten digits.
Logistic Regression in detail: Logistic regression in Machine Learning is used to find the probability of
event = Success and event = Failure. We should use logistic regression when the dependent variable is
binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1, and it can be
represented by the following equation:
odds = p / (1 - p) = probability of event occurrence / probability of event non-occurrence
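A short worked sketch of this relationship, where the score z is a hypothetical value of the model's linear combination of inputs:

```python
# Worked example: the sigmoid turns a linear score into p; odds = p / (1 - p).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.2                      # hypothetical linear score from the input features
p = sigmoid(z)               # probability that the event occurs (Y between 0 and 1)
odds = p / (1 - p)           # the odds from the equation above

print(round(p, 3), round(odds, 3), round(math.log(odds), 3))  # log(odds) recovers z
```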
Evaluation Metrics :
- Accuracy : Percentage of correct predictions.
- Precision : Fraction of relevant instances among the retrieved ones.
- Recall : Fraction of relevant instances retrieved out of all relevant ones.
- F1-Score : Combines precision and recall into a single metric.
- Confusion Matrix : Provides detailed insight into classification errors for each class.
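These metrics can be computed as in the following sketch, assuming scikit-learn and invented labels:

```python
# Sketch: the classification metrics above, on hypothetical true/predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", round(f1_score(y_true, y_pred), 3))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```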
2. Regression Techniques
Regression is used for predicting a continuous numeric output. The goal is to find the relationship between
dependent and independent variables.
Characteristics of Regression
- Output : Numeric or continuous values (e.g., 0.75, 1000).
- Goal : Minimize the error between predicted and actual values.
- Example Problems : Forecasting stock prices, predicting house prices, estimating temperature.
Regression Process
1. Data Preparation :
- Handle missing values and outliers.
- Scale features for certain algorithms like gradient descent.
2. Model Selection :
- Simple Models : Linear Regression for straight-line relationships.
- Complex Models : Polynomial Regression for non-linear relationships.
3. Training the Model :
- Fit the chosen regression algorithm to the training data to learn the relationship between
features and the target.
4. Evaluation :
- Test the model on unseen data using metrics like mean squared error (MSE) and R² score.
1. Linear Regression : Fits a straight line to model the relationship between inputs and a
continuous output (e.g., predicting house prices).
2. Polynomial Regression : Fits a curve to capture non-linear relationships (e.g., modelling
growth trends over time).
3. Support Vector Regression (SVR) : Uses principles of SVM to predict continuous outputs.
- Example : Predicting temperature variations.
4. Decision Trees for Regression : Splits the data into regions with similar output values.
- Example : Predicting sales figures based on marketing spend.
5. Neural Networks for Regression : Suitable for modelling complex, non-linear dependencies.
- Example : Predicting energy usage from historical data.
Evaluation Metrics :
- Mean Absolute Error (MAE) : Average absolute differences between predicted and actual values.
- Mean Squared Error (MSE) : Average squared differences between predicted and actual values.
- Root Mean Squared Error (RMSE) : Square root of MSE for better interpretation.
- R² Score : Measures the proportion of variance in the output explained by the model.
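A parallel sketch for the regression metrics, again on invented values:

```python
# Sketch: MAE, MSE, RMSE, and R² on hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))               # square root of MSE, easier to interpret
r2 = r2_score(y_true, y_pred)
print(round(mae, 3), round(mse, 3), round(rmse, 3), round(r2, 3))
```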
Conclusion
Both classification and regression are fundamental supervised learning techniques, addressing different
types of prediction problems. Classification is suited for categorical outputs, while regression handles
continuous numeric outputs. Selecting the right algorithm depends on the nature of the data, the problem,
and the desired accuracy.
Machine learning has found numerous applications across various industries, revolutionizing processes and
enabling the development of innovative solutions. Here are some real-world applications of machine
learning:
Healthcare: Machine learning is used for medical diagnosis, patient monitoring, and treatment
planning. It can analyze medical records, images, and genomic data to assist in early disease detection,
personalized medicine, and predicting patient outcomes. Machine learning models can also help identify
patterns and anomalies in large healthcare datasets for improved decision-making.
Finance: Machine learning is widely applied in financial institutions for fraud detection, credit
scoring, algorithmic trading, and risk assessment. It can analyze vast amounts of financial data to identify
fraudulent transactions, predict market trends, and optimize investment strategies. Machine learning
models are also used for automated trading based on historical and real-time market data.
Retail and E-commerce: Machine learning is used for personalized recommendations, demand
forecasting, inventory management, and pricing optimization. By analyzing customer behavior, browsing
history, and purchase patterns, machine learning models can recommend relevant products to users,
optimize pricing strategies, and predict customer preferences to improve sales and customer satisfaction.
Transportation and Logistics: Machine learning is utilized for route optimization, demand
forecasting, and predictive maintenance in transportation and logistics. It can analyze historical data,
real-time traffic information, and weather conditions to optimize routes for delivery vehicles, forecast
demand for transportation services, and detect anomalies in equipment performance to prevent
breakdowns.
Manufacturing: Machine learning is used in manufacturing industries for quality control,
predictive maintenance, and process optimization. It can analyze sensor data from production lines to
detect anomalies and ensure product quality. Machine learning models can also predict equipment failures,
enabling proactive maintenance to minimize downtime and maximize productivity.
Natural Language Processing (NLP): Machine learning techniques are applied in NLP
applications such as language translation, sentiment analysis, chatbots, and voice assistants. NLP models
can understand and generate human language, enabling accurate translation between languages, sentiment
analysis of customer feedback, and interactive conversational experiences.
Autonomous Vehicles: Machine learning plays a crucial role in autonomous vehicles by enabling object
detection and recognition, scene understanding, and decision-making. Machine learning models process
sensor data from cameras, LiDAR, and radar to detect and classify objects on the road, navigate complex
environments, and make real-time decisions to ensure safe driving.
Energy and Utilities: Machine learning is used for energy load forecasting, anomaly detection in power
grids, and optimizing energy consumption. It can analyze historical energy consumption data, weather
conditions, and other factors to predict future energy demand and optimize energy generation and
distribution.
These are just a few examples of the vast range of real-world applications of machine learning. The
versatility and potential of machine learning continue to expand, with ongoing research and development
pushing the boundaries of what is possible in various industries and domains.