
UNIT 2 (Chapters 1, 2)

Text Classification / Categorization Algorithms
• Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms.
• Regression algorithms predict continuous values, such as the price of a house, temperature, or stock market trends. To predict categorical values, such as whether an email is spam or not spam, or whether a customer is "High Risk" or "Low Risk," we need Classification algorithms.
What is a Classification Algorithm?
A Classification algorithm is a Supervised Learning technique used to identify the category of new observations on the basis of training data.
In classification, a program learns from a given dataset of labeled observations and then assigns each new observation to one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes are also called targets, labels, or categories.
Key Definitions:
Documents and Classes:
D: a collection of documents to be classified.
C = {C1, C2, ..., Ck}: a set of k predefined classes or categories, such as Promotion, Social, Private, etc.
Classification Function F:
F(d, Cp): a binary function that determines whether a document d belongs to a specific class Cp in C:
F(d, Cp) = 1 if d belongs to Cp
F(d, Cp) = 0 if d does not belong to Cp
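• A minimal Python sketch of F(d, Cp) as a binary membership function. The keyword rule below is a hypothetical stand-in for a trained classifier, purely for illustration:

# Hypothetical keyword rule standing in for a trained classifier.
def F(d: str, Cp: str) -> int:
    """Return 1 if document d belongs to class Cp, else 0."""
    keywords = {
        "Promotion": {"sale", "offer", "discount"},
        "Social": {"friend", "party", "invite"},
    }
    words = set(d.lower().split())
    return 1 if words & keywords.get(Cp, set()) else 0

print(F("huge discount sale today", "Promotion"))  # 1
print(F("huge discount sale today", "Social"))     # 0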
• Types of Classification:
• Single-label Classification: Each document is assigned exactly one class.
• Multi-label Classification: A document can belong to multiple classes
simultaneously.
Text Classification Algorithms
• Text categorization can be accomplished using a variety of classification algorithms.
• Text classification algorithms are categorized into two groups.
1. Supervised algorithms
2. Unsupervised algorithms
Supervised learning
• Supervised learning is a type of machine learning algorithm that learns from
labeled data.
• Labeled data is data that has been tagged with a correct answer or
classification.
• Supervised learning involves training a machine on labeled data.
• The machine learns the relationship between inputs (for example, fruit images) and outputs (fruit labels).

• The trained machine can then make predictions on new, unlabeled data.
Unsupervised learning
• Unsupervised learning is a type of machine learning that learns from unlabeled
data.
• This means that the data does not have any pre-existing labels or categories.
• The goal of unsupervised learning is to discover patterns and relationships
in the data without any explicit guidance.
• Here the task of the machine is to group unsorted information according
to similarities, patterns, and differences without any prior training of
data.
Naïve Bayes / Bayes' Theorem
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training
dataset.
• It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to a class.
• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
• Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)
Where:
P(A∣B) is the posterior probability, the probability of event A occurring
given that event B has occurred.
P(B∣A) is the likelihood, the probability of observing event B given that event
A is true.
P(A) is the prior probability, the initial probability of event A before
any evidence is taken into account.
P(B) is the marginal likelihood or total probability of observing event B,
regardless of the occurrence of event A.
Real-Life Example (Medical Test):
• Event A: The person has the disease.
• Event B: The person tests positive for the disease.
Let’s define the following:
Prior Probability P(A): The probability that a person has the disease.
Likelihood P(B∣A): The probability that a person with the disease will test
positive (say 99%).
Marginal Likelihood P(B): The total probability that a person will test positive, regardless of whether they have the disease or not.
Posterior Probability P(A∣B): The probability that the person has the disease given that they tested positive.
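• A worked computation of this example in Python. Only the 99% likelihood comes from above; the 1% prior and 5% false-positive rate are assumed for illustration:

# Worked Bayes example; the prior and false-positive rate are assumptions.
p_disease = 0.01            # P(A): prior probability of disease (assumed)
p_pos_given_disease = 0.99  # P(B|A): likelihood, from the slide
p_pos_given_healthy = 0.05  # P(B|not A): false-positive rate (assumed)

# Marginal likelihood P(B) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.167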
• Let's walk through the process of applying Naive Bayes' theorem, using the frequency table provided, to predict which type of fruit corresponds to the properties {Yellow, Sweet, Long}, step by step for both fruits: 1. Mango and 2. Banana.
• Naive Bayes' theorem for classification (dropping the constant denominator P(X)):
P(Fruit∣X) ∝ P(Yellow∣Fruit) × P(Sweet∣Fruit) × P(Long∣Fruit) × P(Fruit)
We first calculate the conditional probabilities for Mango and Banana, as in the sketch below.
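• The frequency table itself was a slide image and is not reproduced here, so the counts below are hypothetical placeholders; substitute the real table values. A sketch of the computation for both fruits:

# Naive Bayes scoring for the observation {Yellow, Sweet, Long}.
# All counts are hypothetical placeholders for the slide's table.
counts = {
    "Mango":  dict(total=650, yellow=350, sweet=450, long=0),
    "Banana": dict(total=400, yellow=400, sweet=300, long=350),
}
n_fruits = sum(c["total"] for c in counts.values())

def score(fruit):
    c = counts[fruit]
    prior = c["total"] / n_fruits           # P(Fruit)
    likelihood = (c["yellow"] / c["total"]  # P(Yellow|Fruit)
                  * c["sweet"] / c["total"] # P(Sweet|Fruit)
                  * c["long"] / c["total"]) # P(Long|Fruit)
    return likelihood * prior               # proportional to P(Fruit|X)

for fruit in counts:
    print(fruit, score(fruit))
# With these counts Banana wins: P(Long|Mango) = 0 rules Mango out.
# In practice Laplace smoothing is used to avoid such hard zeros.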
Support Vector Machine (SVM)
• Definition:
SVM is a supervised machine learning algorithm used for classification
and regression tasks.
• It works by finding the best boundary (or hyperplane) that separates different
classes in the data.
• SVMs are highly adaptable, making them suitable for various applications
such as text classification, image classification, spam
detection, handwriting identification, face detection, and anomaly
detection.
• SVM is best suited for classification tasks.
• The primary objective of the SVM algorithm is to identify the optimal
hyperplane in an N-dimensional space that can effectively separate data points
into different classes in the feature space.
• Support Vectors: The data points closest to the hyperplane; these points influence the position and orientation of the hyperplane.
• Margin: The distance between the hyperplane and the nearest data points
of each class. SVM tries to maximize this margin for better
classification.
• Example in Real Life:
• Spam Email Detection: SVM separates spam emails from regular emails by
analyzing features like word frequency.
• Tumor Classification:
• In medical imaging, SVM separates images into "tumor" and "non-tumor"
classes.
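• A minimal scikit-learn sketch of SVM-based spam detection; the four toy emails and their labels are assumed for illustration:

# TF-IDF turns word frequencies into features; LinearSVC finds the
# maximum-margin hyperplane separating the two classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

emails = ["win a free offer now", "cheap deal buy today",
          "team meeting schedule", "project status update"]
labels = ["spam", "spam", "not spam", "not spam"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(emails, labels)
print(model.predict(["free offer today"]))  # likely ['spam']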
Feature Selection
• Definition:
Feature selection is a way of selecting the subset of the most relevant
features from the original features set by removing the redundant,
irrelevant, or noisy features.
• By reducing the number of terms (features), we decrease the time and
resources needed to train a classifier.
• Example: Imagine sorting through a library. Instead of looking at every
book (all terms), you only focus on books about “science fiction”
(important terms).
• Here’s a simple explanation of the feature selection algorithm:
Step 1: Extract Vocabulary: Identify all unique terms in the training documents.
• Example: Classifying Emails as Spam or Not
• Email 1: "cheap deal today"
• Email 2: "limited offer buy now"
• Email 3: "team meeting schedule."
• Vocabulary: {cheap, deal, today, limited, offer, buy, now, team, meeting,
schedule}.
Step 2: Calculate how much each word helps in classifying emails:
• Words like "cheap" and "buy" often appear in spam emails.
• Words like "team" and "meeting" appear in non-spam emails.
Step 3: Rank the words based on their contribution to
distinguishing spam from non-spam:
• Spam-related: cheap (0.8), buy (0.7), offer (0.6).
• Non-spam-related: team (0.9), meeting (0.8).
Step 4: Select the top k words. For k=3 choose:
{cheap, team, meeting}.
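• A sketch of these four steps with scikit-learn. The 0.8/0.9 scores above are illustrative; chi-squared scoring is one concrete way to compute such term contributions:

# Feature selection on the three example emails via chi-squared scores.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap deal today", "limited offer buy now", "team meeting schedule"]
y = [1, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # Step 1: extract vocabulary
selector = SelectKBest(chi2, k=3).fit(X, y)  # Steps 2-3: score and rank terms
keep = selector.get_support()                # Step 4: keep the top k terms
print([t for t, k in zip(vectorizer.get_feature_names_out(), keep) if k])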
Dimensionality Reduction
• The number of input features, variables, or columns
present in a given dataset is known as dimensionality, and
the process to reduce these features is called
dimensionality reduction.
• In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Benefits of applying
Dimensionality Reduction
• By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
• Less computation/training time is required for reduced dimensions of features.
• Reduced dimensions of features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multicollinearity.
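• A minimal dimensionality-reduction sketch using PCA, one common technique; the random data is assumed for illustration:

# Project 4 input features down to 2 derived components with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)        # 100 samples, 4 input features (assumed)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # new shape: (100, 2)
print(X_reduced.shape, pca.explained_variance_ratio_)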
Differentiate between Feature Selection and Dimensionality Reduction
• Feature selection keeps a subset of the original features and discards the rest, so the retained features stay interpretable.
• Dimensionality reduction transforms the original features into a new, smaller set of derived features (for example, principal components), which may not correspond directly to any original feature.
Applications of Text Categorization and Filtering
• Text categorization and filtering involve organizing and extracting relevant
information from text data based on predefined categories or criteria.
• These methods are widely used in various real-life applications, such as
1. spam detection,
2. sentiment analysis, and
3. Classifying Advertisements
4. Text Filtering
1. Spam Detection
• Purpose: Identify and filter out spam (unwanted) emails or messages.
• How It Works: A classifier is trained on labeled examples of spam and non-spam emails.
• It looks for features like: Specific keywords: "win", "free", "offer".
2. Sentiment Analysis
• Purpose: Analyze text to determine the sentiment (positive, negative,
or neutral).
• How It Works: The system identifies opinion-based words and phrases in
text.
• Assigns a sentiment score based on the context.
• Example:
• Review: "The movie was amazing, I loved it!" → Positive Sentiment.
• Review: "Terrible customer service, very disappointed." → Negative
Sentiment.
3. Classifying Advertisements:
Purpose: Categorize ads into relevant categories for better targeting.
• How It Works:Text in the ad (e.g., title, description) is analyzed to identify
the most suitable category.
• Categories could include "Jobs," "Real Estate," "Electronics," etc.
• Example:
• Ad: "Brand new iPhone 14 for sale. Excellent condition." → Classified as
Electronics.
• Ad: "Looking for a software developer in New York" → Classified as Jobs.
4. Text Filtering
• Text filtering is a technique used to remove or modify unwanted content
from user interactions, such as messages or search queries, based on
certain criteria.
• The different types of text filtering are:
1. Content-based Filtering
2. Collaborative Filtering
3. Hybrid Filtering
1. Content-Based Filtering
• Definition: This approach recommends items based on the
similarity between the content of items and the user’s
past preferences.
• How it Works:
• Analyzes the features of the items (e.g., keywords and descriptions).
• Matches these features with a user profile built from previously interacted content.
• Example:
• A movie recommendation system suggests films with similar genres, actors, or directors as the movies a user has already rated highly.
• 2. Collaborative Filtering
• Definition: This approach relies on user interaction data
(ratings, clicks, etc.) to recommend items based on similarities
between users or items.
• 3. Hybrid Filtering
• Definition: This approach combines content-based and
collaborative filtering methods to leverage their strengths
and mitigate their weaknesses.
• Example:
• A music app combines user preferences (content-based) and the
listening habits of similar users (collaborative) to recommend
new songs.
Differentiate Between Information Filtering and Retrieval
• Information retrieval answers a user's short-term, ad-hoc query against a relatively stable collection of documents.
• Information filtering matches a stream of incoming documents against a long-term user profile (for example, a news feed or a spam filter).
Difference Between Classification and Clustering
• Classification is supervised: the classes are predefined and a model is trained on labeled examples.
• Clustering is unsupervised: groups are discovered from unlabeled data based on similarity.
Clustering Techniques
• Clustering is a process of grouping a set of objects or data points into
clusters so that:
• Similar objects are grouped together in one cluster.
• Dissimilar objects are placed in different clusters.
• When we want to divide a large group of things (like customers, students,
or items) into smaller groups based on how similar they are, we use a
process called Clustering.
• Clustering is the most common form of Unsupervised Learning.
• After clustering, each cluster is assigned a number called a Cluster ID.
• The two most popular clustering algorithms are K-Means and Hierarchical Clustering.
Types of Clustering Methods
Partitioning Clustering
Divides the dataset into k distinct, non-hierarchical groups or clusters.
It is also known as the centroid-based method.
Algorithm: K-Means
Example: Grouping customers in a store based on purchasing behavior (e.g., frequent buyers, occasional buyers).
• Density-Based Clustering
Clusters are formed based on density of data points.
• Dense areas form clusters, while sparse regions are considered noise or outliers.
• A sparse region typically refers to an area in data or a system where information
or resources are distributed sparsely or unevenly.
• Algorithm: DBSCAN

Example:
• Imagine you are grouping houses in a city based on their price and size:
• Most houses are grouped in areas where the prices and sizes are close to each other
(e.g., small houses with low prices and large houses with high prices form clusters).
A few houses are isolated because:
• They are too expensive for their size (an outlier).
• They are in remote areas with very few neighbors (sparse region).
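• A DBSCAN sketch of this house example, with assumed toy values for price and size:

# Dense neighborhoods become clusters; isolated points become noise.
import numpy as np
from sklearn.cluster import DBSCAN

# columns: price, size (toy values; units are illustrative)
houses = np.array([[20, 600], [22, 650], [21, 620],     # small, cheap cluster
                   [90, 2500], [95, 2580], [92, 2540],  # large, costly cluster
                   [80, 1500]])                         # isolated house
labels = DBSCAN(eps=100, min_samples=2).fit_predict(houses)
print(labels)  # -1 marks noise/outliers; other integers are cluster IDs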
Distribution Model-Based Clustering
Assumes data is generated by a specific probability distribution and tries to
fit the data to that model.
• Algorithm: Expectation-Maximization (EM)
Example: Grouping students based on their test scores using Gaussian
distributions.
• Expectation Step (E-step):The algorithm guesses which probability
distribution each data point likely belongs to based on the current
parameters of the distributions.
• Step 1: Guess the number of clusters. Let's say you assume there are 3
clusters (low, medium, high scores).
• Step 2: In the E-step, the algorithm estimates which students belong to each
of these three groups based on their test scores.
• Step 3:
• In the M-step, the algorithm updates the parameters of the Gaussian
distributions (e.g., the mean and standard deviation) for each group based
on the students assigned to each group.
• The EM algorithm helps you find groups even if the data is not perfectly
separated, by fitting it into a distribution (like a Gaussian curve) and refining
it over time.
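• A sketch of the student-scores example with scikit-learn's GaussianMixture, which runs the E and M steps internally; the scores are assumed toy values:

# Fit 3 Gaussians (low / medium / high scores) via Expectation-Maximization.
import numpy as np
from sklearn.mixture import GaussianMixture

scores = np.array([35, 38, 42, 60, 63, 65, 88, 90, 93]).reshape(-1, 1)
gm = GaussianMixture(n_components=3, random_state=0).fit(scores)
print(gm.means_.ravel())   # learned group means after the E/M iterations
print(gm.predict(scores))  # which Gaussian each student belongs to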
Connectivity-Based Clustering
Also called hierarchical clustering, it builds a tree-like structure of clusters.
1. Objects close to each other are grouped first, and this process continues.
2. Algorithms: Agglomerative Hierarchical Clustering, Divisive Clustering

Example: Organizing books in a library by topics and subtopics.


• Step-by-Step Example:
1. Start with individual books as separate clusters:
   1. Book A: "Data Science Basics"
   2. Book B: "Machine Learning Algorithms"
   3. Book C: "Data Structures and Algorithms"
   4. Book D: "Python Programming"
   5. Book E: "Deep Learning for Beginners"
• Find the closest pairs of books (based on similarity in content):
• Book A ("Data Science Basics") and Book B ("Machine Learning
Algorithms") are closest because both are about data science and machine
learning.
• Book C ("Data Structures and Algorithms") is closest to Book D
("Python Programming") because both are related to computer
science.
• Continue merging:
• Now, you have the following clusters:
• Cluster 1: "Data Science Basics" and "Machine Learning Algorithms"
• Cluster 2: "Data Structures and Algorithms" and "Python Programming"
• Book E ("Deep Learning for Beginners") is still in its own cluster.
• Now, Cluster 1 and Cluster 2 are both about computer science, so you merge
these into a larger cluster.
• Book E could be grouped into the "deep learning" subtopic.
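• A rough sketch of this book example with scikit-learn, using TF-IDF similarity of the titles as a stand-in for "similarity in content" (with such short titles, the groups may differ a little from the narrative above):

# Agglomerative clustering of the five book titles into 3 clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

books = ["Data Science Basics", "Machine Learning Algorithms",
         "Data Structures and Algorithms", "Python Programming",
         "Deep Learning for Beginners"]
X = TfidfVectorizer().fit_transform(books).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(dict(zip(books, labels)))  # books sharing a label were merged first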
K-Means Clustering
• K-Means Clustering is an Unsupervised Machine Learning algorithm,
which groups the unlabeled dataset into different clusters.
• Unsupervised Machine Learning is the process of teaching a computer to
use unlabeled, unclassified data and enabling the algorithm to operate
on that data without supervision.
• Without any previous data training, the machine’s job in this case is to
organize unsorted data according to parallels, patterns, and variations.
• K-means is an iterative, centroid-based clustering algorithm that
partitions a dataset into similar groups based on the distance between their
centroids.
• Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
• It starts by randomly placing the cluster centroids in the space.
• Each data point is then assigned to one of the clusters based on its distance from the cluster's centroid.
• After assigning each point to a cluster, new cluster centroids are computed.
• This process runs iteratively until the clusters stabilize, as sketched below.
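• A minimal K-Means sketch on assumed toy 2-D points, showing the labels and the final centroids:

# Two obvious groups; K-Means iteratively assigns points and moves centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # one natural group
              [10, 2], [10, 4], [10, 0]])  # another natural group
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster ID assigned to each point
print(km.cluster_centers_)  # final centroids after iterative refinement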
Hierarchical Clustering
• A Hierarchical clustering method works via grouping data into a tree
of clusters.
• Hierarchical clustering begins by treating every data point as a separate
cluster.
• Then, it repeatedly executes the subsequent steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters.
3. Continue these steps until all the clusters are merged together.
Agglomerative Hierarchical Clustering (Bottom-Up Approach):
• This is the most common approach.
• Starts with each data point as its own individual cluster.
• The algorithm gradually merges the closest clusters to form larger clusters
until all points are in one cluster.
Divisive Hierarchical Clustering (Top-Down Approach):
This approach starts with all data points in a single cluster.
The algorithm splits the large cluster into smaller clusters, based on some
criteria, and continues this splitting process until each data point is in its own
individual cluster.
Example: You start with all books in the library in one big "Books" cluster, and
then you split them into "Science," "Arts," "Technology," etc., until each book
is in its own cluster.
Methods to Find the Closest Pair of Clusters
• In Hierarchical Clustering, one important aspect is deciding how
to determine the "closeness" between clusters when they need to
be merged.
• There are several methods used to find the closest pair of
clusters, and the most common ones are:
1. Single-Linkage (Nearest Point Linkage)
2. Complete-Linkage (Farthest Point Linkage)
3. Average-Linkage
1. Single-Linkage (Nearest Point Linkage)
In single linkage, the distance between two clusters is defined as the shortest
distance between any two points—one from each cluster.
Essentially, it's the distance between the closest pair of points in the two
clusters.
For two clusters R and S, the single linkage returns the minimum distance
between two points i and j such that i belongs to R and j belongs to S.
2. Complete Linkage:(Farthest Point Linkage)
• In complete linkage, the distance between two clusters is defined as the
maximum distance between any two points—one from each cluster.
• Essentially, it's the distance between the farthest pair of points in the two
clusters.
• For two clusters R and S, the complete linkage returns the maximum distance
between two points i and j such that i belongs to R and j belongs to S.
3. Average-Linkage
In average linkage, the distance between two clusters is defined as the average
distance between all pairs of points—one from each cluster.
This means you calculate the distance between every pair of points, one
from each cluster, and then take the average of all these distances.
For two clusters R and S, first the distance between every data point i in R and every data point j in S is computed, and then the arithmetic mean of these distances is calculated. Average linkage returns this arithmetic mean, as in the sketch below.
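• A SciPy sketch comparing the three linkage criteria on the same assumed toy points; the merge order of the hierarchy depends on the linkage rule:

# Build a hierarchy under each linkage rule, then cut it into 3 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)  # pairwise merges under this linkage rule
    print(method, fcluster(Z, t=3, criterion="maxclust"))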
Difference Between K-Means and Hierarchical Clustering

• K-Means: Using a pre-specified number of clusters, the method assigns records to mutually exclusive, roughly spherical clusters based on distance.
  Hierarchical: Methods can be either divisive or agglomerative.

• K-Means: Needs advance knowledge of K, i.e. the number of clusters into which you want to divide your data.
  Hierarchical: One can stop at any number of clusters found appropriate by interpreting the dendrogram (bottom to top or top to bottom).

• K-Means: One can use the median or mean as a cluster centre to represent each cluster.
  Hierarchical: Agglomerative methods begin with n clusters and sequentially combine similar clusters until only one cluster remains; divisive methods work in the opposite direction, beginning with one cluster that includes all the records.

• K-Means: Normally less computationally intensive and suited to very large datasets; it is relatively simple and fast because it only calculates distances and updates cluster centers.
  Hierarchical: Especially useful when the goal is to arrange the clusters into a natural hierarchy (smaller clusters nested within larger ones).

• K-Means: Since it starts with a random choice of centroids, the results produced by running the algorithm many times may differ.
  Hierarchical: Results are reproducible across runs.

• K-Means: Simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  Hierarchical: A set of nested clusters arranged as a tree.
Evaluation of Clustering Results
• Three important Factors:
1. Clustering Tendency
2. Number of clusters k
3. Clustering Quality
Clustering Tendency
Clustering Tendency is the ability to determine whether a
dataset naturally forms groups (clusters) before actually
applying a clustering algorithm. It's like asking, "Can my data be grouped meaningfully, or is it just random?"
Why is Clustering Tendency Important?
• Not all datasets are suitable for clustering.
• If the data doesn't have a natural structure, clustering
may give meaningless or arbitrary results.
• Clustering tendency helps us check if clustering is even
possible or worth it for a given dataset.
Example: Suppose you want to group customers based on their buying habits.
If customers clearly behave differently (e.g., some buy luxury items, others buy budget products), clustering is possible.
If all customers buy random things with no clear patterns, clustering won't make sense.
2. Number of clusters k :
Choosing the number of clusters (K) in clustering is a critical step, and there
are different approaches to decide the best K.
Here's a breakdown of the two main approaches
1. Domain Knowledge Approach
2. Data-Driven Approach

1. Domain Knowledge Approach:
This approach uses prior knowledge of the data or the problem you're solving to choose the number of clusters.
If you already have some knowledge about the data or the context, you can make an educated guess about the appropriate number of clusters.
Example:
Imagine you are working for a clothing store and want to cluster customers
based on their shopping habits.
If you know that there are three main types of customers (budget, mid-range,
and high-end shoppers),
you might choose K=3 based on that understanding.
2. Data-Driven Approach:
This approach uses data itself to find the best number of clusters.
Here are a few methods within this approach:
1. Empirical Approach
2. Elbow Method
3. Statistical Methods
Empirical Approach:
You try different values of K (for example, K=2, 3, 4, etc.) and see which one
produces meaningful or useful clusters.
Elbow Method:
The Elbow Method is a way to help decide the best number of clusters (K)
when you're using a clustering algorithm like K-Means.
How Does It Work?
• Choose different K values (the number of clusters): start with K=1 (one cluster) and increase K step by step (K=2, K=3, etc.).
• Plot a graph: put the number of clusters (K) on the X-axis and the within-cluster sum of squares (WCSS) on the Y-axis.
• Look for the "elbow" point: the elbow is the point where adding more clusters no longer improves the results much. For example, when plotting the points for K = 1, 2, 3, 4, etc., you may notice the drop slows significantly at 3 clusters. A sketch follows below.
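• A sketch of the Elbow Method on assumed synthetic data containing three natural groups; the inertia (WCSS) curve flattens at K=3:

# Plot K against inertia and look for the elbow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [6, 6], [0, 8])])
ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
plt.plot(ks, inertias, "o-")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.show()  # the drop flattens at K=3: the elbow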
3. Statistical Methods:
Gap Statistics helps you find the best number of clusters by comparing the
performance of your clustering model with random clustering.
It looks for a "gap" between the performance of your real clustering and
random clustering to determine how meaningful the clusters are.
3. Clustering Quality:
When evaluating the quality of a clustering result, we use two types
of measures:
1. Extrinsic and
2. Intrinsic.
These measures help assess how well the clustering reflects the actual structure
of the data.
Clustering for Query Expansion and Result Grouping
• Query Expansion: This refers to the process of enhancing a search query by
adding additional terms or phrases to retrieve more relevant results.
Clustering techniques can help identify groups of related terms or
documents, which can then suggest additional keywords for expanding the
original query.
• Result Grouping: Once search results are retrieved, clustering algorithms
group them into clusters based on similarity. This helps users navigate results
by topic or theme, making it easier to explore related information.
• Dice's Coefficient for Measuring Similarity
• Dice's Coefficient is a statistical measure used to determine the
similarity between two sets. It is often used in clustering to compare the
similarity between terms, queries, or documents.
• Formula: Dice's coefficient between two sets A and B is defined as:
Dice(A, B) = 2 |A ∩ B| / (|A| + |B|)
• Explanation with Example (see the sketch below):
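• A small Python sketch with an assumed pair of term sets:

# Dice(A, B) = 2*|A intersect B| / (|A| + |B|)
def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

A = {"information", "retrieval", "system"}  # assumed example sets
B = {"information", "system", "design"}
print(dice(A, B))  # 2*2 / (3+3) ≈ 0.667: fairly similar term sets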
• Intrinsic Measures (Internal Measures)
• Intrinsic measures evaluate the clustering quality based only on the data
and the clusters themselves, without any external reference.
• These metrics focus on how well the data points are grouped within each
cluster and how separated the clusters are.
• Extrinsic Measures (External Measures)
• Extrinsic measures assess the quality of the clustering by comparing the clustering result with some external reference (e.g., predefined labels).
• These measures are useful when you have labeled data to compare your
clustering result against.
