Unit 3

1. Differentiate between supervised and unsupervised learning methods.

Supervised and unsupervised learning are two main categories of machine learning
techniques. The key differences between these two approaches are:

1. Supervised learning:
- In supervised learning, the machine learning algorithm is trained on labeled data,
which means that the data has both input features and corresponding output labels.
- The goal of supervised learning is to predict or classify new, unseen data based on the
labeled data.
- Supervised learning algorithms are often used for tasks such as regression,
classification, and object detection.
- Examples of supervised learning algorithms include linear regression, decision trees,
and neural networks.

2. Unsupervised learning:
- In unsupervised learning, the machine learning algorithm is trained on unlabeled data,
which means that the data has only input features and no corresponding output labels.
- The goal of unsupervised learning is to find patterns or structure in the data that can
help us gain insights or make predictions.
- Unsupervised learning algorithms are often used for tasks such as clustering,
dimensionality reduction, and anomaly detection.
- Examples of unsupervised learning algorithms include k-means clustering, principal
component analysis (PCA), and autoencoders.

In summary, supervised learning relies on labeled data to make predictions or classifications, while unsupervised learning aims to discover patterns or structure in unlabeled data.
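
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn is available; the dataset and model choices are illustrative, not prescribed by the text above). The classifier is trained on labeled data, while the clustering algorithm receives only the input features.

# Minimal sketch contrasting supervised and unsupervised learning.
# Assumes scikit-learn is installed; dataset and models are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)        # X: input features, y: output labels

# Supervised: the model sees both X and y, then predicts labels for new data.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: the model sees only X and discovers cluster structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignment:", km.labels_[:1])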

2. Differentiate between classification and prediction.


Classification and prediction are two types of machine learning tasks that are used for
different purposes. Here are the differences between them:

1. Definition:
- Classification is a type of supervised learning method that involves categorizing input
data into pre-defined classes or categories.
- Prediction is a type of machine learning method that involves estimating the future
outcome or value of a particular variable.

2. Goal:
- The goal of classification is to accurately predict the class or category of a new input
data based on previous examples.
- The goal of prediction is to estimate the future value or outcome of a particular
variable based on historical data.

3. Input Data:
- In classification, the input data is usually pre-labeled and belongs to a specific class
or category.
- In prediction, the input data is usually continuous and not pre-labeled.

4. Output:
- In classification, the output is a category or label that the input data belongs to.
- In prediction, the output is a continuous value that estimates the future value of a
particular variable.

5. Examples:
- Examples of classification include email spam filtering, image recognition, and
sentiment analysis.
- Examples of prediction include stock price forecasting, weather forecasting, and
sales forecasting.

In summary, classification is used to categorize input data into pre-defined classes, while
prediction is used to estimate the future value of a particular variable based on
historical data.
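
A small hedged sketch of the difference in output (using scikit-learn and made-up toy values): the classifier returns a discrete category, while the prediction (regression) model returns a continuous estimate.

# Classification returns a category; prediction (regression) returns a continuous value.
# The toy numbers below are made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: label each email as spam (1) or not spam (0) from two features.
X_cls = [[0.1, 5], [0.9, 40], [0.2, 3], [0.8, 55]]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[0.85, 50]]))        # output is a class label, e.g. [1]

# Prediction: estimate a future sales figure (a number) from past figures.
X_reg = [[100], [120], [140], [160]]
y_reg = [105, 125, 150, 165]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[180]]))             # output is a continuous value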

3. Discuss the issues in classification.


Classification is a supervised learning method in which an algorithm is trained to assign
predefined labels to input data based on their characteristics or features. However,
there are several issues that arise during the classification process, some of which are:

1. Overfitting: Overfitting occurs when a classification algorithm is too complex and tries
to fit the training data too closely. This can result in poor performance on new data and
inaccurate predictions.

2. Underfitting: Underfitting occurs when a classification algorithm is too simple and fails to capture the complexity of the data. This can also lead to poor performance and inaccurate predictions (a small sketch contrasting overfitting and underfitting follows this list).

3. Imbalanced Data: In some cases, the data used for classification may be imbalanced,
meaning that there are significantly more instances of one class than another. This can
lead to biased models that are better at predicting the majority class and poor at
predicting the minority class.

4. Feature Selection: Choosing the right features for classification is critical. The quality
and relevance of the features can greatly affect the performance of the classification
algorithm.

5. Model Interpretability: In some cases, it may be important to understand how a classification model arrives at its predictions. However, some classification algorithms, such as neural networks, are difficult to interpret.

6. Scalability: Some classification algorithms may not be scalable, meaning that they do
not perform well on large datasets or require significant computational resources.

7. Noise: Noise in data can affect the performance of classification algorithms. This noise
can be in the form of outliers, errors, or irrelevant features.
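
The first two issues can be observed directly by comparing training and test accuracy. The sketch below is illustrative only (it assumes scikit-learn; the exact scores depend on the dataset and the random split): a very deep decision tree tends to fit the training data almost perfectly but generalize worse, while a depth-1 tree may be too simple.

# Sketch: spotting overfitting/underfitting via the train vs. test accuracy gap.
# Assumes scikit-learn; scores vary with the dataset and random split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unrestricted depth: likely to overfit the training data.
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
# Depth-1 "stump": likely to underfit.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep tree", deep), ("stump", stump)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))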

4. Describe the types of data in cluster analysis with example.

Cluster analysis is a technique used in machine learning and data mining to group similar
data points together based on their features or attributes. The type of data used in
cluster analysis can be categorized into four types:

1. Interval data: This type of data is numerical and continuous, with equal intervals
between each value. Examples include temperature in Celsius or Fahrenheit, or age in
years.

2. Ratio data: Similar to interval data, ratio data is also numerical and continuous, but it
has a meaningful zero point. Examples include height in centimeters, weight in
kilograms, or income in dollars.

3. Ordinal data: This type of data is categorical and has a natural ordering or ranking
between the categories. Examples include educational level (e.g., high school, bachelor's
degree, master's degree), income level (e.g., low, medium, high), or movie ratings (e.g.,
one star, two stars, three stars).

4. Nominal data: Nominal data is also categorical, but there is no natural order or
ranking between the categories. Examples include gender (male or female),
race/ethnicity (White, Black, Hispanic, etc.), or type of car (sedan, SUV, truck).

In cluster analysis, each data point is assigned to a cluster based on the similarity of its
features to the other data points in the cluster. The type of data used can affect the
clustering algorithm used and the interpretation of the resulting clusters.

5. Discuss density-based methods of clustering. Discuss some of the important requirements of a good clustering algorithm.

Density-based methods of clustering are used to identify clusters in data that have high
density regions. These methods are particularly useful for datasets with arbitrary shapes
and sizes, and can identify clusters of varying densities. Some popular density-based
clustering algorithms include DBSCAN, OPTICS, and DENCLUE.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm. It groups together points that are close to each other and have a high density. Points that are far away from any other points are considered noise.

OPTICS (Ordering Points To Identify the Clustering Structure) is another density-based clustering algorithm. It produces an ordering of the points that represents the data's density-based clustering structure, from which clusters of different sizes, shapes, and densities can be extracted.

DENCLUE (DENsity-based CLUstEring) is a clustering algorithm that uses a density function to model the density distribution of the data. It can identify clusters of different shapes and sizes, and is particularly useful for high-dimensional datasets.
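
A minimal usage sketch of DBSCAN with scikit-learn is shown below; the half-moon dataset and the eps and min_samples values are illustrative choices, not taken from the text above, and in practice these parameters are tuned to the data.

# Minimal DBSCAN sketch on a non-spherical dataset (two half-moons).
# Assumes scikit-learn; eps and min_samples are illustrative and normally tuned.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise; the rest form density-connected clusters.
print("clusters found:", len(set(labels) - {-1}), "noise points:", list(labels).count(-1))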

Requirements of a good clustering algorithm:

1. Scalability: The clustering algorithm should be able to handle large datasets efficiently.
2. Robustness: The clustering algorithm should be able to handle noisy data and outliers.

3. Flexibility: The clustering algorithm should be able to handle different types of data,
such as continuous, categorical, and mixed data.

4. Interpretability: The clustering algorithm should produce meaningful and interpretable results that can be easily understood by users.

5. Validity: The clustering algorithm should produce valid and reliable results that are consistent with the underlying data and problem domain.

6. What are the different distance measures used in clustering?

There are several distance measures that can be used in clustering algorithms,
depending on the type of data being analyzed and the specific needs of the application.
Some commonly used distance measures in clustering include:

Euclidean distance: This is the most commonly used distance measure in clustering. It
calculates the straight-line distance between two points in n-dimensional space.
Euclidean distance is suitable for continuous data, such as temperature or height.

Manhattan distance: Also known as city-block distance, it calculates the distance between two points by adding the absolute differences of their coordinates. Manhattan distance is suitable for discrete data, such as location-based data.

Cosine similarity: This distance measure is commonly used in text mining and natural
language processing. It measures the cosine of the angle between two vectors in a high-
dimensional space. Cosine similarity is suitable for text data, where the frequency of
occurrence of words is used to represent the documents.

Hamming distance: This distance measure is used for binary data, where each feature
can only take on two values (0 or 1). It calculates the number of positions at which the
two vectors differ.

Jaccard similarity: This distance measure is used for categorical data, where each
feature can take on a finite number of discrete values. It measures the similarity
between two sets of data by dividing the number of elements that they have in common
by the total number of elements across both sets.
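
The sketch below (assuming SciPy is installed) evaluates each of these measures on small made-up vectors; note that SciPy's hamming function returns the fraction, rather than the count, of differing positions.

# Small sketch of the distance/similarity measures above, using SciPy.
# The vectors are made up purely for illustration.
from scipy.spatial import distance

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print("Euclidean:", distance.euclidean(a, b))            # straight-line distance
print("Manhattan:", distance.cityblock(a, b))            # sum of absolute differences
print("Cosine similarity:", 1 - distance.cosine(a, b))   # SciPy returns cosine distance

u, v = [1, 0, 1, 1, 0], [0, 0, 1, 1, 1]
print("Hamming:", distance.hamming(u, v))                # fraction of differing positions
print("Jaccard similarity:", 1 - distance.jaccard(u, v)) # for binary/set-like data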

7. Explain the following algorithms: i) STING ii) DBSCAN iii) BIRCH iv) K-means

i) STING (Statistical Information Grid-based Algorithm)

STING is a grid-based clustering algorithm that constructs a hierarchical, multi-level grid structure to represent the data space. It consists of the following steps:
1. Partition the data space into a grid structure.

2. Compute the statistical information for each cell in the grid.

3. Merge the cells based on the similarity of their statistical information.

4. Repeat steps 2 and 3 until the desired number of clusters is obtained.

ii) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups together points that are
closely packed together, while also identifying noise points that are outside of any
cluster. It consists of the following steps:

1. Select a random point and find all the points within a specified distance (called
epsilon) of that point.

2. If there are enough points within epsilon (at least a minimum number, MinPts), create a new cluster and add all the points to it.

3. For each point in the cluster, find all the points within epsilon distance of that point
and add them to the cluster.

4. Repeat steps 2 and 3 until there are no more points that can be added to the cluster.

5. Repeat steps 1-4 for all unvisited points in the dataset.

iii) BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

BIRCH is a hierarchical clustering algorithm that uses a tree-like structure (a clustering feature, or CF, tree) to represent the data. It consists of the following steps:

1. Read in the data and create initial subclusters, constrained by a threshold value (the maximum subcluster radius) and a branching factor (the maximum number of children per tree node).

2. For each cluster, calculate the cluster's centroid and radius.

3. Merge clusters based on their proximity to each other, using the centroid and radius
values.

4. Repeat steps 2 and 3 until there is only one cluster left.


iv) K-means

K-means is a centroid-based clustering algorithm that partitions the data into k clusters
based on the distance between points and the centroid of each cluster. It consists of the
following steps:

1. Select k initial centroids randomly from the dataset.

2. Assign each point to the nearest centroid.

3. Recalculate the centroid of each cluster based on the points assigned to it.

4. Repeat steps 2 and 3 until the centroids no longer move or a maximum number of
iterations is reached.
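
As an illustrative (not optimized) rendering of the K-means steps just listed, the following NumPy sketch selects random initial centroids, assigns points to the nearest centroid, recomputes the centroids, and repeats until they stop moving.

# Illustrative NumPy implementation of the K-means steps described above.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids at random from the dataset.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
print(kmeans(X, k=3))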

8. Explain the k-nearest neighbour classification algorithm.

The k-nearest neighbor (k-NN) algorithm is a non-parametric machine learning algorithm used for classification and regression. It is a type of instance-based learning where the model doesn't learn a function but memorizes the training data instead.

The k-NN classification algorithm works by assigning a new data point to the class that is
most common among its k nearest neighbors in the training dataset, where k is a
positive integer value. Here is a step-by-step explanation of how the k-NN algorithm
works for classification:

Select the value of k: Choose a positive integer value for k, which represents the number
of nearest neighbors to consider when classifying a new data point.

Calculate distance: Calculate the distance between the new data point and all the
training data points using a distance metric such as Euclidean distance, Manhattan
distance, or Minkowski distance.

Find k-nearest neighbors: Sort the distances in ascending order and select the k nearest neighbors to the new data point.

Assign class label: Count the number of data points in each class among the k nearest neighbors and assign the new data point to the class with the highest count.

Return the predicted class: After classifying all new data points, the model can be
evaluated by comparing the predicted class labels to the actual class labels in the test
dataset.

The k-NN algorithm can also be used for regression by predicting the numerical value of
a new data point based on the average of the k nearest neighbors' values.
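
A short NumPy sketch of the classification steps above (distance computation, neighbour selection, majority vote); the training points and labels are made up for illustration.

# Illustrative k-NN classifier: Euclidean distances, k nearest neighbours, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbours' class labels.
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))   # expected: "A"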

9. Explain the metrics used for classifier accuracy and error measures.

Metrics used for classifier accuracy and error measures are important in evaluating the
performance of classification models. Here are some common metrics:

1. Confusion Matrix: A table that shows the number of true positives, false positives,
true negatives, and false negatives in a classification model.

2. Accuracy: The proportion of correctly classified instances out of the total number of
instances. Accuracy = (TP + TN) / (TP + TN + FP + FN).

3. Precision: The proportion of true positives out of the total number of predicted
positives. Precision = TP / (TP + FP).

4. Recall (Sensitivity): The proportion of true positives out of the total number of actual
positives. Recall = TP / (TP + FN).

5. Specificity: The proportion of true negatives out of the total number of actual
negatives. Specificity = TN / (TN + FP).

6. F1 Score: A weighted average of precision and recall, where the F1 score reaches its
best value at 1 and worst at 0. F1 score = 2 * (precision * recall) / (precision + recall).

7. ROC Curve: A graphical representation of the performance of a binary classifier system as the discrimination threshold is varied. The ROC curve plots the true positive rate against the false positive rate at different threshold settings.

8. AUC: Area under the ROC curve. It is used to compare different binary classifiers, and
a higher AUC indicates better performance.

9. Error rate: The proportion of misclassified instances out of the total number of
instances. Error rate = (FP + FN) / (TP + TN + FP + FN).

10. Misclassification cost: The cost associated with each type of error (false positive and
false negative). It is used when the cost of misclassification is not the same for all errors.
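
The formula-based metrics above can be reproduced directly from confusion-matrix counts; the sketch below uses made-up counts purely for illustration (in practice the same values are available from libraries such as sklearn.metrics).

# Computing the metrics above from made-up confusion-matrix counts.
TP, FP, TN, FN = 40, 10, 45, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)           # sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + TN + FP + FN)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f} error_rate={error_rate:.3f}")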

10. Classify the various clustering methods.

There are several ways to classify clustering methods. One possible classification is:

1. Hierarchical vs. Partitioning:

- Hierarchical clustering builds a tree-like structure of clusters, starting from individual points and merging them into larger clusters, either by agglomerative (bottom-up) or divisive (top-down) methods.

- Partitioning clustering divides the data into non-overlapping groups, or partitions, usually by optimizing a clustering criterion, such as minimizing the sum of squared distances within each cluster.

2. Centroid-based vs. Density-based:


- Centroid-based clustering aims to find a central point, or centroid, for each cluster,
such as the mean or median of the data points assigned to the cluster. K-means is a
popular example of this type of clustering.

- Density-based clustering focuses on areas of higher density in the data space, defining
clusters as regions of points that are more closely packed together than the surrounding
areas. DBSCAN and OPTICS are examples of this type of clustering.

3. Prototype-based vs. Model-based:

- Prototype-based clustering creates representative prototypes, or exemplars, for each cluster, such as medoids or centroids, which can be used to classify new data points based on their distance to the prototypes. K-means and PAM (Partitioning Around Medoids) are examples of this type of clustering.

- Model-based clustering assumes that the data points are generated from a statistical
model, such as a mixture of Gaussian distributions, and seeks to estimate the
parameters of the model that best fit the data. Expectation-maximization (EM) is an
example of this type of clustering.

4. Fuzzy vs. Crisp:

- Fuzzy clustering assigns a degree of membership to each point for each cluster,
indicating the degree to which the point belongs to the cluster. Fuzzy C-means is an
example of this type of clustering.

- Crisp clustering assigns each point to a single cluster with a binary membership,
indicating whether the point belongs to the cluster or not.

5. Partition validity-based vs. Cluster validity-based:

- Partition validity-based (external) evaluation assesses the quality of a clustering by using external criteria, such as the similarity of the clusters to a given ground truth or the ability of the clusters to predict a certain outcome. The Adjusted Rand Index (ARI) and the F-measure are examples of such external criteria.

- Cluster validity-based (internal) evaluation assesses the quality of a clustering by using internal criteria, such as the compactness and separation of the clusters or the stability of the clustering algorithm. The silhouette coefficient and the Dunn index are examples of such internal criteria.

11. Consider the following data points:
A1 (2, 10), A2 (2, 5), A3 (8, 4)
B1 (5, 8), B2 (7, 5), B3 (6, 4)
C1 (1, 2), C2 (4, 9)
Considering A1, B1 and C1 as the initial centroids, use the K-means algorithm to find the three cluster centers after the first round of execution.

To apply the K-means algorithm, we first need to determine the initial centroids. In this
case, we are given that A1, B1, and C1 are the initial centroids. We will use these
centroids to create the clusters.

Step 1: Assign each point to the nearest centroid

For each point, we calculate its distance to each centroid and assign it to the closest
centroid. The distance measure used here is the Euclidean distance.

Distances from each point to the centroids A1 (2, 10), B1 (5, 8) and C1 (1, 2):

A1 (2, 10): 0.00, 3.61, 8.06 → Cluster 1
A2 (2, 5): 5.00, 4.24, 3.16 → Cluster 3
A3 (8, 4): 8.49, 5.00, 7.28 → Cluster 2
B1 (5, 8): 3.61, 0.00, 7.21 → Cluster 2
B2 (7, 5): 7.07, 3.61, 6.71 → Cluster 2
B3 (6, 4): 7.21, 4.12, 5.39 → Cluster 2
C1 (1, 2): 8.06, 7.21, 0.00 → Cluster 3
C2 (4, 9): 2.24, 1.41, 7.62 → Cluster 2

Cluster 1: A1 (2, 10)

Cluster 2: A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C2 (4, 9)

Cluster 3: A2 (2, 5), C1 (1, 2)

Step 2: Recalculate the centroids

For each cluster, we recalculate the centroid by taking the mean of all the points in the cluster.

Cluster 1: New centroid = (2, 10)

Cluster 2: New centroid = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)

Cluster 3: New centroid = ((2+1)/2, (5+2)/2) = (1.5, 3.5)

Therefore, after the first round of execution, the three cluster centers are (2, 10), (6, 6) and (1.5, 3.5).
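
This single iteration can be checked with a few lines of NumPy (a sketch, assuming NumPy is available):

# Verifying the first K-means iteration above with NumPy.
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)   # A1, B1, C1

# Assign each point to its nearest centroid (Euclidean distance).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Recompute each centroid as the mean of its assigned points.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])
print(new_centroids)   # [[2. 10.], [6. 6.], [1.5 3.5]]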

12. Given the data set of BUYS_COMPUTER, find the information gain of the INCOME, STUDENT and CREDIT_RATING attributes.

13. The following table consists of training data from an employee database. Given a data tuple having the values "systems," "26...30," and "46–50K" for the attributes department, age, and salary, respectively, what would a naïve Bayesian classification of the status for the tuple be?

To calculate the information gain of the INCOME, STUDENT and CREDIT_RATING attributes, we first need to calculate the entropy of the target variable (buys_computer).

The total number of tuples in the dataset = 14

Number of tuples with buys_computer = 'yes' = 9

Number of tuples with buys_computer = 'no' = 5

Entropy(S) = - (9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940

Now, let's calculate the information gain for each attribute:

1. INCOME:

Number of tuples with income <= 30 = 4

Number of tuples with income > 30 and <= 40 = 4

Number of tuples with income > 40 = 6


a) For income <= 30:

Number of tuples with buys_computer = 'yes' = 2

Number of tuples with buys_computer = 'no' = 2

Entropy(S_income<=30) = - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1.000

b) For income > 30 and <= 40:

Number of tuples with buys_computer = 'yes' = 3

Number of tuples with buys_computer = 'no' = 1

Entropy(S_income30_40) = - (3/4) * log2(3/4) - (1/4) * log2(1/4) = 0.811

c) For income > 40:

Number of tuples with buys_computer = 'yes' = 4

Number of tuples with buys_computer = 'no' = 2

Entropy(S_income>40) = - (4/6) * log2(4/6) - (2/6) * log2(2/6) = 0.918

Weighted entropy for income:

Weighted entropy(income) = (4/14) * 1.000 + (4/14) * 0.811 + (6/14) * 0.918 = 0.911


Information gain for income:

Information gain(income) = Entropy(S) - Weighted entropy(income) = 0.940 - 0.911 = 0.029

2. STUDENT:

Number of tuples with student = 'yes' = 6

Number of tuples with student = 'no' = 8

a) For student = 'yes':

Number of tuples with buys_computer = 'yes' = 2

Number of tuples with buys_computer = 'no' = 4

Entropy(S_student=yes) = - (2/6) * log2(2/6) - (4/6) * log2(4/6) = 0.918

b) For student = 'no':

Number of tuples with buys_computer = 'yes' = 7

Number of tuples with buys_computer = 'no' = 1

Entropy(S_student=no) = - (7/8) * log2(7/8) - (1/8) * log2(1/8) = 0.544

Weighted entropy for student:

Weighted entropy(student) = (6/14) * 0.918 + (8/14) * 0.544 = 0.704

Information gain for student:

Information gain(student) = Entropy(S) - Weighted entropy(student) = 0.940 - 0.704 = 0.236
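
The entropy and information-gain arithmetic above can be reproduced with a short helper (a sketch that uses only the class counts stated in this answer):

# Reproducing the entropy / information-gain arithmetic with the counts stated above.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

print(entropy([9, 5]))                                 # ~0.940
print(info_gain([9, 5], [[2, 2], [3, 1], [4, 2]]))     # income  -> ~0.029
print(info_gain([9, 5], [[2, 4], [7, 1]]))             # student -> ~0.236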
