Unit 3
Supervised and unsupervised learning are two main categories of machine learning
techniques. The key differences between these two approaches are:
1. Supervised learning:
- In supervised learning, the machine learning algorithm is trained on labeled data,
which means that the data has both input features and corresponding output labels.
- The goal of supervised learning is to predict or classify new, unseen data based on the
labeled data.
- Supervised learning algorithms are often used for tasks such as regression,
classification, and object detection.
- Examples of supervised learning algorithms include linear regression, decision trees,
and neural networks.
2. Unsupervised learning:
- In unsupervised learning, the machine learning algorithm is trained on unlabeled data,
which means that the data has only input features and no corresponding output labels.
- The goal of unsupervised learning is to find patterns or structure in the data that can
help us gain insights or make predictions.
- Unsupervised learning algorithms are often used for tasks such as clustering,
dimensionality reduction, and anomaly detection.
- Examples of unsupervised learning algorithms include k-means clustering, principal
component analysis (PCA), and autoencoders.
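As a quick illustration of the difference, here is a minimal sketch using scikit-learn: the supervised model is fitted on features together with labels, while the unsupervised model sees only the features. The toy arrays and parameter choices are assumptions made purely for illustration.

```python
# Hypothetical toy data: 4 points with 2 features each.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # output labels are available -> supervised setting

# Supervised learning: fit a mapping from the features X to the given labels y,
# then classify a new, unseen point.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2, 1.9]]))

# Unsupervised learning: no labels are given; the algorithm only looks for
# structure (here, two clusters) in X itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```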
Classification and prediction differ in the following ways:
1. Definition:
- Classification is a type of supervised learning method that involves categorizing input
data into pre-defined classes or categories.
- Prediction is a type of machine learning method that involves estimating the future
outcome or value of a particular variable.
2. Goal:
- The goal of classification is to accurately predict the class or category of a new input
data based on previous examples.
- The goal of prediction is to estimate the future value or outcome of a particular
variable based on historical data.
3. Input Data:
- In classification, the training data is labeled with discrete classes or categories.
- In prediction, the training data is labeled with continuous numerical values rather than
discrete categories.
4. Output:
- In classification, the output is a category or label that the input data belongs to.
- In prediction, the output is a continuous value that estimates the future value of a
particular variable.
5. Examples:
- Examples of classification include email spam filtering, image recognition, and
sentiment analysis.
- Examples of prediction include stock price forecasting, weather forecasting, and
sales forecasting.
In summary, classification is used to categorize input data into pre-defined classes, while
prediction is used to estimate the future value of a particular variable based on
historical data.
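A tiny sketch of this contrast, assuming scikit-learn and made-up toy data: the classifier returns a discrete label, while the regressor (numeric prediction) returns a continuous estimate.

```python
# Toy, hypothetical examples: a classifier outputs a category,
# while a regressor (numeric prediction) outputs a continuous value.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete output label.
X_cls = [[20], [35], [50], [65]]        # e.g. a customer's age
y_cls = ["no", "no", "yes", "yes"]      # e.g. buys a product or not
print(DecisionTreeClassifier().fit(X_cls, y_cls).predict([[45]]))  # -> a class label

# Prediction (regression): continuous output value.
X_reg = [[1], [2], [3], [4]]            # e.g. month index
y_reg = [100.0, 120.0, 140.0, 160.0]    # e.g. monthly sales
print(LinearRegression().fit(X_reg, y_reg).predict([[5]]))         # -> a number (about 180)
```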
Classification algorithms face several practical issues:
1. Overfitting: Overfitting occurs when a classification algorithm is too complex and fits
the training data too closely. This can result in poor performance on new data and
inaccurate predictions.
2. Imbalanced Data: In some cases, the data used for classification may be imbalanced,
meaning that there are significantly more instances of one class than another. This can
lead to biased models that are good at predicting the majority class and poor at
predicting the minority class.
3. Feature Selection: Choosing the right features for classification is critical. The quality
and relevance of the features can greatly affect the performance of the classification
algorithm.
4. Scalability: Some classification algorithms may not be scalable, meaning that they do
not perform well on large datasets or require significant computational resources.
5. Noise: Noise in the data can affect the performance of classification algorithms. This
noise can take the form of outliers, errors, or irrelevant features.
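As a small illustration of the overfitting issue, the sketch below (using scikit-learn and a synthetic dataset, both assumptions for illustration) compares the training and test accuracy of an unrestricted decision tree; a large gap between the two signals overfitting.

```python
# Synthetic data; the gap between training and test accuracy reveals overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# An unrestricted tree can memorize the (noisy) training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", deep_tree.score(X_tr, y_tr))  # close to 1.0
print("test accuracy: ", deep_tree.score(X_te, y_te))  # noticeably lower -> overfitting
```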
Cluster analysis is a technique used in machine learning and data mining to group similar
data points together based on their features or attributes. The type of data used in
cluster analysis can be categorized into four types:
1. Interval data: This type of data is numerical and continuous, with equal intervals
between each value but no meaningful zero point. Examples include temperature in
Celsius or Fahrenheit, or calendar years.
2. Ratio data: Similar to interval data, ratio data is also numerical and continuous, but it
has a meaningful zero point. Examples include height in centimeters, weight in
kilograms, or income in dollars.
3. Ordinal data: This type of data is categorical and has a natural ordering or ranking
between the categories. Examples include educational level (e.g., high school, bachelor's
degree, master's degree), income level (e.g., low, medium, high), or movie ratings (e.g.,
one star, two stars, three stars).
4. Nominal data: Nominal data is also categorical, but there is no natural order or
ranking between the categories. Examples include gender (male or female),
race/ethnicity (White, Black, Hispanic, etc.), or type of car (sedan, SUV, truck).
In cluster analysis, each data point is assigned to a cluster based on the similarity of its
features to the other data points in the cluster. The type of data used can affect the
clustering algorithm used and the interpretation of the resulting clusters.
A good clustering algorithm should satisfy several requirements:
1. Robustness: The clustering algorithm should be able to handle noisy data and outliers.
2. Flexibility: The clustering algorithm should be able to handle different types of data,
such as continuous, categorical, and mixed data.
3. Validity: The clustering algorithm should produce valid and reliable results that are
consistent with the underlying data and problem domain.
6. What are the different distance measures used in clustering?
There are several distance measures that can be used in clustering algorithms,
depending on the type of data being analyzed and the specific needs of the application.
Some commonly used distance measures in clustering include:
Euclidean distance: This is the most commonly used distance measure in clustering. It
calculates the straight-line distance between two points in n-dimensional space.
Euclidean distance is suitable for continuous data, such as temperature or height.
Cosine similarity: This similarity measure (often converted to a distance as 1 - cosine
similarity) is commonly used in text mining and natural language processing. It measures
the cosine of the angle between two vectors in a high-dimensional space. Cosine similarity
is suitable for text data, where the frequency of occurrence of words is used to represent
the documents.
Hamming distance: This distance measure is used for binary data, where each feature
can only take on two values (0 or 1). It calculates the number of positions at which the
two vectors differ.
Jaccard similarity: This distance measure is used for categorical data, where each
feature can take on a finite number of discrete values. It measures the similarity
between two sets of data by dividing the number of elements that they have in common
by the total number of elements across both sets.
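The following sketch gives minimal NumPy/Python versions of these measures, written directly from their standard definitions; the sample vectors and sets are made up for illustration.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two numeric vectors.
    return float(np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (1 = same direction).
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming(a, b):
    # Number of positions at which two equal-length vectors differ.
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def jaccard(set_a, set_b):
    # |intersection| / |union| of two sets of categorical values.
    set_a, set_b = set(set_a), set(set_b)
    return len(set_a & set_b) / len(set_a | set_b)

print(euclidean([2, 10], [5, 8]))                 # ~3.61
print(cosine_similarity([1, 0, 1], [1, 1, 0]))    # 0.5
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))        # 2
print(jaccard({"red", "suv"}, {"red", "sedan"}))  # ~0.33
```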
DBSCAN is a density-based clustering algorithm that groups together points that are
closely packed together, while also identifying noise points that are outside of any
cluster. It consists of the following steps:
1. Select an unvisited point and find all the points within a specified distance (called
epsilon) of that point.
2. If there are at least a minimum number of points (MinPts) within epsilon, create a new
cluster and add all those points to it; otherwise, mark the point as noise.
3. For each point in the cluster, find all the points within epsilon distance of that point
and add them to the cluster.
4. Repeat step 3 until there are no more points that can be added to the cluster.
5. Pick the next unvisited point and repeat the process until every point has been visited.
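A small sketch of these steps using scikit-learn's DBSCAN on made-up 2-D points; the eps and min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [4.0, 15.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # one label per point; -1 marks the noise point
```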
BIRCH is a hierarchical clustering algorithm that incrementally builds a CF (Clustering
Feature) tree. It consists of the following steps:
1. Read in the data and insert each point into the CF tree, creating initial subclusters; the
tree is controlled by a threshold value on the subcluster radius and a branching factor
that limits the number of children per node.
2. If the tree grows too large, condense it by rebuilding it with a larger threshold.
3. Merge the resulting subclusters based on their proximity to each other, using the
centroid and radius values.
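A sketch using scikit-learn's Birch estimator, whose threshold and branching_factor parameters correspond to the CF-tree parameters described above; the two-group data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch

# Two made-up groups of 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([10, 10], 1.0, size=(50, 2))])

# threshold: maximum radius of a leaf subcluster; branching_factor: maximum
# number of CF entries per node in the CF tree.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(model.labels_[:5], model.labels_[-5:])  # the two groups receive different labels
```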
K-means is a centroid-based clustering algorithm that partitions the data into k clusters
based on the distance between points and the centroid of each cluster. It consists of the
following steps:
1. Select k initial centroids, for example k points chosen at random from the data.
2. Assign each point to the cluster whose centroid is closest to it.
3. Recalculate the centroid of each cluster based on the points assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer move or a maximum number of
iterations is reached.
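A compact from-scratch sketch of this loop (NumPy, toy-level; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)
print(kmeans(X, k=3))
```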
The k-NN classification algorithm works by assigning a new data point to the class that is
most common among its k nearest neighbors in the training dataset, where k is a
positive integer value. Here is a step-by-step explanation of how the k-NN algorithm
works for classification:
Select the value of k: Choose a positive integer value for k, which represents the number
of nearest neighbors to consider when classifying a new data point.
Calculate distance: Calculate the distance between the new data point and all the
training data points using a distance metric such as Euclidean distance, Manhattan
distance, or Minkowski distance.
Find k-nearest neighbors: Sort the distances in ascending order and select the k nearest
neighbors to the new data point.
Assign class label: Count the number of data points in each class among the k nearest
neighbors and assign the new data point to the class with the highest count.
Return the predicted class: the majority label from the previous step is returned as the
prediction for the new point. After classifying all new data points, the model can be
evaluated by comparing the predicted class labels to the actual class labels in the test
dataset.
The k-NN algorithm can also be used for regression by predicting the numerical value of
a new data point based on the average of the k nearest neighbors' values.
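These steps can be written out directly; the following is a minimal k-NN classification sketch with a made-up toy training set.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    X_train = np.asarray(X_train, float)
    # Distance from the new point to every training point (Euclidean).
    dists = np.linalg.norm(X_train - np.asarray(x_new, float), axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' class labels.
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X_train, y_train, [2, 2], k=3))  # -> "A"
```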
Explain the metrics used for classifier accuracy and error measures.
Metrics used for classifier accuracy and error measures are important in evaluating the
performance of classification models. Here are some common metrics:
1. Confusion Matrix: A table that shows the number of true positives, false positives,
true negatives, and false negatives in a classification model.
2. Accuracy: The proportion of correctly classified instances out of the total number of
instances. Accuracy = (TP + TN) / (TP + TN + FP + FN).
3. Precision: The proportion of true positives out of the total number of predicted
positives. Precision = TP / (TP + FP).
4. Recall (Sensitivity): The proportion of true positives out of the total number of actual
positives. Recall = TP / (TP + FN).
5. Specificity: The proportion of true negatives out of the total number of actual
negatives. Specificity = TN / (TN + FP).
6. F1 Score: The harmonic mean of precision and recall, where the F1 score reaches its
best value at 1 and worst at 0. F1 score = 2 * (precision * recall) / (precision + recall).
7. ROC Curve: A plot of the true positive rate (recall) against the false positive rate at
different classification thresholds.
8. AUC: Area under the ROC curve. It is used to compare different binary classifiers, and
a higher AUC indicates better performance.
9. Error rate: The proportion of misclassified instances out of the total number of
instances. Error rate = (FP + FN) / (TP + TN + FP + FN).
10. Misclassification cost: The cost associated with each type of error (false positive and
false negative). It is used when the cost of misclassification is not the same for all errors.
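A short sketch computing the formula-based metrics above from confusion-matrix counts; the counts themselves are made-up numbers used only for illustration.

```python
# Made-up confusion-matrix counts for a binary classifier.
TP, FP, TN, FN = 40, 10, 45, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)              # sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + TN + FP + FN)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
print(f"specificity={specificity:.2f}  f1={f1:.2f}  error_rate={error_rate:.2f}")
```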
There are several ways to classify clustering methods. One possible classification is:
- Density-based clustering focuses on areas of higher density in the data space, defining
clusters as regions of points that are more closely packed together than the surrounding
areas. DBSCAN and OPTICS are examples of this type of clustering.
- Model-based clustering assumes that the data points are generated from a statistical
model, such as a mixture of Gaussian distributions, and seeks to estimate the
parameters of the model that best fit the data. Expectation-maximization (EM) is an
example of this type of clustering.
- Fuzzy clustering assigns a degree of membership to each point for each cluster,
indicating the degree to which the point belongs to the cluster. Fuzzy C-means is an
example of this type of clustering.
- Crisp clustering assigns each point to a single cluster with a binary membership,
indicating whether the point belongs to the cluster or not.
- Cluster validity-based clustering evaluates the quality of the clustering by using internal
criteria, such as the compactness and separation of the clusters or the stability of the
clustering algorithm. The silhouette coefficient and the Dunn index are examples of such
internal validity measures.
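As an illustration of the model-based and validity-based ideas, the sketch below fits a Gaussian mixture with EM (scikit-learn) to made-up two-group data and scores the result with the silhouette coefficient; the data and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Two made-up Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([6, 6], 1.0, size=(100, 2))])

# Model-based clustering: fit a 2-component Gaussian mixture with EM.
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Internal validity criterion: silhouette coefficient (closer to 1 is better).
print("silhouette:", silhouette_score(X, labels))
```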
Consider the following data points.
A1 (2, 10), A2 (2, 5), A3 (8, 4)
B1 (5, 8), B2 (7, 5), B3 (6, 4)
C1 (1, 2), C2 (4, 9)
Considering A1, B1 and C1 as initial centroids, use the K-means algorithm to find the
three cluster centers after the first round of execution.
To apply the K-means algorithm, we start from the given initial centroids A1 (2, 10),
B1 (5, 8) and C1 (1, 2).
Assignment step: for each point, we calculate its Euclidean distance to each centroid and
assign it to the closest centroid. This gives:
- Cluster 1 (centroid A1): {A1}
- Cluster 2 (centroid B1): {A3, B1, B2, B3, C2}
- Cluster 3 (centroid C1): {A2, C1}
Update step: for each cluster, we recalculate the centroid by taking the mean of all the
points in the cluster:
- Cluster 1: (2, 10)
- Cluster 2: ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
- Cluster 3: ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
Therefore, after the first round of execution, the three cluster centers are (2, 10), (6, 6)
and (1.5, 3.5).
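The first iteration can be checked with a short NumPy sketch (only the assignment and centroid-update computations, not a full K-means implementation):

```python
import numpy as np

X = np.array([[2, 10], [2, 5], [8, 4],   # A1, A2, A3
              [5, 8], [7, 5], [6, 4],    # B1, B2, B3
              [1, 2], [4, 9]], float)    # C1, C2
centroids = np.array([[2, 10], [5, 8], [1, 2]], float)  # A1, B1, C1

# Assignment step: Euclidean distance to each centroid, pick the nearest.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: new centroid = mean of the points in each cluster.
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
print(labels)         # [0 2 1 1 1 1 2 1]
print(new_centroids)  # [[2. 10.], [6. 6.], [1.5 3.5]]
```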
Given the data set BUYS_COMPUTER, find the information gain of the INCOME, STUDENT
and CREDIT_RATING attributes.
13. The following table consists of training data from an employee database. Given a data
tuple having the values "systems", "26...30" and "46–50K" for the attributes department,
age and salary, respectively, what would a naïve Bayesian classification of the status for
the tuple be?
1. INCOME:
2. STUDENT: