DMV
LAB FILE
PRACTICAL 1
Objective – Data Visualization using Scatter Plot, Pie Plot, Bar Plot and Histogram
Description:
Data visualization is a crucial aspect of data analysis, enabling analysts to communicate insights effectively and
intuitively. In this practical session, we will cover four fundamental types of data visualization: scatter plots, pie
plots, bar plots, and histograms.
Implementation
Participants will perform experiments using sample datasets to create each type of plot and interpret the
visualizations to extract meaningful insights.
1. Scatter Plot
Load a sample dataset containing two numerical variables, such as 'x' and 'y'.
Create a scatter plot using the `matplotlib` or `seaborn` library.
Analyze the relationship between the variables based on the scatter plot.
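A minimal sketch of this step, using a small synthetic dataset (the columns 'x' and 'y' and the random data are assumptions for illustration):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample data: 'y' is roughly linear in 'x' with some noise
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2 * df['x'] + rng.normal(scale=0.5, size=100)

sns.scatterplot(data=df, x='x', y='y')   # or plt.scatter(df['x'], df['y'])
plt.title('Scatter plot of y versus x')
plt.show()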
2. Pie Plot
Load a sample dataset containing categorical data.
Calculate the frequency or proportion of each category.
Create a pie plot using `matplotlib` or `seaborn` to visualize the composition of the categorical variable.
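A minimal sketch of the pie-plot step (the category labels and values below are made up for illustration; seaborn has no dedicated pie function, so matplotlib is used):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical data
categories = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A'])
counts = categories.value_counts()          # frequency of each category

plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.title('Composition of the categorical variable')
plt.show()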
3. Bar Plot
Load a sample dataset containing a categorical variable and an associated numerical value or count.
Create a bar plot using `matplotlib` or `seaborn` to compare values across the categories.
Analyze the differences between categories based on the bar plot.
4. Histogram
Load a sample dataset with a continuous numerical variable.
Create a histogram using `matplotlib` or `seaborn` to visualize the distribution of the variable.
Analyze the shape, central tendency, and spread of the distribution based on the histogram.
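A minimal sketch of the histogram step (the simulated exam-score data is an assumption for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical continuous variable: simulated exam scores
scores = np.random.default_rng(1).normal(loc=70, scale=10, size=500)

plt.hist(scores, bins=20, edgecolor='black')   # sns.histplot(scores, bins=20) also works
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Distribution of scores')
plt.show()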
Viva Questions
1. What is the primary purpose of using a scatter plot?
Answer: A scatter plot is used to visualize the relationship between two numerical variables, helping to
identify correlations, trends, and outliers.
2. How does a pie plot differ from a bar plot?
Answer: Pie plots display the composition of a categorical variable, while bar plots compare values
across different categories.
3. What does each bar in a bar plot represent?
Answer: Each bar represents a single category, with its height (or length) proportional to the value or
count for that category.
4. What can the shape of a histogram tell us about the data?
Answer: The shape of a histogram provides insights into the distribution of data, including whether it
is symmetric, skewed, or multimodal.
5. Can you explain the difference between a histogram and a density plot?
Answer: Histograms represent the frequency or count of data points within predefined intervals, while
density plots represent the probability density function of the data.
Multiple Choice Questions (MCQs):
2. Which plot is most suitable for visualizing the distribution of a continuous numerical variable?
a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
Answer: d) Histogram
3. Which type of data is best represented by a pie plot?
a) Continuous data
b) Categorical data
c) Time series data
d) Ordinal data
Answer: b) Categorical data
4. Which plot is most commonly used to compare values across different categories?
a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
Answer: c) Bar Plot
5. Which plot is most suitable for identifying the relationship between two numerical variables?
a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
Answer: a) Scatter Plot
PRACTICAL 2
Objective – Understanding Statistical Descriptions: Measures of Central Tendency (Mean,
Median, Mode) and Measures of Variability (Variance, Standard Deviation)
Description
1. Introduction to descriptive statistics -
Descriptive statistics are numbers that are used to describe and summarize the data. They are used to describe
the basic features of the data under consideration. They provide simple summary measures which give an
overview of the dataset. Summary measures that are commonly used to describe a data set are measures of
central tendency and measures of variability or dispersion.
Measures of central tendency include the mean, median and mode. These measures summarize a given data set
by providing a single representative value that describes the center position of its distribution. We analyze the
frequency of each data point in the distribution and describe it using the mean, median or mode. These measures
can represent either the entire population or a sample of the population.
Measures of variability or dispersion include the variance, standard deviation, coefficient of variation,
minimum and maximum values, IQR (interquartile range), skewness and kurtosis. These measures help us to
analyze how spread out the distribution of a dataset is, and so describe the shape of the data set.
Mean -
The most common measure of central tendency is the mean.
Mean is also known as the simple average.
It is denoted by the Greek letter µ for a population and by x̄ for a sample.
We can find the mean by adding all the elements in a dataset and then dividing by the number of
elements in the dataset.
It is the most common measure of central tendency but it has a drawback.
The mean is affected by the presence of outliers.
So, mean alone is not enough for making business decisions.
Variance -
Variance measures the dispersion of a set of data points around their mean value.
It is the mean of the squares of the individual deviations.
Variance gives results in the original units squared.
Standard deviation -
Standard deviation is the most commonly used measure of variability.
It is the square-root of the variance.
For Normally distributed data, approximately 95% of the values lie within 2 s.d. of the mean.
Standard deviation gives results in the original units.
Procedure
1. Load the Dataset:
2. Calculate Mean:
Compute the arithmetic mean (average) of each numerical variable in the dataset.
3. Calculate Median:
Determine the median, which is the middle value of a dataset when arranged in ascending order.
4. Calculate Mode:
Identify the mode, which is the value that appears most frequently in the dataset.
5. Calculate Standard Deviation and Variance:
Compute the standard deviation, a measure of the dispersion of values around the mean.
Calculate the variance, which is the square of the standard deviation.
6. Interpretation:
Interpret the computed statistics to understand the central tendency (mean, median, mode) and
variability (standard deviation, variance) of the dataset.
Analyze the spread and distribution of data around the central values.
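A minimal sketch of these steps with pandas (the small 'values' column is made-up data for illustration):

import pandas as pd

# Hypothetical numerical data
df = pd.DataFrame({'values': [12, 15, 15, 18, 20, 22, 22, 22, 25, 90]})

print("Mean:", df['values'].mean())
print("Median:", df['values'].median())
print("Mode:", df['values'].mode().tolist())      # mode() may return several values
print("Variance:", df['values'].var())            # sample variance (ddof=1 by default)
print("Standard deviation:", df['values'].std())  # square root of the variance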
Viva Questions-
1. What is the difference between mean, median, and mode?
Answer: The mean is the average value of the dataset, the median is the middle value when the data is
ordered, and the mode is the value that appears most frequently.
2. When is the median a better measure of central tendency than the mean?
Answer: The median is preferred when the dataset contains outliers or is skewed, as it is less affected
by extreme values.
3. What does the standard deviation tell us about a dataset?
Answer: Standard deviation quantifies the dispersion of data points around the mean, with higher
values indicating greater variability.
4. What does a high variance indicate about the data?
Answer: High variance indicates that data points are spread out widely from the mean, suggesting
greater variability in the dataset.
5. Can a dataset have more than one mode?
Answer: Yes, a dataset can have multiple modes if two or more values occur with the same highest
frequency.
Multiple Choice Questions (MCQs):
1. Which measure of central tendency is affected by outliers?
a. Mean
b. Median
c. Mode
Answer: a) Mean
2. Which measure of central tendency represents the most frequently occurring value in a dataset?
a. Mean
b. Median
c. Mode
Answer: c) Mode
a) Central tendency
b) Variability
c) Spread of data
PRACTICAL 3
Objective – Data Cleaning: Treating Missing Values and Outliers
Description
Data cleaning is a critical step in the data preprocessing pipeline, aimed at improving the quality and integrity of
the dataset. Missing values and outliers can adversely affect the results of data analysis and modeling, making it
essential to address them effectively. In this practical session, we will focus on techniques for treating missing
values and outliers.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and removing
any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is accurate,
consistent, and free of errors, as incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time in this step because of the
belief that “Better data beats fancier algorithms”.
Procedure
1. Identify Missing Values:
Load the dataset and identify missing values in each column.
Determine the extent of missingness and assess its impact on the dataset.
2. Treat Missing Values:
Impute missing values using appropriate techniques such as mean, median, mode imputation, or
predictive modeling.
Alternatively, remove rows or columns with a high proportion of missing values if they cannot be
imputed accurately.
3. Detect and Treat Outliers:
Detect outliers using visual methods such as boxplots or statistical rules such as z-scores and the IQR rule.
Apply techniques such as trimming, winsorization, or transforming the data to mitigate the impact of
outliers.
Alternatively, remove outliers if they are deemed to be erroneous or irrelevant to the analysis.
4. Validate the Cleaned Dataset:
Assess the impact of data cleaning techniques on the quality and integrity of the dataset.
Validate the effectiveness of the treatment of missing values and outliers in improving data reliability.
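A minimal sketch of this procedure with pandas (the column names 'age' and 'income' and the values are made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical dataset with a missing value and an outlier
df = pd.DataFrame({'age': [25, 30, np.nan, 40, 35],
                   'income': [30000, 32000, 31000, 500000, 29000]})

# 1-2. Identify and impute missing values (median imputation here)
print(df.isnull().sum())
df['age'] = df['age'].fillna(df['age'].median())

# 3. Detect outliers with the IQR rule and remove them
Q1, Q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
IQR = Q3 - Q1
mask = df['income'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
cleaned = df[mask]

# 4. Validate the cleaned dataset
print(cleaned.describe())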
Viva Questions -
1. Why is it important to handle missing values in a dataset?
Answer: Missing values can affect the accuracy and reliability of data analysis and modeling results,
leading to biased conclusions and erroneous insights.
2. What are the common techniques for handling missing values?
Answer: Common techniques include mean, median, mode imputation, predictive modeling, or
removing rows/columns with missing values.
3. Why are outliers a problem in data analysis?
Answer: Outliers can skew statistical measures such as mean and standard deviation, leading to
inaccurate estimates and misleading interpretations of the data.
4. What is the difference between trimming and winsorization?
Answer: Trimming involves removing extreme values from the dataset, while winsorization replaces
extreme values with less extreme values to reduce their impact.
Multiple Choice Questions (MCQs):
1. Which of the following techniques can be used to impute missing values?
a) Mean imputation
b) Median imputation
c) Mode imputation
d) All of the Above
Answer: d) All of the above
2. Which plot is commonly used to detect outliers in a dataset?
a) Scatter plot
b) Boxplot
c) Histogram
Answer: b) Boxplot
PRACTICAL 4
Objective – Data Discretization and Data Transformation
Description
Data Discretization -
Data discretization is the process of converting continuous data into discrete or categorical form. In other words,
it involves partitioning a continuous attribute into a finite number of intervals or bins. This is particularly useful
when dealing with numerical data that has a large range or when working with algorithms that require
categorical inputs.
Data discretization simplifies the data representation, reduces noise, and makes it more understandable for
certain algorithms, especially those that work well with categorical data or have assumptions about the data
distribution.
Data Transformation -
Data transformation involves modifying the original dataset to make it more suitable for analysis or modeling.
This process includes a variety of operations aimed at improving data quality, reducing noise, and preparing the
data for specific algorithms or analyses.
Data transformation is essential for preparing the data for analysis and modeling, improving the performance of
machine learning algorithms, and ensuring that the underlying assumptions of the models are met.
Procedure –
# Smoothing by bin means: equal-frequency binning of a list of numbers
import math

# Read space-separated data values; the number of bins was an undefined
# variable `bi` in the original listing, so here it is also read from input.
x = list(map(float, input("Enter data values: ").split()))
bi = int(input("Enter number of bins: "))

x_old = dict(enumerate(x))                        # index -> original value
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# Partition the values into consecutive bins and compute each bin's mean
# (data is usually sorted before equal-frequency binning)
binn = []
for start in range(0, len(x), num_of_data_in_each_bin):
    chunk = x[start:start + num_of_data_in_each_bin]
    binn.append(sum(chunk) / len(chunk))

# Performing binning: replace every value with the mean of its bin
x_new = {}
for i in range(len(x)):
    x_new[i] = binn[i // num_of_data_in_each_bin]

print(num_of_data_in_each_bin)
for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))
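For comparison, pandas provides built-in helpers for equal-width and equal-frequency binning; a brief sketch with made-up values:

import pandas as pd

values = pd.Series([2, 5, 7, 9, 13, 21, 25, 30, 34, 55])

# Equal-width binning: bins of equal range
equal_width = pd.cut(values, bins=3, labels=['low', 'medium', 'high'])

# Equal-frequency binning: each bin holds roughly the same number of points
equal_freq = pd.qcut(values, q=3, labels=['low', 'medium', 'high'])

print(pd.DataFrame({'value': values, 'equal_width': equal_width, 'equal_freq': equal_freq}))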
Multiple Choice Questions (MCQs):
1. Which transformation is commonly used to reduce the skewness of a distribution?
a) Standardization
b) Normalization
c) Log Transformation
d) Min-Max Scaling
Answer: c) Log Transformation
2. Which of the following discretization methods forms bins by clustering the data values?
a) Equal-width binning
b) Equal-frequency binning
c) k-means clustering
d) Decision tree induction
Answer: c) k-means clustering
3. Which technique converts a categorical variable into multiple binary (0/1) columns?
a) One-Hot Encoding
b) Label Encoding
c) Feature Scaling
d) PCA (Principal Component Analysis)
Answer: a) One-Hot Encoding
Viva Questions
1. What is data transformation?
Answer: Data transformation involves altering the original dataset to make it more suitable for
analysis or modeling. This process can include various operations such as normalization,
standardization, encoding categorical variables, handling missing values, and transforming skewed
distributions.
2. What are the common techniques used for data transformation, and when would you use each one?
Answer: Common techniques include normalization, standardization, log transformation, and encoding
of categorical variables. The choice of technique depends on the nature of the data and the requirements
of the machine learning algorithm being used.
3. How does log transformation help in dealing with skewed data distributions?
Answer: Log transformation is used to reduce the variability in data with a skewed distribution. It
compresses the range of the data, making it more symmetrical and approximate to a normal
distribution. This can help improve the performance of models that assume normality or require
normally distributed features.
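A small sketch of this effect (the exponentially distributed sample and the use of scipy's skew function are assumptions for illustration):

import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed data
data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

log_data = np.log1p(data)   # log(1 + x) avoids problems with zeros

print("Skewness before:", round(skew(data), 2))
print("Skewness after :", round(skew(log_data), 2))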
4. Why is data discretization important?
Answer: Data discretization is important for converting continuous data into discrete or categorical form. It helps in
simplifying the data representation, reducing noise, and making the data more understandable for certain
algorithms, especially those that work well with categorical data or have assumptions about the data
distribution.
5. What is the difference between equal-width and equal-frequency binning?
Answer: Equal-width binning divides the range of the data into a fixed number of bins of equal
width, while equal-frequency binning divides the data into bins such that each bin contains
approximately the same number of data points.
Equal-width binning may not ensure an equal distribution of data points in each bin, while equal-
frequency binning can handle skewed distributions more effectively.
PRACTICAL 5
Objective – Correlation between variables using a heat map
Description
Correlation analysis is a fundamental technique used in statistics to determine the strength and direction of the
relationship between two variables. In this practical session, we will focus on visualizing the correlation matrix
of numerical variables using a heat map. The heat map provides a graphical representation of the correlation
coefficients, making it easier to identify patterns and dependencies in the data.
Procedure
1. Load the Dataset:
2. Compute the Correlation Matrix:
Compute the correlation matrix using the Pearson correlation coefficient or other suitable methods.
Pearson correlation coefficient measures the linear relationship between two variables, ranging from -1
to 1.
3. Create the Heat Map:
Use a data visualization library such as `seaborn` or `matplotlib` to create a heat map of the correlation
matrix.
Customize the heat map to enhance readability, such as annotating each cell with correlation
coefficients.
4. Interpretation:
Analyze the heat map to identify strong positive correlations (values close to 1), indicating that
variables move in the same direction.
Identify strong negative correlations (values close to -1), indicating that variables move in opposite
directions.
Note variables with weak correlations (values close to 0), indicating little to no linear relationship.
Conduct additional analysis based on the insights gained from the correlation heat map, such as feature
selection, regression modeling, or identifying multicollinearity.
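A minimal sketch of the procedure using seaborn (seaborn's example 'tips' dataset stands in for the sample data):

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load a dataset and keep only its numerical columns
df = sns.load_dataset('tips').select_dtypes(include='number')

# 2. Compute the Pearson correlation matrix
corr = df.corr(method='pearson')

# 3. Plot the heat map with the coefficients annotated in each cell
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heat map')
plt.show()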
Viva Questions
1. What is correlation, and why is it important in data analysis?
Answer: Correlation measures the strength and direction of the relationship between two variables. It helps in
understanding how changes in one variable relate to changes in another variable.
2. What is the range of the correlation coefficient, and how is it interpreted?
Answer: The correlation coefficient ranges from -1 to 1, where values close to 1 indicate a strong
positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0
indicate weak or no correlation.
3. What does a high positive correlation between two variables indicate?
Answer: A high positive correlation indicates that as one variable increases, the other variable also
tends to increase.
4. What is multicollinearity, and why is it a problem in regression analysis?
Answer: Multicollinearity, where independent variables are highly correlated, can lead to inflated
standard errors and unstable coefficients in regression analysis, making it difficult to interpret the
effects of individual predictors.
Multiple Choice Questions (MCQs):
1. What is the range of the correlation coefficient?
a) Any specified range
b) 0 to 1
c) -1 to 1
Answer: c) -1 to 1
2. If one variable increases while the other decreases, what type of correlation do they have?
a) Positive correlation
b) Negative correlation
c) None
Answer: b) Negative correlation
3. Which plot is commonly used to visualize a correlation matrix?
a) Scatter plot
b) Violin plot
c) Distribution plot
d) Heat map
Answer: d) Heat map
4. Does correlation imply causation?
a) Yes
b) No
Answer: b) No
PRACTICAL – 6
Objective – Implementation of Apriori Algorithm
Apriori Algorithm is a Machine Learning algorithm which is used to gain insight into the structured
relationships between different items involved. The most prominent practical application of the algorithm is to
recommend products based on the products already present in the user’s cart. Walmart, in particular, has made
great use of the algorithm in suggesting products to its users.
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Loading the transactional data (assumed here to be the UCI "Online Retail"
# spreadsheet; the original listing only changed the working directory at this point)
data = pd.read_excel('Online_Retail.xlsx')

# Encode quantities as 0/1 flags so apriori can work on the basket tables
def hot_encode(x):
    return 1 if x >= 1 else 0

# Build one invoice-by-item basket per country, e.g. for France
basket_France = (data[data['Country'] == 'France']
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().fillna(0))
basket_France = basket_France.applymap(hot_encode)
# basket_UK, basket_Por and basket_Sweden are encoded the same way

# Building the model
frq_items = apriori(basket_France, min_support=0.05, use_colnames=True)

# Collecting the inferred rules and sorting them by confidence and lift
rules = association_rules(frq_items, metric='lift', min_threshold=1)
print(rules.sort_values(['confidence', 'lift'], ascending=[False, False]).head())
Viva Questions
1. What is the Apriori algorithm?
Answer: The Apriori algorithm is a classic association rule mining algorithm used to discover
frequent itemsets in transactional databases. It is based on the principle of Apriori property, which
states that any subset of a frequent itemset must also be frequent.
2. What are association rules?
Answer: Association rules are rules that describe relationships between items in transactional datasets.
They consist of an antecedent (left-hand side) and a consequent (right-hand side), separated by a '->'
symbol, indicating a conditional implication.
3. How are association rules evaluated?
Answer: Association rules are evaluated based on metrics such as support, confidence, and lift.
Support measures the frequency of occurrence of an itemset, confidence measures the conditional
probability of the consequent given the antecedent, and lift measures the strength of association
between the antecedent and the consequent.
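A tiny worked example of these metrics for the rule {bread} -> {milk} (the five transactions are made up for illustration):

# 5 hypothetical transactions
transactions = [{'bread', 'milk'}, {'bread', 'butter'}, {'milk', 'butter'},
                {'bread', 'milk', 'butter'}, {'bread', 'milk'}]
n = len(transactions)

support_bread_milk = sum({'bread', 'milk'} <= t for t in transactions) / n   # 3/5 = 0.6
support_bread = sum('bread' in t for t in transactions) / n                  # 4/5 = 0.8
support_milk = sum('milk' in t for t in transactions) / n                    # 4/5 = 0.8

confidence = support_bread_milk / support_bread   # P(milk | bread) = 0.75
lift = confidence / support_milk                  # 0.75 / 0.8 = 0.9375
print(support_bread_milk, confidence, lift)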
Multiple Choice Questions (MCQs):
1. The Apriori algorithm is primarily used for which of the following tasks?
a) Regression analysis
b) Classification
c) Frequent itemset mining
d) Clustering
Answer: c) Frequent itemset mining
PRACTICAL – 7
Objective – Implementation of Naive Bayes Classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
print("Gaussian Naive Bayes accuracy:", accuracy_score(y_test, y_pred))
Viva Questions -
1. What is the Naive Bayes classifier?
Answer: The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes'
theorem, which assumes independence between features. It is commonly used for classification tasks,
especially in text classification and spam filtering.
2. What is the "naive" assumption in the Naive Bayes classifier?
Answer: The Naive Bayes classifier assumes that the features are conditionally independent given the
class label. This means that the presence of a particular feature in a class is independent of the
presence of other features.
3. How does the Naive Bayes classifier make predictions?
Answer: The Naive Bayes classifier calculates the probability of each class label given the input
features using Bayes' theorem and selects the class label with the highest posterior probability as the
predicted class.
4. What are the main variants of the Naive Bayes classifier?
Answer: Gaussian Naive Bayes: Assumes continuous features follow a normal distribution.
Multinomial Naive Bayes: Suitable for discrete features (e.g., word counts in text
classification).
Bernoulli Naive Bayes: Assumes features are binary (e.g., presence or absence of a
feature).
5. How does the choice of Naive Bayes variant affect classification performance?
Answer: The choice of Naive Bayes variant depends on the nature of the features and the underlying
distribution of the data. Gaussian Naive Bayes is suitable for continuous features, while Multinomial
and Bernoulli Naive Bayes are suitable for discrete features.
Multiple Choice Questions (MCQs):
2. What type of distribution does Gaussian Naive Bayes assume for features?
a) Normal distribution
b) Uniform distribution
c) Binomial distribution
d) Poisson distribution
Answer: a) Normal distribution
4. What does the Bernoulli Naive Bayes classifier assume about features?
PRACTICAL 8
Objective – Implementation of K-Nearest-Neighbour Algorithm
This algorithm is used to solve classification problems. The K-nearest neighbor or K-NN algorithm
essentially creates an imaginary boundary to classify the data. When a new data point comes in, the algorithm
predicts its class based on which side of that boundary it is nearest to.
Therefore, a larger k value means smoother curves of separation, resulting in less complex models, whereas a
smaller k value tends to overfit the data, resulting in more complex models.
In the example shown below, the following steps are performed:
1. The k-nearest neighbor algorithm is imported from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using neighbors value.
5. Train or fit the data into the model.
6. Predict the class labels for the test data.
# Loading data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

irisData = load_iris()
X, y = irisData.data, irisData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate k-NN accuracy for different numbers of neighbors
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label = 'Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()
Viva Questions
1. What is the k-nearest neighbor (KNN) classifier?
Answer: The k-nearest neighbor (KNN) classifier is a supervised machine learning algorithm that
classifies new data points based on the majority class of their k nearest neighbors in the feature space.
2. How does the KNN algorithm classify new data points?
Answer: To classify a new data point, the KNN algorithm finds its k nearest neighbors in the feature
space based on a distance metric (e.g., Euclidean distance) and assigns the majority class among these
neighbors to the new data point.
3. What are the key hyperparameters of the KNN algorithm?
Answer: The main hyperparameters are the number of neighbors (k), the distance metric (e.g., Euclidean
or Manhattan distance), and the weighting scheme applied to the neighbors.
4. How does the choice of k affect the bias-variance trade-off?
Answer: The choice of the number of neighbors (k) affects the bias-variance trade-off of the KNN
classifier. Smaller values of k result in low bias but high variance, leading to overfitting, while larger
values of k result in high bias but low variance, leading to underfitting.
5. What are the advantages and disadvantages of the KNN algorithm?
Answer: Advantages of the KNN algorithm include simplicity, effectiveness for multi-class
classification, and ability to handle non-linear decision boundaries. Disadvantages include sensitivity
to the choice of k, computational inefficiency for large datasets, and susceptibility to irrelevant
features.
PRACTICAL – 9
Objective – Implementation of K-means Clustering Algorithm
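A minimal sketch of one way to implement this objective with scikit-learn (the synthetic blob data and K = 3 are assumptions for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-means with K = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', s=100, c='red', label='centroids')
plt.legend()
plt.show()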
Viva Questions -
1. What is K-means clustering?
Answer: K-means clustering is a partitioning method that divides a dataset into 'K' distinct, non-
overlapping clusters. It aims to minimize the within-cluster variance by iteratively assigning data
points to the nearest cluster centroid and updating the centroids based on the mean of data points
assigned to each cluster.
2. What are the key steps involved in the K-means clustering algorithm?
Answer: Initialization: Choose 'K' initial centroids, typically at random.
Assignment: Assign each data point to the nearest centroid to form 'K' clusters.
Update centroids: Recalculate the centroid of each cluster based on the mean of data points
assigned to that cluster.
Repeat: Iterate the assignment and centroid update steps until convergence, where centroids no
longer change significantly
3. How do outliers affect K-means clustering?
Answer: K-means clustering is sensitive to outliers, as they can disproportionately affect the position
of cluster centroids. Outliers may form separate clusters or be assigned to the nearest cluster, leading
to suboptimal clustering results.
Multiple Choice Questions (MCQs):
4. How are the initial cluster centroids typically chosen in the standard K-means algorithm?
a) Randomly
b) At the center of the dataset
c) At the mean of each feature
d) At the maximum distance from each other
Answer: a) Randomly
5. Which method can be used to determine the optimal number of clusters in K-means?
a) Elbow method
b) Silhouette score
c) Gap statistics
d) All of the above
Answer: d) All of the above
PRACTICAL – 10
Objective – Implementation of Hierarchical Clustering
Description
A hierarchical clustering approach determines successive clusters based on previously defined clusters. It is a
technique aimed at grouping data into a tree of clusters called a dendrogram, which graphically represents the
hierarchical relationship between the underlying clusters.
The agglomerative (bottom-up) approach proceeds as follows:
1. Compute the distance matrix containing the distance between each pair of data points using a particular
distance metric such as Euclidean distance, Manhattan distance, or cosine similarity; the default distance
metric is usually the Euclidean one.
2. Merge the two closest clusters (initially, each data point is its own cluster).
3. Update the distance matrix to reflect the distances between the newly formed cluster and the remaining clusters.
4. Repeat steps 2 and 3 until all the clusters are merged together to create a single cluster.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram

# Load the dataset and inspect it
loan_data = pd.read_csv("loan_data.csv")
loan_data.head()
loan_data.info()

# Percentage of missing values in each column
percent_missing = round(100*(loan_data.isnull().sum())/len(loan_data), 2)
print(percent_missing)

# Keep the numerical columns and drop rows with missing values (one simple cleaning choice)
cleaned_data = loan_data.select_dtypes(include='number').dropna()
cleaned_data.info()

def show_boxplot(df):
    plt.rcParams['figure.figsize'] = [14, 6]
    sns.boxplot(data=df, orient='v')
    plt.title("Outlier distribution of the attributes")
    plt.show()

show_boxplot(cleaned_data)

def remove_outliers(data):
    df = data.copy()
    for col in df.columns:
        Q1 = df[str(col)].quantile(0.05)
        Q3 = df[str(col)].quantile(0.95)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5*IQR
        upper_bound = Q3 + 1.5*IQR
        df = df[(df[str(col)] >= lower_bound) & (df[str(col)] <= upper_bound)]
    return df

without_outliers = remove_outliers(cleaned_data)
without_outliers.shape

# Standardize the features before clustering
data_scaler = StandardScaler()
scaled_data = data_scaler.fit_transform(without_outliers)
scaled_data.shape

# Agglomerative clustering with different linkage criteria
complete_clustering = linkage(scaled_data, method="complete", metric="euclidean")
average_clustering = linkage(scaled_data, method="average", metric="euclidean")
single_clustering = linkage(scaled_data, method="single", metric="euclidean")

dendrogram(complete_clustering)
plt.show()
dendrogram(average_clustering)
plt.show()
dendrogram(single_clustering)
plt.show()
Viva Questions:
1. What are hierarchical clustering methods?
Answer: Hierarchical clustering methods are algorithms that build a hierarchy of clusters by
recursively dividing or merging data points based on their similarity or dissimilarity.
2. What are the two main approaches to hierarchical clustering?
Answer: The two main approaches to hierarchical clustering are agglomerative (bottom-up) and
divisive (top-down) clustering.
3. Which distance metrics are commonly used in hierarchical clustering?
Answer: Common distance metrics include Euclidean distance, Manhattan distance, and cosine
similarity, depending on the nature of the data and the problem domain.
4. How are clusters represented in a dendrogram?
Answer: Clusters are represented as branches in a dendrogram, with the height of each branch
indicating the distance or dissimilarity between clusters at each step of the clustering process.
Multiple Choice Questions (MCQs):
2. Which hierarchical clustering approach starts with each data point as its own cluster and repeatedly merges
the closest clusters?
a) Divisive clustering
b) Agglomerative clustering
c) K-means clustering
d) DBSCAN clustering
Answer: b) Agglomerative clustering
3. What is the linkage criterion used to determine the distance between clusters in hierarchical clustering?
a) Centroid linkage
b) Complete linkage
c) Single linkage
d) Ward's linkage
Answer: All of the above
PRACTICAL - 11
Objective – Density-Based Clustering Methods (DBSCAN)
Clustering analysis or simply Clustering is basically an Unsupervised learning method that divides the data
points into a number of specific batches or groups, such that the data points in the same groups have similar
properties and data points in different groups have different properties in some sense. It comprises many
different methods that differ in how they measure similarity and form the groups.
E.g. K-Means (distance between points), Affinity propagation (graph distance), Mean-shift (distance between
points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers),
Spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities and then we use
it to cluster the data points into groups or batches. Here we will focus on the Density-based spatial clustering
of applications with noise (DBSCAN) clustering method.
1. Find all the neighbor points within eps and identify the core points or visited with more than MinPts
neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c which has a sufficient number of
points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if
b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, this implies
that b is a neighbor of a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster
are noise.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Load data in X: three Gaussian blobs (example centers)
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers,
                  cluster_std=0.50, random_state=0)

# Fit DBSCAN: eps is the neighbourhood radius, min_samples is the MinPts value
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_        # cluster label for every point; -1 marks noise

# Plot result
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
print(colors)
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = 'k'          # noise points are shown in black
    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o',
             markerfacecolor=col,
             markeredgecolor='k',
             markersize=6)
plt.title('DBSCAN clustering')
plt.show()
Viva Questions -
1. What are density-based methods in data mining?
Answer: Density-based methods are clustering algorithms that partition the dataset based on the
density of data points in the feature space. These methods identify clusters as regions of high data
density separated by regions of low density.
2. How does DBSCAN identify clusters?
Answer: DBSCAN identifies clusters by grouping together data points that are closely packed and
separated by regions of low density. It defines clusters as areas of sufficiently high density, separated
by areas of low density or noise.
3. What are the two key parameters of DBSCAN?
Answer: Epsilon (eps): The maximum distance between two points for one to be considered a neighbor of
the other.
Minimum points (MinPts): The minimum number of points required to form a dense region (core
point) in the dataset.
4. How does DBSCAN handle outliers or noise?
Answer: DBSCAN identifies outliers or noise as data points that do not belong to any cluster. These
points are not considered as part of any cluster and are labeled as noise.
Multiple Choice Questions (MCQs):
1. Which of the following clustering algorithms explicitly labels noise points (outliers)?
a) K-means
b) DBSCAN
c) Hierarchical clustering
d) Mean-shift
Answer: b) DBSCAN
2. What is the key parameter in DBSCAN that defines the maximum distance between two points for
them to be considered as part of the same cluster?
a) K
b) Epsilon (ε)
c) MinPts
d) Silhouette coefficient
Answer: b) Epsilon (ε)