
Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore

Shri Vaishnav Institute of Information Technology


Department of Computer Science & Engineering

LAB FILE

Subject: Data Mining and Visualization


Course Code: BTDSE613N
III-Year/VI-Semester - Section (B)

Submitted By: Jai Soni (Enrollment Number: 21100BTCSE09852)
Submitted To: Prof. Shyam Sundar Meena
PRACTICAL 1
Objective – Data visualization using scatter plot, pie plot, bar plot, and histogram

Description:
Data visualization is a crucial aspect of data analysis, enabling analysts to communicate insights effectively and
intuitively. In this practical session, we will cover four fundamental types of data visualization:

1. Scatter Plot – Used to visualize the relationship between two variables.

A scatter plot is used to understand how a dependent variable behaves in response to an independent variable. The potential relationship between the two variables is plotted so that it can be analysed. Scatter plots are also used to compare two or more sets of data at a time.

2. Pie Plot – Suitable for displaying the composition of a categorical variable.

A pie chart is circular in shape with slices of different sizes. It is widely used in marketing. The value of each category is shown as a slice of the circle, and different colours are used to separate the categories. The relative area of each slice makes the largest and smallest categories easy to recognise. Pie charts are sometimes presented in 3D form, although this is mainly a presentation choice.

3. Bar Plot – Effective for comparing values across different categories.

Bar charts are used in economics, statistics and marketing to analyse large amounts of data. The X-axis represents the category, while the Y-axis represents the value. The length of each bar indicates the magnitude of the value for its category, making the maximum and minimum values easy to identify.
4. Histogram – Ideal for representing the distribution of a continuous variable.
Histograms are used in statistics, business and economics, where numerical data plays a crucial role. A typical histogram looks like a bar chart. However, a bar chart compares fixed values of a category, while in a histogram each bar represents a range of values, such as ages 25-40. Histograms are generally used to summarise large datasets.

Implementation
Participants will perform experiments using sample datasets to create each type of plot and interpret the
visualizations to extract meaningful insights.

1. Scatter Plot
 Load a sample dataset containing two numerical variables, such as 'x' and 'y'.
 Create a scatter plot using the `matplotlib` or `seaborn` library.
 Analyze the relationship between the variables based on the scatter plot.
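A minimal sketch of this step, using matplotlib and illustrative synthetic data (the variable names and values are assumptions, not a prescribed dataset):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: two numerical variables with a roughly linear relationship
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100)
y = 2 * x + rng.normal(0, 15, 100)

plt.scatter(x, y, alpha=0.7)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot of y vs x')
plt.show()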


2. Pie Plot
 Load a sample dataset containing categorical data.
 Calculate the frequency or proportion of each category.
 Create a pie plot using `matplotlib` or `seaborn` to visualize the composition of the categorical variable.
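A minimal sketch, assuming illustrative category counts rather than a specific dataset:

import matplotlib.pyplot as plt

# Illustrative frequencies of four categories
categories = ['A', 'B', 'C', 'D']
counts = [40, 25, 20, 15]

plt.pie(counts, labels=categories, autopct='%1.1f%%')
plt.title('Composition of a Categorical Variable')
plt.show()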

3. Bar Plot
 Load a sample dataset with categorical and numerical variables.
 Aggregate numerical data based on categories.
 Create a bar plot to compare the aggregated values across categories using `matplotlib` or `seaborn`.
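A minimal sketch using pandas and matplotlib; the DataFrame, column names, and values are illustrative assumptions:

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative dataset with a categorical and a numerical column
df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [10, 14, 7, 9, 12, 16]})

# Aggregate the numerical values per category, then compare them with a bar plot
totals = df.groupby('category')['value'].sum()
plt.bar(totals.index, totals.values)
plt.xlabel('Category')
plt.ylabel('Total value')
plt.title('Bar Plot of Aggregated Values')
plt.show()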

4. Histogram
 Load a sample dataset with a continuous numerical variable.
 Create a histogram using `matplotlib` or `seaborn` to visualize the distribution of the variable.
 Analyze the shape, central tendency, and spread of the distribution based on the histogram.
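A minimal sketch with an illustrative continuous variable (synthetic ages, assumed for demonstration):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative continuous variable
rng = np.random.default_rng(1)
ages = rng.normal(35, 8, 500)

plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of a Continuous Variable')
plt.show()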


Viva Questions
1. What is the primary purpose of using a scatter plot?

 Answer: To visualize the relationship between two variables.

2. How does a pie plot differ from a bar plot?

 Answer: Pie plots display the composition of a categorical variable, while bar plots compare values
across different categories.
3. What does each bar in a bar plot represent?

 Answer: The value of the numerical variable (often an aggregate such as a sum or mean) for each category.

4. How can you interpret the shape of a histogram?

 Answer: The shape of a histogram provides insights into the distribution of data, including whether it
is symmetric, skewed, or multimodal.

5. Can you explain the difference between a histogram and a density plot?

 Answer: Histograms represent the frequency or count of data points within predefined intervals, while
density plots represent the probability density function of the data.


Multiple Choice Questions (MCQs)


1. Which type of plot is suitable for visualizing the distribution of a continuous variable?

a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
 Answer: d) Histogram

2. What type of data is best represented using a pie plot?

a) Continuous data
b) Categorical data
c) Time series data
d) Ordinal data
 Answer: b) Categorical data

3. Which plot is used to compare values across different categories?

a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
 Answer: c) Bar Plot

4. What does each bar in a histogram represent?

a) Frequency of data points


b) Proportion of data points
c) Mean of the dataset
d) Standard deviation of the dataset
 Answer: a) Frequency of data points

5. Which plot is most suitable for identifying the relationship between two numerical variables?

a) Scatter Plot
b) Pie Plot
c) Bar Plot
d) Histogram
 Answer: a) Scatter Plot


PRACTICAL 2
Objective – Understanding Statistical Descriptions: Measures of Central Tendency (Mean, Median, Mode) and
Measures of Variability (Variance, Standard Deviation)

Description
1. Introduction to descriptive statistics -
Descriptive statistics are numbers that are used to describe and summarize the data. They are used to describe
the basic features of the data under consideration. They provide simple summary measures which give an
overview of the dataset. Summary measures that are commonly used to describe a data set are measures of
central tendency and measures of variability or dispersion.

Measures of central tendency include the mean, median and mode. These measures summarize a given data set
by providing a single data point. These measures describe the center position of a distribution for a data set. We
analyze the frequency of each data point in the distribution and describe it using the mean, median or mode.
These measures provide a typical (average) value for the data set and can represent either the entire population or
a sample of the population.

Measures of variability or dispersion include the variance or standard deviation, coefficient of variation,
minimum and maximum values, IQR (Interquartile Range), skewness and kurtosis. These measures help us to
analyze how spread-out the distribution is for a dataset. So, they provide the shape of the data set.

2. Measures of central tendency -


Central tendency means a central value which describes a probability distribution. It may also be called the center
or location of the distribution. The most common measures of central tendency are the mean, median and mode.
The most common measure of central tendency is the mean. For a skewed distribution, or when there is concern
about outliers, the median may be preferred, as the median is a more robust measure than the mean.

Mean -
 The most common measure of central tendency is the mean.
 Mean is also known as the simple average.
 It is denoted by the Greek letter µ for a population and by x̄ for a sample.
 We can find mean of a number of elements by adding all the elements in a dataset and then dividing by
the number of elements in the dataset.
 It is the most common measure of central tendency but it has a drawback.
 The mean is affected by the presence of outliers.
 So, mean alone is not enough for making business decisions.

Median -
 Median is the number which divides the dataset into two equal halves.
 To calculate the median, we have to arrange our dataset of n numbers in ascending order.
 The median of this dataset is the number at the ((n+1)/2)th position, if n is odd.
 If n is even, then the median is the average of the (n/2)th and ((n/2)+1)th numbers.
 Median is robust to outliers.
 So, for skewed distribution or when there is concern about outliers, the median may be preferred.
Mode
 Mode of a dataset is the value that occurs most often in the dataset.
 Mode is the value that has the highest frequency of occurrence in the dataset.
There is no best measure that give us the complete picture. So, these measures of central tendency (mean,
median and mode) should be used together to represent the full picture.

3. Measures of dispersion or variability -


Dispersion is an indicator of how far away from the center the data values can be found. The most common
measures of dispersion are variance, standard deviation and interquartile range (IQR). Variance is the
standard measure of spread. The standard deviation is the square root of the variance. The variance and
standard deviation are two useful measures of spread.

Variance -
 Variance measures the dispersion of a set of data points around their mean value.
 It is the mean of the squares of the individual deviations.
 Variance gives results in the original units squared.
Standard deviation -
 Standard deviation is the most commonly used measure of variability.
 It is the square root of the variance.
 For normally distributed data, approximately 95% of the values lie within 2 standard deviations of the mean.
 Standard deviation gives results in the original units.
Procedure
1. Load the Dataset:

 Load a sample dataset containing numerical variables of interest.

2. Calculate Mean:

 Compute the arithmetic mean (average) of each numerical variable in the dataset.

3. Calculate Median:

 Determine the median, which is the middle value of a dataset when arranged in ascending order.

4. Calculate Mode:

 Identify the mode, which is the value that appears most frequently in the dataset.
5. Calculate Standard Deviation and Variance:

 Compute the standard deviation, a measure of the dispersion of values around the mean.
 Calculate the variance, which is the square of the standard deviation.

6. Interpretation:

 Interpret the computed statistics to understand the central tendency (mean, median, mode) and
variability (standard deviation, variance) of the dataset.
 Analyze the spread and distribution of data around the central values.
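A minimal sketch of these calculations using pandas; the values below are illustrative, not a prescribed dataset:

import pandas as pd

# Illustrative numerical data
data = pd.Series([12, 15, 15, 18, 20, 22, 22, 22, 25, 30])

print("Mean:", data.mean())
print("Median:", data.median())
print("Mode:", data.mode().tolist())      # a dataset can have more than one mode
print("Variance:", data.var())            # sample variance
print("Standard deviation:", data.std())  # square root of the variance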


Viva Questions-
1. What is the difference between mean, median, and mode?

 Answer: The mean is the average value of the dataset, the median is the middle value when the data is
ordered, and the mode is the value that appears most frequently.

2. When is the median a better measure of central tendency than the mean?

 Answer: The median is preferred when the dataset contains outliers or is skewed, as it is less affected
by extreme values.

3. How does standard deviation measure the spread of data?

 Answer: Standard deviation quantifies the dispersion of data points around the mean, with higher
values indicating greater variability.

4. What does a high variance imply about the data?

 Answer: High variance indicates that data points are spread out widely from the mean, suggesting
greater variability in the dataset.

5. Can a dataset have more than one mode?

 Answer: Yes, a dataset can have multiple modes if two or more values occur with the same highest
frequency.
Multiple Choice Questions (MCQs):
1. Which measure of central tendency is affected by outliers?

a. Mean
b. Median
c. Mode
 Answer: a) Mean

2. What is the value that appears most frequently in a dataset called?

a. Mean
b. Median
c. Mode
 Answer: c) Mode

3. What does standard deviation measure?

a) Central tendency

b) Variability

c) Spread of data

PRACTICAL 3
Objective – Understanding Data Cleaning (Treatment of Outliers and Missing Values)

Description
Data cleaning is a critical step in the data preprocessing pipeline, aimed at improving the quality and integrity of
the dataset. Missing values and outliers can adversely affect the results of data analysis and modeling, making it
essential to address them effectively. In this practical session, we will focus on techniques for treating missing
values and outliers.

Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and removing
any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is accurate,
consistent, and free of errors, as incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time in this step because of the
belief that “Better data beats fancier algorithms”

Procedure
1. Identify Missing Values:
 Load the dataset and identify missing values in each column.
 Determine the extent of missingness and assess its impact on the dataset.

2. Handle Missing Values:

 Impute missing values using appropriate techniques such as mean, median, mode imputation, or
predictive modeling.
 Alternatively, remove rows or columns with a high proportion of missing values if they cannot be
imputed accurately.

3. Detect Outliers:

 Use statistical methods or visualization techniques to detect outliers in the dataset.


 Plot boxplots or histograms to visualize the distribution of numerical variables and identify potential
outliers.
4. Handle Outliers:

 Apply techniques such as trimming, winsorization, or transforming the data to mitigate the impact of
outliers.
 Alternatively, remove outliers if they are deemed to be erroneous or irrelevant to the analysis.

5. Evaluate Data Quality:

 Assess the impact of data cleaning techniques on the quality and integrity of the dataset.
 Validate the effectiveness of the treatment of missing values and outliers in improving data reliability.
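A minimal sketch of this procedure, assuming pandas and an illustrative 'income' column (the data, column name, and thresholds are assumptions for demonstration):

import numpy as np
import pandas as pd

# Illustrative dataset containing a missing value and an outlier
df = pd.DataFrame({'income': [32000, 35000, np.nan, 37000, 40000, 250000]})

# Identify and impute missing values (median imputation here)
print(df.isnull().sum())
df['income'] = df['income'].fillna(df['income'].median())

# Detect outliers with the IQR rule
Q1, Q3 = df['income'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
print(df[(df['income'] < lower) | (df['income'] > upper)])

# Handle outliers by capping them at the IQR bounds (winsorization-style)
df['income'] = df['income'].clip(lower, upper)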

Viva Questions -
1. Why is it important to handle missing values in a dataset?

 Answer: Missing values can affect the accuracy and reliability of data analysis and modeling results,
leading to biased conclusions and erroneous insights.

2. What are some common techniques for handling missing values?

 Answer: Common techniques include mean, median, mode imputation, predictive modeling, or
removing rows/columns with missing values.

3. How can outliers affect statistical analysis?

 Answer: Outliers can skew statistical measures such as mean and standard deviation, leading to
inaccurate estimates and misleading interpretations of the data.

4. What is the difference between trimming and winsorization?

 Answer: Trimming involves removing extreme values from the dataset, while winsorization replaces
extreme values with less extreme values to reduce their impact.

5. How do you assess the effectiveness of data cleaning techniques?

 Answer: The effectiveness of data cleaning techniques can be evaluated by comparing summary
statistics, visualizations, or model performance before and after cleaning the data.

Multiple Choice Questions (MCQs):


1. Which technique is commonly used to impute missing values in a dataset?

a) Mean imputation
b) Median imputation
c) Mode imputation
d) All of the Above
 Answer: d) All of the above

2. What visualization technique can be used to detect outliers in numerical variables?

a) Scatter plot
b) Boxplot
c) Histogram
 Answer: b) Boxplot

3. What is the purpose of winsorization?

a) To remove outliers from the dataset


b) To replace outliers with less extreme values
c) To transform the data distribution
 Answer: b) To replace outliers with less extreme values

4. When is it appropriate to remove outliers from a dataset?

a) When they are valid data points


b) When they are erroneous or irrelevant to the analysis
c) When they are within 1.5 times the interquartile range
 Answer: b) When they are erroneous or irrelevant to the analysis

5. What is the primary goal of data cleaning?

a) To increase the size of the dataset


b) To improve the quality and integrity of the dataset
c) To introduce bias into the dataset
 Answer: b) To improve the quality and integrity of the dataset



PRACTICAL 4
Data Discretization -
Data discretization is the process of converting continuous data into discrete or categorical form. In other words,
it involves partitioning a continuous attribute into a finite number of intervals or bins. This is particularly useful
when dealing with numerical data that has a large range or when working with algorithms that require
categorical inputs.

Data discretization simplifies the data representation, reduces noise, and makes it more understandable for
certain algorithms, especially those that work well with categorical data or have assumptions about the data
distribution.

Data Transformation -
Data transformation involves modifying the original dataset to make it more suitable for analysis or modeling.
This process includes a variety of operations aimed at improving data quality, reducing noise, and preparing the
data for specific algorithms or analyses.

Data transformation is essential for preparing the data for analysis and modeling, improving the performance of
machine learning algorithms, and ensuring that the underlying assumptions of the models are met.
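As a brief illustration of these ideas, the sketch below (assuming numpy and scikit-learn are available; the values are made up) applies standardization and a log transformation to a small skewed sample:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative right-skewed data
values = np.array([[1.0], [2.0], [2.5], [3.0], [50.0]])

# Standardization: rescale to zero mean and unit variance
scaled = StandardScaler().fit_transform(values)
print(scaled.ravel())

# Log transformation: compresses the long right tail
log_values = np.log1p(values)
print(log_values.ravel())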

Procedure –
import math
from collections import OrderedDict

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# X_dict will store the data in sorted order
X_dict = OrderedDict()

# x_old will store the original data
x_old = {}

# x_new will store the data after binning
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# sort the data by value so that equal-frequency bins can be formed
X_dict = OrderedDict(sorted(X_dict.items(), key=lambda item: item[1]))

# list of bin means
binn = []

# a variable to find the mean of each bin
avrg = 0

i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# performing binning (smoothing by bin means)
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = 0
        avrg = avrg + h
        i = i + 1

# the last bin may hold fewer values than the others
rem = len(x) % num_of_data_in_each_bin
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))

# store the new (smoothed) value of each data point
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = binn[j]
        i = i + 1

print("number of data in each bin")
print(math.ceil(len(x) / bi))

for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

Multiple Choice Questions -


1. Which of the following techniques is used for transforming skewed data to approximate a normal
distribution?

a) Standardization
b) Normalization
c) Log Transformation
d) Min-Max Scaling
 Answer: C) Log Transformation

2. What is the purpose of data discretization?


a) To transform continuous data into categorical data
b) To remove outliers from the dataset
c) To standardize the data
d) To reduce the dimensionality of the dataset
 Answer: A) To transform continuous data into categorical data

3. Which of the following methods is NOT used for data discretization?

a) Equal-width binning
b) Equal-frequency binning
c) k-means clustering
d) Decision tree induction
 Answer: C) k-means clustering

4. Which technique is used to transform categorical variables into numerical variables?

a) One-Hot Encoding
b) Label Encoding
c) Feature Scaling
d) PCA (Principal Component Analysis)
 Answer: A) One-Hot Encoding

5. In which scenario would you prefer normalization over standardization?

a) When the data has outliers


b) When the data is normally distributed
c) When the data has a non-linear distribution
d) When the data is already in a similar scale
 Answer: D) When the data is already in a similar scale

Viva Questions -
1. Can you explain the concept of data transformation in the context of machine learning?

 Answer: Data transformation involves altering the original dataset to make it more suitable for
analysis or modeling. This process can include various operations such as normalization,
standardization, encoding categorical variables, handling missing values, and transforming skewed
distributions.

2. What are the common techniques used for data transformation, and when would you use each one?

 Answer: Common techniques include:

- Normalization and standardization for scaling numerical features.

- One-hot encoding for converting categorical variables into numerical format.

- Log transformation for handling skewed data distributions.

The choice of technique depends on the nature of the data and the requirements of the machine
learning algorithm being used.

3. How does log transformation help in dealing with skewed data distributions?

 Answer: Log transformation is used to reduce the variability in data with a skewed distribution. It
compresses the range of the data, making it more symmetrical and approximate to a normal
distribution. This can help improve the performance of models that assume normality or require
normally distributed features.

4. Discuss the importance of data discretization in data preprocessing.

 Answer: Data discretization is important for converting continuous data into discrete or categorical form. It helps in
simplifying the data representation, reducing noise, and making the data more understandable for certain
algorithms, especially those that work well with categorical data or have assumptions about the data
distribution.

5. Explain the difference between equal-width binning and equal-frequency binning.

 Answer:- Equal-width binning divides the range of the data into a fixed number of bins of equal
width, while equal-frequency binning divides the data into bins such that each bin contains
approximately the same number of data points.

Equal-width binning may not ensure an equal distribution of data points in each bin, while equal-
frequency binning can handle skewed distributions more effectively.


PRACTICAL 5
Objective – Correlation between variables using heat map

Description
Correlation analysis is a fundamental technique used in statistics to determine the strength and direction of the
relationship between two variables. In this practical session, we will focus on visualizing the correlation matrix
of numerical variables using a heat map. The heat map provides a graphical representation of the correlation
coefficients, making it easier to identify patterns and dependencies in the data.

Procedure
1. Load the Dataset:

 Load a sample dataset containing numerical variables of interest.

2. Calculate Correlation Matrix:

 Compute the correlation matrix using Pearson correlation coefficient or other suitable methods.
 Pearson correlation coefficient measures the linear relationship between two variables, ranging from -1
to 1.

3. Create Heat Map:

 Use a data visualization library such as `seaborn` or `matplotlib` to create a heat map of the correlation
matrix.
 Customize the heat map to enhance readability, such as annotating each cell with correlation
coefficients.
4. Interpretation:

 Analyze the heat map to identify strong positive correlations (values close to 1), indicating that
variables move in the same direction.
 Identify strong negative correlations (values close to -1), indicating that variables move in opposite
directions.
 Note variables with weak correlations (values close to 0), indicating little to no linear relationship.

5. Explore Further Analysis:

 Conduct additional analysis based on the insights gained from the correlation heat map, such as feature
selection, regression modeling, or identifying multicollinearity.
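A minimal sketch of this procedure using seaborn; the DataFrame and column names below are illustrative assumptions:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative numerical dataset: y is positively correlated with x, z is roughly independent
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 0.8 * df['x'] + rng.normal(scale=0.5, size=100)
df['z'] = rng.normal(size=100)

# Pearson correlation matrix and annotated heat map
corr = df.corr(method='pearson')
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heat Map')
plt.show()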


Viva Questions
1. What is correlation, and why is it important in data analysis?

Answer: Correlation measures the strength and direction of the relationship between two variables. It helps in
understanding how changes in one variable relate to changes in another variable.

2. How is correlation coefficient interpreted in a heat map?

 Answer: The correlation coefficient ranges from -1 to 1, where values close to 1 indicate a strong
positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0
indicate weak or no correlation.

3. What does a high positive correlation imply?

 Answer: A high positive correlation indicates that as one variable increases, the other variable also
tends to increase.

4. Can correlation analysis determine causation between variables?

 Answer: No, correlation analysis only identifies the strength and direction of the relationship between
variables. It does not imply causation.

5. How can multicollinearity affect regression analysis?

 Answer: Multicollinearity, where independent variables are highly correlated, can lead to inflated
standard errors and unstable coefficients in regression analysis, making it difficult to interpret the
effects of individual predictors.

Multiple Choice Questions


1. What range of values does the correlation coefficient lie between?

a) specified range
b) 0 to 1
c) -1 to 1
 Answer: c) -1 to 1

2. A correlation coefficient close to 1 indicates:

a) Strong positive correlation


b) Low positive correlation
c) strong negative correlation
 Answer: a) Strong positive correlation

3. Which type of correlation indicates that variables move in opposite directions?

a) Positive correlation
b) Negative correlation
c) None
 Answer: b) Negative correlation

4. What technique is commonly used to visualize the correlation matrix?

a) Scatter plot
b) Violin plot
c) Distribution plot
d) Heat map
 Answer: d) Heat map

5. Can correlation analysis determine the cause-and-effect relationship between variables?

a) Yes
b) No
 Answer: b) No


PRACTICAL – 6
Objective – Implementation of Apriori Algorithm
The Apriori algorithm is a machine learning algorithm used to gain insight into the structured relationships
between the different items involved. Its most prominent practical application is recommending products based on
the products already present in the user's cart. Walmart in particular has made great use of the algorithm in
suggesting products to its users.
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Changing the working location to the location of the file
# (in a Jupyter notebook: %cd C:\Users\Dev\Desktop\Kaggle\Apriori Algorithm)

# Loading the Data
data = pd.read_excel('Online_Retail.xlsx')
data.head()

# Exploring the columns of the data
data.columns

# Exploring the different regions of transactions
data.Country.unique()

# Stripping extra spaces in the description
data['Description'] = data['Description'].str.strip()

# Dropping the rows without any invoice number
data.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')

# Dropping all transactions which were done on credit
data = data[~data['InvoiceNo'].str.contains('C')]

# Transactions done in France
basket_France = (data[data['Country'] == "France"]
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().reset_index().fillna(0)
                 .set_index('InvoiceNo'))

# Transactions done in the United Kingdom
basket_UK = (data[data['Country'] == "United Kingdom"]
             .groupby(['InvoiceNo', 'Description'])['Quantity']
             .sum().unstack().reset_index().fillna(0)
             .set_index('InvoiceNo'))

# Transactions done in Portugal
basket_Por = (data[data['Country'] == "Portugal"]
              .groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack().reset_index().fillna(0)
              .set_index('InvoiceNo'))

# Transactions done in Sweden
basket_Sweden = (data[data['Country'] == "Sweden"]
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().reset_index().fillna(0)
                 .set_index('InvoiceNo'))

# Defining the hot encoding function to make the data suitable
# for the concerned libraries
def hot_encode(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

# Encoding the datasets
basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded

basket_encoded = basket_UK.applymap(hot_encode)
basket_UK = basket_encoded

basket_encoded = basket_Por.applymap(hot_encode)
basket_Por = basket_encoded

basket_encoded = basket_Sweden.applymap(hot_encode)
basket_Sweden = basket_encoded

# Building the model for France
frq_items = apriori(basket_France, min_support=0.05, use_colnames=True)

# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())

# Rules for the United Kingdom
frq_items = apriori(basket_UK, min_support=0.01, use_colnames=True)
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())

# Rules for Portugal
frq_items = apriori(basket_Por, min_support=0.05, use_colnames=True)
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())

# Rules for Sweden
frq_items = apriori(basket_Sweden, min_support=0.05, use_colnames=True)
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())

Viva Questions
1. What is the Apriori algorithm?

 Answer: The Apriori algorithm is a classic association rule mining algorithm used to discover
frequent itemsets in transactional databases. It is based on the principle of Apriori property, which
states that any subset of a frequent itemset must also be frequent.

2. What is the objective of the Apriori algorithm?

 Answer: The objective of the Apriori algorithm is to identify frequent itemsets, which are sets of
items that frequently co-occur in transactions, and to generate association rules that describe
relationships between items

3. How does the Apriori algorithm work?

 Answer: The Apriori algorithm works in two main steps:


- 1. Generating candidate itemsets: Initially, individual items are considered frequent. Then, candidate
itemsets of length k are generated from frequent itemsets of length k-1 by joining and pruning.
- 2. Pruning infrequent itemsets: Candidate itemsets are scanned through the transaction database to count
their support (occurrence frequency). Itemsets below a specified minimum support threshold are pruned.

4. What are association rules in the context of the Apriori algorithm?

 Answer: Association rules are rules that describe relationships between items in transactional datasets.
They consist of an antecedent (left-hand side) and a consequent (right-hand side), separated by a '->'
symbol, indicating a conditional implication.

5. How are association rules evaluated in the Apriori algorithm?

 Answer: Association rules are evaluated based on metrics such as support, confidence, and lift.
Support measures the frequency of occurrence of an itemset, confidence measures the conditional
probability of the consequent given the antecedent, and lift measures the strength of association
between the antecedent and the consequent.

Multiple Choice Questions (MCQs) -


1. What is the main objective of the Apriori algorithm?

a) Regression analysis
b) Classification
c) Frequent itemset mining
d) Clustering
 Answer: c) Frequent itemset mining

2. How does the Apriori algorithm generate candidate itemsets?

a) By randomly selecting items


b) By joining and pruning frequent itemsets
c) By calculating item frequencies
d) By sorting items alphabetically
 Answer: b) By joining and pruning frequent itemsets

3. What is the Apriori property?

a) Any subset of a frequent itemset is frequent


b) Any subset of a frequent itemset is infrequent
c) Any frequent itemset is a subset of another frequent itemset
d) Any frequent itemset is a superset of another frequent itemset
 Answer: a) Any subset of a frequent itemset is frequent
PRACTICAL – 7
Objective – Implementation of Bayes Classification Algorithm

What is Naive Bayes Classifiers?


Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single
algorithm but a family of algorithms that share a common principle, i.e. every pair of features being classified is
assumed to be independent of each other. To illustrate, the code below uses the Iris dataset.
One of the simplest and most effective classification algorithms, the Naïve Bayes classifier aids in the rapid
development of machine learning models with fast prediction capabilities.
The Naïve Bayes algorithm is used for classification problems and is heavily used in text classification, where the
data is high-dimensional (each word represents one feature). It is used in spam filtering, sentiment detection,
rating classification, etc. The main advantage of naïve Bayes is its speed: it is fast, and making predictions is easy
even with high-dimensional data.
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)

Gaussian Naive Bayes model accuracy(in %): 95.0

Viva Questions -
1. What is the Naive Bayes classifier?

 Answer: The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes'
theorem, which assumes independence between features. It is commonly used for classification tasks,
especially in text classification and spam filtering.

2. What is the underlying assumption of the Naive Bayes classifier?

 Answer: The Naive Bayes classifier assumes that the features are conditionally independent given the
class label. This means that the presence of a particular feature in a class is independent of the
presence of other features.

3. How does the Naive Bayes classifier make predictions?

 Answer: The Naive Bayes classifier calculates the probability of each class label given the input
features using Bayes' theorem and selects the class label with the highest posterior probability as the
predicted class.

4. What are the different types of Naive Bayes classifiers?

 Answer: The different types of Naive Bayes classifiers include:

Gaussian Naive Bayes: Assumes features follow a Gaussian distribution.

Multinomial Naive Bayes: Suitable for discrete features (e.g., word counts in text
classification).

Bernoulli Naive Bayes: Assumes features are binary (e.g., presence or absence of a
feature).

5. How does the choice of Naive Bayes variant affect classification performance?

 Answer: The choice of Naive Bayes variant depends on the nature of the features and the underlying
distribution of the data. Gaussian Naive Bayes is suitable for continuous features, while Multinomial
and Bernoulli Naive Bayes are suitable for discrete features.

Multiple Choice Questions (MCQs)


1. What is the main principle behind the Naive Bayes classifier?

a) Minimizing within-cluster variance


b) Maximizing between-cluster variance
c) Maximizing posterior probability
d) Fitting a decision boundary
 Answer: c) Maximizing posterior probability

2. What type of distribution does Gaussian Naive Bayes assume for features?

a) Normal distribution
b) Uniform distribution
c) Binomial distribution
d) Poisson distribution
 Answer: a) Normal distribution

3. When is Multinomial Naive Bayes typically used?

a) When features are continuous


b) When features are binary
c) When features are counts or frequencies
d) When features are categorical
 Answer: c) When features are counts or frequencies

4. What does the Bernoulli Naive Bayes classifier assume about features?

a) Features follow a normal distribution


b) Features are binary
c) Features are continuous
d) Features are categorical
 Answer: b) Features are binary

PRACTICAL 8
Objective – Implementation of K-Nearest-Neighbour Algorithm
This algorithm is used to solve classification problems. The K-nearest neighbor (K-NN) algorithm essentially
creates an imaginary boundary to classify the data: when a new data point comes in, the algorithm predicts its
class from the nearest points around it.
A larger k value therefore gives smoother curves of separation, resulting in less complex models, whereas a
smaller k value tends to overfit the data, resulting in more complex models.
In the example shown below, the following steps are performed:
1. The k-nearest neighbor algorithm is imported from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using a chosen neighbors value.
5. Train or fit the data into the model.
6. Predict the class of new data points.

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Calculate the accuracy of the model
print(knn.score(X_test, y_test))

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over K values
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Compute training and test data accuracy
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

Viva Questions
1. What is the k-nearest neighbor (KNN) classifier?

 Answer: The k-nearest neighbor (KNN) classifier is a supervised machine learning algorithm that
classifies new data points based on the majority class of their k nearest neighbors in the feature space.
2. How does the KNN algorithm classify new data points?

 Answer: To classify a new data point, the KNN algorithm finds its k nearest neighbors in the feature
space based on a distance metric (e.g., Euclidean distance) and assigns the majority class among these
neighbors to the new data point.
3. What are the key hyperparameters of the KNN algorithm?

 Answer: The key hyperparameters of the KNN algorithm include:


Number of neighbors (k)
Distance metric (e.g., Euclidean distance, Manhattan distance)
4. How does the choice of the number of neighbors (k) affect the KNN classifier?

 Answer: The choice of the number of neighbors (k) affects the bias-variance trade-off of the KNN
classifier. Smaller values of k result in low bias but high variance, leading to overfitting, while larger
values of k result in high bias but low variance, leading to underfitting.
5. What are the advantages and disadvantages of the KNN algorithm?
 Answer: Advantages of the KNN algorithm include simplicity, effectiveness for multi-class
classification, and ability to handle non-linear decision boundaries. Disadvantages include sensitivity
to the choice of k, computational inefficiency for large datasets, and susceptibility to irrelevant
features.

Multiple Choice Questions (MCQs)


1. What is the main principle behind the KNN algorithm?
a) Minimizing within-cluster variance
b) Maximizing between-cluster variance
c) Assigning majority class based on nearest neighbors
d) Fitting a decision boundary
 Answer: c) Assigning majority class based on nearest neighbors
2. How is the distance between data points calculated in the KNN algorithm?
a) Euclidean distance
b) Manhattan distance
c) Minkowski distance
d) All of the above
 Answer: d) All of the above
3. What is the role of the number of neighbors (k) in the KNN algorithm?
a) Determines the size of the feature space
b) Affects the bias-variance trade-off
c) Defines the decision boundary
d) Determines the distance metric
 Answer: b) Affects the bias-variance trade-off

PRACTICAL – 9
Objective – Implementation of K-means Clustering Algorithm

What is K-means Clustering?
Unsupervised machine learning is the process of teaching a computer to use unlabeled, unclassified data and
enabling the algorithm to operate on that data without supervision. Without any previous training, the machine's
job in this case is to organize unsorted data according to similarities, patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the center of the
clusters. It starts by randomly placing the cluster centroids in the space. Each data point is then assigned to one of
the clusters based on its distance from the cluster centroid. After all points have been assigned, new cluster
centroids are computed. This process runs iteratively until good clusters are found. In this analysis we assume that
the number of clusters is given in advance and we have to put each point into one of the groups.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages
of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
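A minimal sketch of these steps using scikit-learn's KMeans on synthetic data (the dataset and the choice of K = 3 are illustrative assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative 2-D data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-means with K = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clustered points and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100)
plt.title('K-means Clustering (K = 3)')
plt.show()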


Viva Questions -
1. What is K-means clustering?

 Answer: K-means clustering is a partitioning method that divides a dataset into 'K' distinct, non-
overlapping clusters. It aims to minimize the within-cluster variance by iteratively assigning data
points to the nearest cluster centroid and updating the centroids based on the mean of data points
assigned to each cluster.

2. What are the key steps involved in the K-means clustering algorithm?

 Answer: The key steps in the K-means clustering algorithm include:

Initialization: Randomly select 'K' centroids as the initial cluster centers.

Assignment: Assign each data point to the nearest centroid to form 'K' clusters.

Update centroids: Recalculate the centroid of each cluster based on the mean of data points
assigned to that cluster.

Repeat: Iterate the assignment and centroid update steps until convergence, where centroids no
longer change significantly

3. How does K-means handle outliers or noisy data points?

 Answer: K-means clustering is sensitive to outliers, as they can disproportionately affect the position
of cluster centroids. Outliers may form separate clusters or be assigned to the nearest cluster, leading
to suboptimal clustering results.


Multiple Choice Questions (MCQs) -


1. What is the main objective of the K-means clustering algorithm?

a) Maximize within-cluster variance


b) Minimize within-cluster variance
c) Maximize between-cluster variance
d) Minimize between-cluster variance
 Answer: b) Minimize within-cluster variance

2. How are cluster centroids initialized in the K-means algorithm?

a) Randomly
b) At the center of the dataset
c) At the mean of each feature
d) At the maximum distance from each other
 Answer: a) Randomly

3. What is the convergence criterion in K-means clustering?

a) Maximum number of iterations


b) Minimum change in centroids
c) Maximum change in centroids
d) Minimum distance between centroids
 Answer: b) Minimum change in centroids

4. What happens to outliers in the K-means clustering process?

a) Assigned to the nearest cluster


b) Ignored
c) Form separate clusters
d) Removed from the dataset
 Answer: a) Assigned to the nearest cluster

5. Which method can be used to determine the optimal number of clusters in K-means?

a) Elbow method
b) Silhouette score
c) Gap statistics
d) All of the above
 Answer: d) All of the above

PRACTICAL – 10
Objective – Implementing Hierarchical Methods

Definition of Hierarchical Clustering

A hierarchical clustering approach is based on the determination of successive clusters based on previously
defined clusters. It's a technique aimed more toward grouping data into a tree of clusters called dendrograms,
which graphically represents the hierarchical relationship between the underlying clusters.

The general steps of the hierarchical clustering approach are as follows:

1. Compute the distance matrix containing the distance between each pair of data points using a particular
distance metric such as Euclidean distance, Manhattan distance, or cosine similarity. But the default distance
metric is the Euclidean one.

2. Merge the two clusters that are closest in distance.

3. Update the distance matrix with regard to the new clusters.

4. Repeat steps 1, 2, and 3 until all the clusters are merged together to create a single cluster.

# Install the required libraries first (run in a terminal):
# pip install scikit-learn pandas matplotlib seaborn scipy

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

loan_data = pd.read_csv("loan_data.csv")

loan_data.head()

loan_data.info()

percent_missing = round(100 * (loan_data.isnull().sum()) / len(loan_data), 2)
percent_missing

# Drop the categorical purpose column and the target column before clustering
cleaned_data = loan_data.drop(['purpose', 'not.fully.paid'], axis=1)

cleaned_data.info()

def show_boxplot(df):
    plt.rcParams['figure.figsize'] = [14, 6]
    sns.boxplot(data=df, orient="v")
    plt.title("Outliers Distribution", fontsize=16)
    plt.ylabel("Range", fontweight='bold')
    plt.xlabel("Attributes", fontweight='bold')

show_boxplot(cleaned_data)

def remove_outliers(data):
    df = data.copy()
    for col in list(df.columns):
        Q1 = df[str(col)].quantile(0.05)
        Q3 = df[str(col)].quantile(0.95)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[str(col)] >= lower_bound) &
                (df[str(col)] <= upper_bound)]
    return df

without_outliers = remove_outliers(cleaned_data)

without_outliers.shape

from sklearn.preprocessing import StandardScaler

data_scaler = StandardScaler()

scaled_data = data_scaler.fit_transform(without_outliers)

scaled_data.shape

from scipy.cluster.hierarchy import linkage, dendrogram

complete_clustering = linkage(scaled_data, method="complete", metric="euclidean")

average_clustering = linkage(scaled_data, method="average", metric="euclidean")

single_clustering = linkage(scaled_data, method="single", metric="euclidean")

dendrogram(complete_clustering)
plt.show()

dendrogram(average_clustering)
plt.show()

dendrogram(single_clustering)
plt.show()

Viva Questions:
1. What are hierarchical clustering methods?

 Answer: Hierarchical clustering methods are algorithms that build a hierarchy of clusters by
recursively dividing or merging data points based on their similarity or dissimilarity.

2. What are the two main approaches to hierarchical clustering?

 Answer: The two main approaches to hierarchical clustering are agglomerative (bottom-up) and
divisive (top-down) clustering.

3. How does agglomerative hierarchical clustering work?


 Answer: Agglomerative hierarchical clustering starts with each data point as a separate cluster and
then iteratively merges the most similar clusters until only one cluster remains.

4. What distance metrics are commonly used in hierarchical clustering?

 Answer: Common distance metrics include Euclidean distance, Manhattan distance, and cosine
similarity, depending on the nature of the data and the problem domain.

5. How are clusters represented in a dendrogram?

 Answer: Clusters are represented as branches in a dendrogram, with the height of each branch
indicating the distance or dissimilarity between clusters at each step of the clustering process.

Multiple Choice Questions (MCQs) -


1. Which approach to hierarchical clustering starts with each data point as a separate cluster?

a) Divisive clustering
b) Agglomerative clustering
c) K-means clustering
d) DBSCAN clustering
 Answer: b) Agglomerative clustering

2. What is the main goal of hierarchical clustering?

a) Minimize within-cluster variance


b) Maximize between-cluster variance
c) Maximize within-cluster variance
d) Minimize between-cluster variance
 Answer: a) Minimize within-cluster variance

3. What is the linkage criterion used to determine the distance between clusters in hierarchical clustering?

a) Centroid linkage
b) Complete linkage
c) Single linkage
d) Ward's linkage
 Answer: Any of these can be used; centroid, complete, single, and Ward's linkage are all valid linkage criteria.

4. In hierarchical clustering, how is the distance between clusters typically measured?

a) By the mean distance between all pairs of points in the clusters


b) By the maximum distance between any two points in the clusters
c) By the minimum distance between any two points in the clusters
d) By the sum of squared distances between all pairs of points in the clusters
 Answer: Any of the above, depending on the linkage criterion (average, complete, single, or Ward's linkage respectively).

PRACTICAL - 11
Objective – Density-Based Methods (DBSCAN)
Clustering analysis, or simply clustering, is an unsupervised learning method that divides the data points into a
number of specific batches or groups, such that the data points in the same group have similar properties and data
points in different groups have dissimilar properties in some sense. It comprises many different methods that differ
mainly in how similarity between points is measured.
E.g. K-means (distance between points), affinity propagation (graph distance), mean-shift (distance between
points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers),
spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach: first we calculate similarities, and then we use them
to cluster the data points into groups or batches. Here we will focus on the Density-Based Spatial Clustering
of Applications with Noise (DBSCAN) clustering method.

Steps Used In DBSCAN Algorithm

1. Find all the neighboring points within eps and identify the core points, i.e. points with more than MinPts
neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of
points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if
b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is
density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster
are treated as noise.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data in X
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.50, random_state=0)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(colors)

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # core points of this cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # border (non-core) points of this cluster
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)

plt.show()

Viva Questions -
1. What are density-based methods in data mining?

 Answer: Density-based methods are clustering algorithms that partition the dataset based on the
density of data points in the feature space. These methods identify clusters as regions of high data
density separated by regions of low density.

2. Can you name a popular density-based clustering algorithm?

 Answer: One popular density-based clustering algorithm is DBSCAN (Density-Based Spatial


Clustering of Applications with Noise).

3. How does DBSCAN determine clusters in the dataset?

 Answer: DBSCAN identifies clusters by grouping together data points that are closely packed and
separated by regions of low density. It defines clusters as areas of sufficiently high density, separated
by areas of low density or noise.

4. What are the key parameters of DBSCAN?

 Answer: The key parameters of DBSCAN are:

Epsilon (ε): The maximum distance between two points for them to be considered as part of the
same cluster.

Minimum points (MinPts): The minimum number of points required to form a dense region (core
point) in the dataset.

5. How does DBSCAN handle outliers or noise in the dataset?

 Answer: DBSCAN identifies outliers or noise as data points that do not belong to any cluster. These
points are not considered as part of any cluster and are labeled as noise.

Multiple Choice Questions (MCQs) -


1. Which clustering algorithm is based on the density of data points?

a) K-means
b) DBSCAN
c) Hierarchical clustering
d) Mean-shift
 Answer: b) DBSCAN

2. What is the key parameter in DBSCAN that defines the maximum distance between two points for
them to be considered as part of the same cluster?

a) K
b) Epsilon (ε)
c) MinPts
d) Silhouette coefficient
 Answer: b) Epsilon (ε)

3. What does a core point represent in DBSCAN?

a) A point in the dataset


b) A point with the maximum density in a cluster
c) A noisy point
d) A point with the minimum density in a cluster
 Answer: b) A point with the maximum density in a cluster
