
SRM INSTITUTE OF SCIENCE & TECHNOLOGY

(Deemed to be University)
VADAPALANI CAMPUS

Faculty of Engineering and Technology


Department of Computer Science and Engineering

LABORATORY PRACTICAL RECORD


20CSE526J - DATA MINING CONCEPTS

STUDENT NAME : YOUR NAME

REGISTER NO. : RA23120050400--

DEGREE / BRANCH : M.Tech / CSE

YEAR / SEM : I / II
BONAFIDE CERTIFICATE

Certified to be the bonafide record of work done by “YOUR NAME
(RA23120050400--)” of the Department of Computer Science and Engineering, M.Tech
degree course, in the practical of 20CSE526J - Data Mining Concepts at
SRMIST, Vadapalani during the academic year 2023 - 2024.

Date: Faculty In-Charge HOD-CSE

Submitted for End Semester Examination held in _____________________

________________________________________ SRMIST, Vadapalani.

Date: Examiner - I Examiner - II


INDEX
Exp.No. | Date | Title of the Experiment | Page No. | Signature
01 | 29.01.2024 | Creation of Tables and Preprocessing | 01
02 | 05.02.2024 | Normalizing the Tables Using Knowledge Flow | 04
03 | 12.02.2024 | Finding Frequent Data Item Sets | 07
04 | 19.02.2024 | Discretization Filtering and Resample Filtering on Sample Data | 10
05 | 26.02.2024 | Construction of Decision Tree | 12
06 | 11.02.2024 | Implementation of Apriori Algorithm With Different Support and Confidence Values | 16
07 | 11.02.2024 | Deriving Interesting Insights and Observe the Effect of Discretization in the Rule Generated Using Apriori Algorithm | 18
08 | 18.03.2024 | Implementation of ID3 Classification Algorithm | 25
09 | 18.03.2024 | Naive-Bayes Classification and K-Nearest Neighbor Classification | 30
10 | 25.03.2024 | Compare the Performance of ID3, Naive Bayes and K-Nearest Neighbor Classification Algorithms | 34
11 | 25.03.2024 | Implementation of K-Means Clustering Algorithm | 37
12 | 02.04.2024 | Exploring Visualization Features | 40
13 | 02.04.2024 | Building Linear Regression Model and Derive Interesting Patterns | 44
14 | 10.04.2024 | Building of Multi Linear Regression Model | 47
15 | 18.04.2024 | Classification and Clustering of a Sample Data Set | 49


EX NO : 01 CREATION OF TABLES AND PREPROCESSING

DATE :

AIM
To implement a Python program for
a) creation of tables
b) preparation of data for machine learning using data preprocessing methods with
scikit-learn

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library

CONCEPT
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.

ALGORITHM :
1. Define the table structure:
● Determine the columns or fields required for your table.
● Define the data types for each column (e.g., text, numeric, date).

2. Create the table:


● Depending on your programming language and database system, use the
appropriate commands or libraries to create a table with the defined structure.
● Specify any additional constraints or indexing requirements.

3. Load or import data:


● Retrieve the data from the desired source (e.g., CSV file, database, API).
● Parse and read the data into memory.

4. Data cleaning:
● Identify and handle missing or null values.
● Remove or handle any outliers or inconsistencies in the data.
● Standardize or normalize data formats (e.g., converting dates to a consistent
format).

5. Data transformation:
● Apply any necessary transformations to the data based on your analysis
requirements (e.g., converting categorical variables to numeric, scaling
numeric features).
● Perform feature engineering if needed (e.g., creating new columns,
aggregating data).

6. Data integration:
● If you have multiple data sources, merge or join them based on common
columns or keys.
● Ensure data consistency and resolve any conflicts or duplicate entries.

7. Data sampling or splitting:


● If needed, divide the data into training, validation, and testing sets for
machine learning or analysis purposes.
● Randomly sample data if necessary.

8. Save or export the preprocessed data:


● Depending on your needs, save the preprocessed data to a file format (e.g.,
CSV, Excel) or store it in a database for further analysis.
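Steps 7 and 8 above are not exercised by the record's program below; a minimal supplementary sketch of splitting and saving a preprocessed table (assuming scikit-learn is installed, with illustrative column names rather than a prescribed dataset) could look like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# A small, hypothetical preprocessed table
df = pd.DataFrame({'age': [25, 30, 40, 20, 35],
                   'gender': [0, 1, 0, 1, 1],
                   'income': [50000, 60000, 80000, 40000, 70000]})

# Step 7: split into training and testing sets (80/20, fixed seed for reproducibility)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Step 8: save the preprocessed splits for further analysis
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)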

PROGRAM CODE:
import pandas as pd
# create a sample table
data = {'id': [1, 2, 3, 4, 5],
'name': ['John', 'Mary', 'Bob', 'Alice', 'Jane'],
'age': [25, 30, 40, 20, 35],
'gender': ['M', 'F', 'M', 'F', 'F'],
'income': [50000, 60000, 80000, 40000, 70000]}
df = pd.DataFrame(data)

# display the original table


print('Original Table:')
print(df)

# remove the 'id' column


df = df.drop('id', axis=1)

# encode the 'gender' column as binary values


df['gender'] = df['gender'].map({'M': 0, 'F': 1})

# normalize the 'age' and 'income' columns


df['age'] = (df['age'] - df['age'].mean()) / df['age'].std()
df['income'] = (df['income'] - df['income'].min()) / (df['income'].max() -
df['income'].min())

# display the preprocessed table


print('\nPreprocessed Table:')
print(df)

OUTPUT:

RESULT:
Thus, the programs are implemented successfully for table creation and data
pre-processing.

EX NO : 02 NORMALIZING THE TABLES USING KNOWLEDGE FLOW

DATE :

AIM
To write and implement a Python program to impute missing values with various
techniques on a given dataset:
a) remove rows/attributes
b) replace with the mean or mode
c) perform transformation of data using discretization (binning) on the given
dataset.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

ALGORITHM :

1. Identify the entity:


● Determine the entity or object that the table represents. For example, if you
have a table of employees, the entity could be "Employee."

2. Identify the attributes:


● Identify all the attributes (columns) of the table that describe the entity. For
the employee table, attributes might include "EmployeeID," "FirstName,"
"LastName," "Email," and so on.

3. Identify the functional dependencies:


● Determine the functional dependencies between the attributes. A functional
dependency exists when the value of one attribute determines the value of
another attribute. For example, in the employee table, "EmployeeID"
determines "FirstName," "LastName," and "Email."

4. Create a primary key:


● Select a primary key for the table. The primary key uniquely identifies each
record in the table. It can be a single attribute or a combination of attributes.
In the employee table, "EmployeeID" could be chosen as the primary key.

5. Remove partial dependencies:


● Identify any partial dependencies, where an attribute depends on only part of
the primary key. If any exist, create separate tables to eliminate these
dependencies.

6. Remove transitive dependencies:
● Identify any transitive dependencies, where an attribute depends on another
attribute that is not part of the primary key. If any exist, create separate tables
to eliminate these dependencies.

7. Create foreign keys:


● Identify relationships between the tables and create foreign keys to maintain
referential integrity.

8. Repeat the process:


● If there are still dependencies remaining, repeat steps 5-7 until all tables are
normalized to the desired level (typically third normal form, 3NF).

9. Finalize the schema:


● Once all tables are normalized, review the schema and make any necessary
adjustments or optimizations based on the specific requirements of your
database system and application.
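The program below demonstrates scaling with StandardScaler only; for parts (a) and (b) of the aim, a minimal sketch of removing rows and imputing missing values with the mean or mode using plain pandas (illustrative data, not the record's dataset) might be:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 20, 35],
                   'city': ['Chennai', 'Delhi', None, 'Chennai', 'Delhi']})

# (a) Remove rows that contain any missing value
df_dropped = df.dropna()

# (b) Replace missing numeric values with the mean and categorical values with the mode
df_imputed = df.copy()
df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].mean())
df_imputed['city'] = df_imputed['city'].fillna(df_imputed['city'].mode()[0])

print(df_dropped)
print(df_imputed)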

PROGRAM CODE:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Create a sample dataset with continuous values
data = {'feature1': [1.5, 3.0, 2.0, 4.5],
'feature2': [10, 20, 30, 40]}
df = pd.DataFrame(data) # Instantiate the StandardScaler object
scaler = StandardScaler() # Fit and transform the data
df_normalized = pd.DataFrame(scaler.fit_transform(df),
columns=df.columns)

# Print the original and normalized dataframes


print('Original Data:')
print(df.head())
print('\nNormalized Data:')
print(df_normalized.head())

OUTPUT:

RESULT:
Thus, the values are normalized using knowledge flow.

EX NO : 03 FINDING FREQUENT DATA ITEM SETS

DATE :

AIM
a) To write and implement a python program to find frequent data itemsets.
b) Also, a program to find frequent elements from a list of numbers.

REQUIREMENTS
1. Python 3.7.0
2. Install: pip installer, pandas, SciPy library, apriori

CONCEPT
Association mining searches for frequent itemsets in a dataset. Frequent itemset mining
uncovers interesting associations and correlations between itemsets in transactional and
relational databases; in short, it shows which items appear together in a transaction or
relation.

ALGORITHM:
1. Initialize support threshold:
● Determine the minimum support threshold, denoted as min_support, which
represents the minimum occurrence frequency for an itemset to be considered
frequent. This value is typically defined by the user.

2. Generate frequent 1-itemsets:


● Scan the dataset to identify the frequency of each individual item (e.g.,
product, word).
● Select the items that meet the min_support threshold to form frequent
1-itemsets.

3. Generate candidate itemsets:


● Generate candidate k-itemsets (where k > 1) based on the frequent
(k-1)-itemsets found in the previous iteration.
● To generate candidates, join each frequent (k-1)-itemset with itself and
perform a pruning step to eliminate candidates that contain subsets that are
not frequent.
4. Scan the dataset:
● Scan the dataset to count the occurrence of each candidate itemset generated
in the previous step.
● Increment the support count for each candidate itemset found in the dataset.

5. Prune infrequent itemsets:


● Eliminate candidate itemsets that do not meet the min_support threshold, as
they are considered infrequent.
● The remaining candidate itemsets are considered frequent k-itemsets.

6. Repeat steps 3-5:
● Continue generating candidate itemsets, scanning the dataset, and pruning
until no new frequent itemsets are found.

7. Output frequent itemsets:


● Collect all the frequent itemsets found in the previous iterations.
● These frequent itemsets represent sets of items that occur together frequently
in the dataset.
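Part (b) of the aim, finding frequent elements from a list of numbers, is not covered by the Apriori steps above or by the program below; a minimal sketch using the standard library's collections.Counter could be:

from collections import Counter

numbers = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

# Count occurrences and keep the elements that appear at least twice
counts = Counter(numbers)
frequent = [value for value, count in counts.items() if count >= 2]

print("Counts:", counts)
print("Frequent elements:", frequent)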

PROGRAM CODE:
import pandas as pd
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

from mlxtend.preprocessing import TransactionEncoder


from mlxtend.frequent_patterns import apriori, association_rules
tr = TransactionEncoder()
tr_arr = tr.fit(dataset).transform(dataset)
df = pd.DataFrame(tr_arr, columns=tr.columns_)
df
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support = 0.6, use_colnames = True)
frequent_itemsets

OUTPUT:


RESULT:
Thus, the implementation of finding frequent data itemsets is done.

EX NO : 04 DISCRETIZATION FILTERING AND RESAMPLE FILTERING ON
SAMPLE DATA
DATE :

AIM
To write and implement a Python program to impute missing values with various
techniques on a given dataset:
a) remove rows/attributes
b) replace with the mean or mode
c) perform transformation of data using discretization (binning) on the given
dataset.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

ALGORITHM:
1. Input:
● Obtain the sample data that you want to apply discretization and resampling
filtering to.

2. Discretization Filtering:
● Choose the attribute(s) or feature(s) in the sample data that you want to
discretize.
● Determine the discretization method based on your requirements (e.g., equal
width, equal frequency, clustering).
● Apply the chosen discretization method to the selected attribute(s) and
transform the continuous values into discrete bins or intervals.

3. Resampling Filtering:
● Determine the resampling technique that you want to use (e.g.,
undersampling, oversampling, SMOTE).
● If the sample data is imbalanced (i.e., one class dominates the dataset),
consider using resampling techniques to address the class imbalance.
● Apply the selected resampling technique to the sample data, generating a
balanced or adjusted dataset.

4. Output:
● The output of the algorithm will be the filtered and transformed sample data,
where the selected attributes have been discretized and any class imbalance
has been addressed through resampling.
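The program below uses equal-width binning and time-based resampling; as a complement, a short sketch of equal-frequency binning and simple random oversampling of a minority class (illustrative data and labels, not the record's dataset) might look like this:

import pandas as pd

df = pd.DataFrame({'temperature': [24.9, 25.2, 25.5, 25.6, 25.7, 26.0, 26.2, 26.5, 27.1, 27.6],
                   'label': ['hot', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'hot', 'hot']})

# Equal-frequency (quantile) binning: each bin holds roughly the same number of rows
df['temp_bin'] = pd.qcut(df['temperature'], q=3, labels=['low', 'medium', 'high'])

# Simple random oversampling of the minority class to balance the labels
counts = df['label'].value_counts()
minority = counts.idxmin()
extra = df[df['label'] == minority].sample(counts.max() - counts.min(),
                                           replace=True, random_state=42)
df_balanced = pd.concat([df, extra], ignore_index=True)

print(df['temp_bin'].value_counts())
print(df_balanced['label'].value_counts())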

PROGRAM CODE:
import pandas as pd
# Create a sample dataset with continuous values
data = {'timestamp': pd.date_range('2022-01-01', periods=100, freq='H'),
'temperature': [25.5, 26.0, 25.7, 24.9, 25.2, 25.6, 26.2, 26.5, 27.1, 27.6] * 10}
df = pd.DataFrame(data)

# Define the number of bins for discretization


num_bins = 3

# Perform discretization on the 'temperature' column using equal width binning


df['temperature_discretized'] = pd.cut(df['temperature'], bins=num_bins,
labels=[f'temp_{i}' for i in range(num_bins)])

# Resample the dataframe to hourly frequency using mean aggregation


# (only the numeric temperature column is averaged; the discretized labels cannot be averaged)
df_resampled = df.resample('H', on='timestamp')['temperature'].mean()

# Print the original and resampled dataframes


print('Original Data:')
print(df.head())
print('\nResampled Data:')
print(df_resampled.head())

OUTPUT:

RESULT:
Hence the implementation of discretization filtering and resample filtering on sample
data is done.

EX NO : 05 CONSTRUCTION OF DECISION TREE

DATE :

AIM
To write and implement a python program to classify using a decision tree model.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
c) Download dataset from kaggle or uci repository

CONCEPT
Decision tree is a type of supervised learning algorithm (having a predefined target
variable) that is mostly used in classification problems. It works for both categorical and
continuous input and output variables. In this technique, we split the population or sample
into two or more homogeneous sets (or sub-populations) based on the most significant
splitter/differentiator among the input variables.

ALGORITHM
1. Input:
● Obtain the training dataset consisting of labeled examples, where each
example is represented by a set of attributes and a corresponding class label.

2. Select the root node:


● Determine the attribute that best splits the dataset based on a certain criterion
(e.g., information gain, Gini index, gain ratio).
● Set this attribute as the root node of the decision tree.

3. Split the dataset:


● Split the dataset into subsets based on the possible values of the chosen
attribute.
● Create a branch for each value of the attribute from the root node.

4. Recursion:
● For each branch created in the previous step:
● If all examples in the subset belong to the same class, create a leaf
node with that class label.
● If the subset is empty, create a leaf node with the majority class label
from the parent node's examples.
● Otherwise, select the best attribute from the remaining attributes and
recursively repeat steps 3-4 for the subset.

5. Output:
● The decision tree is constructed once all the attributes have been exhausted or
a stopping condition is met (e.g., reaching a maximum depth, minimum
number of examples per leaf node).
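To make the splitting criterion of step 2 concrete, here is a small sketch (toy labels, separate from the iris program below) that computes entropy, Gini impurity, and the information gain of one candidate split:

import numpy as np
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p^2) over the class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = ['yes'] * 9 + ['no'] * 5
left = ['yes'] * 6 + ['no'] * 1
right = ['yes'] * 3 + ['no'] * 4

# Information gain = parent entropy - weighted entropy of the child subsets
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)
print("Entropy:", round(entropy(parent), 3),
      "Gini:", round(gini(parent), 3),
      "Gain:", round(gain, 3))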

PROGRAM CODE:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# Loading the dataset


iris = load_iris()

#converting the data to a pandas dataframe


data = pd.DataFrame(data = iris.data, columns = iris.feature_names)

#creating a separate column for the target variable of iris dataset


data['Species'] = iris.target

#replacing the categories of target variable with the actual names of the species
target = np.unique(iris.target)
target_n = np.unique(iris.target_names)
target_dict = dict(zip(target, target_n))
data['Species'] = data['Species'].replace(target_dict)

# Separating the independent dependent variables of the dataset


x = data.drop(columns = "Species")
y = data["Species"]
names_features = x.columns
target_labels = y.unique()

# Splitting the dataset into training and testing datasets


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state =
93)

# Importing the Decision Tree classifier class from sklearn


from sklearn.tree import DecisionTreeClassifier

# Creating an instance of the classifier class
dtc = DecisionTreeClassifier(max_depth = 3, random_state = 93)

# Fitting the training dataset to the model


dtc.fit(x_train, y_train)

# Plotting the Decision Tree


plt.figure(figsize = (30, 10), facecolor = 'b')
Tree = tree.plot_tree(dtc, feature_names = names_features, class_names =
target_labels, rounded = True, filled = True, fontsize = 14)
plt.show()
y_pred = dtc.predict(x_test)

# Finding the confusion matrix


confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
matrix = pd.DataFrame(confusion_matrix)
sns.set(font_scale = 1.3)
# Create the figure before taking its axes so the heatmap is drawn on this figure
plt.figure(figsize = (10,7))
axis = plt.axes()

# Plotting heatmap
sns.heatmap(matrix, annot = True, fmt = "g", ax = axis, cmap = "magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize = 10)
axis.set_xticklabels(list(target_labels))
axis.set_ylabel( "True Labels", fontsize = 10)
axis.set_yticklabels(list(target_labels), rotation = 0)
plt.show()

OUTPUT:

RESULT:
Thus, a Python program to classify data using a decision tree model has been implemented.

EX NO : 06 IMPLEMENTATION OF APRIORI ALGORITHM WITH
DIFFERENT SUPPORT AND CONFIDENCE VALUES
DATE :

AIM
To write and implement a python program to perform Market Basket Analysis using
the Apriori algorithm with different support and confidence values.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, scikit-learn library
c) Dataset: Groceries dataset.csv

CONCEPT
The Apriori algorithm is the most popular algorithm for mining association rules.
It mines frequent itemsets and learns association rules between items over large datasets,
based on three important factors:

❖ Support: the probability that X and Y occur together.

❖ Confidence: the conditional probability of Y given X; in other words, how often
Y occurs when X has already occurred.
❖ Lift: the ratio of the observed confidence to the expected confidence, i.e.
lift = support(X and Y) / (support(X) x support(Y)). A lift of 2 means that Y is twice as
likely to be bought when X is bought as it is in general.

The algorithm identifies the frequent individual items in the database and extends
them to larger and larger itemsets as long as those itemsets appear sufficiently often in the
database.

ALGORITHM

1. Start with itemsets containing just a single item (individual items).

2. Determine the support for the itemsets.
3. Keep the itemsets that meet the minimum support threshold and remove the itemsets
that do not.
4. Using the itemsets kept in step 3, generate all the possible larger itemset
combinations.
5. Repeat steps 2-4 until there are no new itemsets (a supplementary sketch follows below).
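The record's program below enumerates frequent itemsets for a single support threshold; to observe different support and confidence values as the aim requires, a supplementary sketch using the mlxtend API already used in Experiments 3 and 7 (an additional dependency, with toy transactions rather than the Groceries dataset) could be:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['milk', 'bread', 'eggs'], ['bread', 'cheese'],
                ['milk', 'bread', 'cheese'], ['milk', 'bread', 'eggs', 'cheese'],
                ['milk', 'eggs']]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Compare the rule sets obtained at two support/confidence settings
for min_support, min_confidence in [(0.4, 0.6), (0.6, 0.8)]:
    itemsets = apriori(df, min_support=min_support, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=min_confidence)
    print(f"support >= {min_support}, confidence >= {min_confidence}: "
          f"{len(itemsets)} itemsets, {len(rules)} rules")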

PROGRAM CODE:
import itertools
def frequent_itemsets(transactions, support_threshold):
"""
Finds frequent itemsets in the given transactions, using the Apriori algorithm.
"""
items = sorted(set(item for transaction in transactions for item in transaction))
freq_sets = []
for k in range(1, len(items) + 1):
candidate_sets = itertools.combinations(items, k)
freq_sets_k = []
for c in candidate_sets:
support = sum(1 for transaction in transactions if set(c).issubset(transaction))
if support >= support_threshold:
freq_sets_k.append(c)
freq_sets.extend(freq_sets_k)
return freq_sets
transactions = [
{'milk', 'bread', 'eggs'},
{'bread', 'cheese'},
{'milk', 'bread', 'cheese'},
{'milk', 'bread', 'eggs', 'cheese'},
{'milk', 'eggs'}
]
support_threshold = 3
freq_sets = frequent_itemsets(transactions, support_threshold)
print("Frequent itemsets:")
for freq_set in freq_sets:
print(freq_set)

OUTPUT:

RESULT:
Thus, the implementation of the Apriori algorithm with different support and confidence
values is done.

EX NO : 07 DERIVING INTERESTING INSIGHTS AND OBSERVE THE
EFFECT OF DISCRETIZATION IN THE RULE GENERATED
USING APRIORI ALGORITHM
DATE :

AIM
To derive interesting insights and observe the effect of discretization in the rule
generated using Apriori algorithm.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

ALGORITHM
To derive interesting insights and observe the effect of discretization in the rules
generated using the Apriori algorithm, you can follow the following algorithmic steps:
1. Input:
➢ Obtain the dataset containing transactional or itemset data.
2. Discretization:
➢ Select the attribute(s) in the dataset that you want to discretize.
➢ Determine the appropriate discretization method based on the nature of the
attribute(s) and your analysis goals (e.g., equal width, equal frequency,
clustering).
➢ Apply the chosen discretization method to transform the continuous
attribute(s) into discrete bins or intervals.
3. Apply the Apriori algorithm:
➢ Use the discretized dataset as input for the Apriori algorithm to generate
frequent itemsets and association rules.
➢ Set the desired support threshold and confidence threshold for mining the
frequent itemsets and association rules.
4. Analyze generated rules:
➢ Examine the association rules generated by the Apriori algorithm, including
the antecedent, consequent, support, and confidence measures.
➢ Identify the interesting insights based on the support, confidence, and lift (or
other interestingness measures) of the rules.
➢ Explore patterns, dependencies, or relationships among the discrete
attribute(s) and other attributes in the rules.

5. Observe the effect of discretization:


➢ Compare and analyze the association rules derived from the discretized
dataset with the rules derived from the original continuous dataset (before
discretization).

➢ Observe the differences in the patterns, support, confidence, or other
interestingness measures between the two sets of rules.
➢ Evaluate the impact of discretization on the discovered insights, the quality of
the rules, and the interpretability of the results.
6. Iterate and refine:
➢ Depending on the observed effects and the objectives of your analysis,
iteratively refine the discretization process and the mining parameters (e.g.,
support threshold, confidence threshold) to uncover more meaningful insights
or explore different aspects of the data.
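As a small illustration of step 2 (toy rows, not the online-retail program below), discretizing a continuous column before mining changes the "items" that Apriori sees, so the generated rules can distinguish, for example, bulk purchases from single purchases:

import pandas as pd

df = pd.DataFrame({'InvoiceNo': [1, 1, 2, 2, 3],
                   'Description': ['MUG', 'PEN', 'MUG', 'LAMP', 'PEN'],
                   'Quantity': [2, 48, 3, 60, 1]})

# Discretize the continuous Quantity column into labelled equal-width bins
df['QuantityBand'] = pd.cut(df['Quantity'], bins=3, labels=['low', 'medium', 'high'])

# Each item becomes "product + quantity band" before transaction encoding
df['Item'] = df['Description'] + '_' + df['QuantityBand'].astype(str)
print(df[['InvoiceNo', 'Item']])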

PROGRAM CODE:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Load the sample dataset (online retail)


data = pd.read_excel("http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx")

# Clean the dataset (remove missing values)


data.dropna(inplace=True)

# Convert the dataset into a transactional format


basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))

# Convert the quantities to binary values


def encode_units(x):
if x <= 0:
return 0
else:
return 1

basket_sets = basket.applymap(encode_units)

# Generate frequent itemsets using different support values


frequent_itemsets_01 = apriori(basket_sets, min_support=0.01, use_colnames=True)
frequent_itemsets_03 = apriori(basket_sets, min_support=0.03, use_colnames=True)
frequent_itemsets_05 = apriori(basket_sets, min_support=0.05, use_colnames=True)

# Generate association rules using different confidence values
rules_50 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.5)
rules_60 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.6)
rules_70 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.7)

# Print the results


print("Frequent itemsets with minimum support 0.01:\n", frequent_itemsets_01)
print("\nFrequent itemsets with minimum support 0.03:\n", frequent_itemsets_03)
print("\nFrequent itemsets with minimum support 0.05:\n", frequent_itemsets_05)
print("\nAssociation rules with minimum confidence 0.5:\n", rules_50)
print("\nAssociation rules with minimum confidence 0.6:\n", rules_60)
print("\nAssociation rules with minimum confidence 0.7:\n", rules_70)

OUTPUT:

Frequent itemsets with minimum support 0.01:


support itemsets
0 0.010906 (10 COLOUR SPACEBOY PEN)
1 0.012438 (12 PENCIL SMALL TUBE WOODLAND)
2 0.013835 (12 PENCILS SMALL TUBE RED RETROSPOT)
3 0.013114 (12 PENCILS SMALL TUBE SKULL)
4 0.010545 (12 PENCILS TALL TUBE RED RETROSPOT)
.. ... ...
714 0.010320 (LUNCH BAG RED RETROSPOT, LUNCH BAG SUKI DESIG...
715 0.011627 (LUNCH BAG RED RETROSPOT, LUNCH BAG SPACEBOY D...
716 0.010005 (LUNCH BAG RED RETROSPOT, LUNCH BAG SUKI DESIG...
717 0.011942 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC...
718 0.010771 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE...
[719 rows x 2 columns]

Frequent itemsets with minimum support 0.03:


support itemsets
0 0.032717 (6 RIBBONS RUSTIC CHARM)
1 0.035557 (ALARM CLOCK BAKELIKE GREEN)
2 0.039522 (ALARM CLOCK BAKELIKE RED )
3 0.061965 (ASSORTED COLOUR BIRD ORNAMENT)
4 0.038486 (BAKING SET 9 PIECE RETROSPOT )
5 0.031546 (CHOCOLATE HOT WATER BOTTLE)
6 0.034069 (GARDENERS KNEELING PAD KEEP CALM )
7 0.031140 (GREEN REGENCY TEACUP AND SAUCER)
8 0.036097 (HEART OF WICKER LARGE)
9 0.043308 (HEART OF WICKER SMALL)
10 0.031140 (HOME BUILDING BLOCK WORD)
11 0.032402 (HOT WATER BOTTLE KEEP CALM)
12 0.039117 (JAM MAKING SET PRINTED)
13 0.039612 (JAM MAKING SET WITH JARS)
14 0.033889 (JUMBO BAG ALPHABET)
15 0.032357 (JUMBO BAG APPLES)
16 0.039252 (JUMBO BAG PINK POLKADOT)
17 0.072105 (JUMBO BAG RED RETROSPOT)
18 0.034790 (JUMBO BAG VINTAGE LEAF)
19 0.035602 (JUMBO SHOPPER VINTAGE RED PAISLEY)
20 0.034881 (JUMBO STORAGE BAG SUKI)

Frequent itemsets with minimum support 0.05:


support itemsets
0 0.061965 (ASSORTED COLOUR BIRD ORNAMENT)
1 0.072105 (JUMBO BAG RED RETROSPOT)
2 0.058044 (LUNCH BAG RED RETROSPOT)
3 0.062190 (PARTY BUNTING)
4 0.076791 (REGENCY CAKESTAND 3 TIER)
5 0.051645 (SET OF 3 CAKE TINS PANTRY DESIGN )
6 0.088824 (WHITE HANGING HEART T-LIGHT HOLDER)

Association rules with minimum confidence 0.5: antecedents \
0 (60 TEATIME FAIRY CAKE CASES)
1 (ALARM CLOCK BAKELIKE CHOCOLATE)
2 (ALARM CLOCK BAKELIKE IVORY)
3 (ALARM CLOCK BAKELIKE PINK)
4 (ALARM CLOCK BAKELIKE RED )
.. ...
167 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE...
168 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY...
169 (ROSES REGENCY TEACUP AND SAUCER , REGENCY CAK...
170 (PINK REGENCY TEACUP AND SAUCER, REGENCY CAKES...
171 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC...
consequents antecedent support \
0 (PACK OF 72 RETROSPOT CAKE CASES) 0.029608
1 (ALARM CLOCK BAKELIKE RED ) 0.014511
2 (ALARM CLOCK BAKELIKE GREEN) 0.021091
3 (ALARM CLOCK BAKELIKE GREEN) 0.027625
4 (ALARM CLOCK BAKELIKE GREEN) 0.039522
.. ... ...
167 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC... 0.016854
168 (ROSES REGENCY TEACUP AND SAUCER , REGENCY CAK... 0.020730
169 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY... 0.018927
170 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC... 0.013925
171 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE... 0.019648
consequent support support confidence lift leverage conviction
0 0.046372 0.014872 0.502283 10.831547 0.013499 1.916004
1 0.039522 0.010185 0.701863 17.758663 0.009611 3.221602
2 0.035557 0.012393 0.587607 16.525977 0.011643 2.338650
3 0.035557 0.015457 0.559543 15.736710 0.014475 2.189644
4 0.035557 0.023885 0.604333 16.996386 0.022479 2.437513
.. ... ... ... ... ... ...
167 0.019648 0.010771 0.639037 32.523488 0.010439 2.715937
168 0.018927 0.010771 0.519565 27.450362 0.010378 2.042051
169 0.020730 0.010771 0.569048 27.450362 0.010378 2.272339
170 0.024380 0.010771 0.773463 31.724841 0.010431 4.306664
171 0.016854 0.010771 0.548165 32.523488 0.010439 2.175896
[172 rows x 9 columns]
Association rules with minimum confidence 0.6:
antecedents \
0 (ALARM CLOCK BAKELIKE CHOCOLATE)
1 (ALARM CLOCK BAKELIKE RED )
2 (ALARM CLOCK BAKELIKE GREEN)
3 (ALARM CLOCK BAKELIKE IVORY)
4 (ALARM CLOCK BAKELIKE ORANGE)
.. ...
79 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY...
80 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC...
81 (PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY...
82 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE...
83 (PINK REGENCY TEACUP AND SAUCER, REGENCY CAKES...

consequents antecedent support \
0 (ALARM CLOCK BAKELIKE RED ) 0.014511
1 (ALARM CLOCK BAKELIKE GREEN) 0.039522
2 (ALARM CLOCK BAKELIKE RED ) 0.035557
3 (ALARM CLOCK BAKELIKE RED ) 0.021091
4 (ALARM CLOCK BAKELIKE RED ) 0.016178
.. ... ...
79 (ROSES REGENCY TEACUP AND SAUCER ) 0.012213
80 (REGENCY CAKESTAND 3 TIER) 0.017575
81 (GREEN REGENCY TEACUP AND SAUCER) 0.011942
82 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC... 0.016854
83 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC... 0.013925

consequent support support confidence lift leverage conviction


0 0.039522 0.010185 0.701863 17.758663 0.009611 3.221602
1 0.035557 0.023885 0.604333 16.996386 0.022479 2.437513
2 0.039522 0.023885 0.671736 16.996386 0.022479 2.925934
3 0.039522 0.014151 0.670940 16.976240 0.013317 2.918854
4 0.039522 0.011086 0.685237 17.337975 0.010447 3.051429
.. ... ... ... ... ... ...
79 0.035286 0.010771 0.881919 24.993332 0.010340 8.169920
80 0.076791 0.010771 0.612821 7.980333 0.009421 2.384446
81 0.031140 0.010771 0.901887 28.962182 0.010399 9.874918
82 0.019648 0.010771 0.639037 32.523488 0.010439 2.715937
83 0.024380 0.010771 0.773463 31.724841 0.010431 4.306664
[84 rows x 9 columns]

Association rules with minimum confidence 0.7: antecedents \


0 (ALARM CLOCK BAKELIKE CHOCOLATE)
1 (PAINTED METAL PEARS ASSORTED)
2 (BAKING SET SPACEBOY DESIGN)
3 (TOILET METAL SIGN)
4 (PINK HAPPY BIRTHDAY BUNTING)
5 (BLUE HAPPY BIRTHDAY BUNTING)
6 (CANDLEHOLDER PINK HANGING HEART)
7 (GARDENERS KNEELING PAD CUP OF TEA )
8 (PINK REGENCY TEACUP AND SAUCER)
9 (GREEN REGENCY TEACUP AND SAUCER)
10 (PINK REGENCY TEACUP AND SAUCER)
11 (POPPY'S PLAYHOUSE KITCHEN)
12 (POPPY'S PLAYHOUSE BEDROOM )
13 (REGENCY TEA PLATE GREEN )
14 (SET/6 RED SPOTTY PAPER PLATES)
15 (SET/6 RED SPOTTY PAPER CUPS)
16 (SET/6 RED SPOTTY PAPER PLATES)
17 (SMALL MARSHMALLOWS PINK BOWL)
18 (WOODEN STAR CHRISTMAS SCANDINAVIAN)
19 (WOODEN TREE CHRISTMAS SCANDINAVIAN)
20 (ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELI...

consequents antecedent support \
0 (ALARM CLOCK BAKELIKE RED ) 0.014511
1 (ASSORTED COLOUR BIRD ORNAMENT) 0.015863
2 (BAKING SET 9 PIECE RETROSPOT ) 0.019108
3 (BATHROOM METAL SIGN) 0.013610
4 (BLUE HAPPY BIRTHDAY BUNTING) 0.016178
5 (PINK HAPPY BIRTHDAY BUNTING) 0.015998
6 (WHITE HANGING HEART T-LIGHT HOLDER) 0.015457
7 (GARDENERS KNEELING PAD KEEP CALM ) 0.028616
8 (GREEN REGENCY TEACUP AND SAUCER) 0.025056
9 (ROSES REGENCY TEACUP AND SAUCER ) 0.031140
10 (ROSES REGENCY TEACUP AND SAUCER ) 0.025056
11 (POPPY'S PLAYHOUSE BEDROOM ) 0.015593
12 (POPPY'S PLAYHOUSE KITCHEN) 0.014241
13 (REGENCY TEA PLATE ROSES ) 0.012168
14 (SET/20 RED RETROSPOT PAPER NAPKINS ) 0.014646
15 (SET/6 RED SPOTTY PAPER PLATES) 0.012844
16 (SET/6 RED SPOTTY PAPER CUPS) 0.014646
17 (SMALL DOLLY MIX DESIGN ORANGE BOWL) 0.012979
18 (WOODEN HEART CHRISTMAS SCANDINAVIAN) 0.019513
19 (WOODEN STAR CHRISTMAS SCANDINAVIAN) 0.012438
20 (ALARM CLOCK BAKELIKE RED ) 0.015457

consequent support support confidence lift leverage conviction


0 0.039522 0.010185 0.701863 17.758663 0.009611 3.221602
1 0.061965 0.011492 0.724432 11.691012 0.010509 3.404004
2 0.038486 0.014015 0.733491 19.058730 0.013280 3.607805
3 0.017260 0.010140 0.745033 43.165234 0.009905 3.854383
4 0.015998 0.011447 0.707521 44.225038 0.011188 3.364349
5 0.016178 0.011447 0.715493 44.225038 0.011188 3.457987
6 0.088824 0.011402 0.737609 8.304186 0.010029 3.472594
7 0.034069 0.020865 0.729134 21.401429 0.019890 3.566081
8 0.031140 0.020730 0.827338 26.568210 0.019950 5.611313
9 0.035286 0.024380 0.782923 22.187826 0.023282 4.444115
10 0.035286 0.019648 0.784173 22.223233 0.018764 4.469841
11 0.014241 0.011447 0.734104 51.549901 0.011225 3.707312
12 0.015593 0.011447 0.803797 51.549901 0.011225 5.017302
13 0.014781 0.010320 0.848148 57.379291 0.010140 6.488025
14 0.032988 0.010320 0.704615 21.359857 0.009837 3.273739
15 0.014646 0.010635 0.828070 56.538084 0.010447 5.731139
16 0.012844 0.010635 0.726154 56.538084 0.010447 3.604785
17 0.016269 0.010140 0.781250 48.021988 0.009929 4.497058
18 0.020865 0.014376 0.736721 35.308486 0.013969 3.718994
19 0.019513 0.010185 0.818841 41.963216 0.009942 5.412287
20 0.039522 0.012032 0.778426 19.695856 0.011422 4.334787

RESULT:
Thus, interesting insights were derived and the effect of discretization on the rules
generated using the Apriori algorithm was observed.

EX NO : 08 IMPLEMENTATION OF ID3 CLASSIFICATION ALGORITHM

DATE :

AIM
To write and implement a python program to classify a dataset using ID3 algorithm.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

CONCEPT
ID3 decision trees use a greedy search approach to determine decision node
selection, meaning that it picks an ideal attribute once and does not reconsider or modify its
previous choices. ID3 algorithms use entropy and information gain to determine which
attributes best split the data. In other words, if an attribute perfectly classifies the set of data
then ID3 training stops; otherwise, it recursively iterates over the n number of subsets of
data for that attribute until the subset becomes pure. For this reason, the process of decision
node selection is fundamental in constructing an ID3 algorithm.

ALGORITHM

1. Compute the entropy for data-set


2. for every attribute/feature:
➢ Calculate entropy for all categorical values
➢ Take average information entropy for the current attribute
➢ Calculate gain for the current attribute
3. Pick the attribute with the highest gain as the decision node.
4. Repeat on each resulting subset until the desired tree is obtained (all subsets are pure
or no attributes remain).
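As a worked illustration of steps 1 and 2 (with a hypothetical set of 14 training examples, 9 labelled positive and 5 negative): the entropy of the set is -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940 bits. The information gain of a candidate attribute is this value minus the weighted average entropy of the subsets that the attribute produces, and the attribute with the largest gain becomes the next decision node.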

PROGRAM CODE:
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing Dataset


def importdata():
balance_data = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-'+
'databases/balance-scale/balance-scale.data',
sep= ',', header = None)

# Printing the dataset shape


print ("Dataset Length: ", len(balance_data))
print ("Dataset Shape: ", balance_data.shape)

# Printing the dataset observations


print ("Dataset: ",balance_data.head())
return balance_data

# Function to split the dataset


def splitdataset(balance_data):

# Separating the target variable


X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]

# Splitting the dataset into train and test


X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size = 0.3, random_state = 100)

return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with giniIndex.


def train_using_gini(X_train, X_test, y_train):

# Creating the classifier object


clf_gini = DecisionTreeClassifier(criterion = "gini",
random_state = 100,max_depth=3, min_samples_leaf=5)
# Performing training
clf_gini.fit(X_train, y_train)
return clf_gini
# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
# Decision tree with entropy
clf_entropy = DecisionTreeClassifier(

criterion = "entropy", random_state = 100,
max_depth = 3, min_samples_leaf = 5)

# Performing training
clf_entropy.fit(X_train, y_train)
return clf_entropy

# Function to make predictions


def prediction(X_test, clf_object):
# Prediction on the test set with the given classifier
y_pred = clf_object.predict(X_test)
print("Predicted values:")
print(y_pred)
return y_pred

# Function to calculate accuracy


def cal_accuracy(y_test, y_pred):
print("Confusion Matrix: ",
confusion_matrix(y_test, y_pred))
print ("Accuracy : ",
accuracy_score(y_test,y_pred)*100)
print("Report : ",
classification_report(y_test, y_pred))

# Driver code
def main():
# Building Phase
data = importdata()
X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
clf_gini = train_using_gini(X_train, X_test, y_train)
clf_entropy = train_using_entropy(X_train, X_test, y_train)
# Operational Phase
print("Results Using Gini Index:")
# Prediction using gini
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)
print("Results Using Entropy:")
# Prediction using entropy
y_pred_entropy = prediction(X_test, clf_entropy)
cal_accuracy(y_test, y_pred_entropy)

# Calling main function


if __name__=="__main__":
main()

OUTPUT:

RESULT:
Hence, a Python program to classify a dataset using the ID3 algorithm has been
implemented.

EX NO : 09 NAIVE-BAYES CLASSIFICATION AND K-NEAREST NEIGHBOR
CLASSIFICATION
DATE :

AIM
To write and implement a Python program for Naïve Bayes classification and k-Nearest
Neighbor classification.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
c) Dataset: titanic dataset

CONCEPT
Naive Bayes is a classification technique based on Bayes' theorem with the
assumption that all feature variables are independent. A Naive Bayes classifier therefore
assumes that the presence of a particular feature in a class is unrelated to the presence of
any other feature. The classifier combines three terms to compute the probability of a
class: the class probability in the dataset, multiplied by the probability of the example's
feature values occurring given that class, divided by the probability of those feature values
occurring in general. To compute the probability of particular feature values occurring,
there are three main options. One can assume that the value of a particular variable is
Gaussian distributed, which is a common case and useful when the variables are real
numbers. A multinomial model is suited to categorical feature variables, as it computes the
probability from frequency counts (histogram bins). The final option is a Bernoulli
probability model, used when the data is binary. Naive Bayes is simple and easy to use, yet
can outperform more complex classification algorithms. It is computationally fast and thus
well suited to large datasets.

ALGORITHM

1. Input:
➢ Obtain a labeled training dataset consisting of examples with their
corresponding class labels.
➢ Preprocess the data as needed (e.g., handle missing values, normalize
features).

2. Calculate class probabilities:


➢ Compute the prior probability of each class label by counting the frequency
of each class in the training dataset.
➢ Calculate the total number of training examples.

3. Calculate feature probabilities:
For each feature in the dataset:
➢ Calculate the likelihood of the feature given each class label using the
training examples.
➢ Compute the conditional probability of the feature given the class by
counting the frequency of the feature-value pairs within each class.

4. Apply the Naive Bayes classifier:


For each unlabeled example in the test dataset:
➢ Calculate the posterior probability of each class label given the example
using Bayes' theorem.
➢ Multiply the prior probability of the class with the conditional probabilities of
the features given the class to obtain the joint probability.
➢ Normalize the joint probabilities for all classes to sum up to 1 by dividing
each joint probability by the sum of all joint probabilities.

5. Make predictions:
➢ Assign the class label with the highest posterior probability as the predicted
class for each test example.
6. Output:
➢ The output of the algorithm will be the predicted class labels for the test
examples.

7. For getting the predicted class, iterate from 1 to total number of training data points
8. Calculate the distance between test data and each row of training data. Here we will
use Euclidean distance as our distance metric since it’s the most popular method. The
other metrics that can be used are Chebyshev, cosine, etc.
9. Sort the calculated distances in ascending order based on distance values
10. Get top k rows from the sorted array
11. Get the most frequent class of these rows
12. Return the predicted class
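To make steps 8-11 concrete, a minimal from-scratch sketch of a single k-NN prediction (toy points, separate from the iris program below) might be:

import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array(['A', 'A', 'B', 'B'])
x_test = np.array([1.1, 1.1])
k = 3

# Step 8: Euclidean distance from the test point to every training point
distances = np.linalg.norm(X_train - x_test, axis=1)

# Steps 9-10: sort by distance and take the labels of the k nearest points
nearest_labels = y_train[np.argsort(distances)[:k]]

# Steps 11-12: majority vote among the k neighbours gives the predicted class
predicted = Counter(nearest_labels).most_common(1)[0][0]
print("Predicted class:", predicted)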

PROGRAM CODE:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset


iris = load_iris()

# Split the dataset into training and test sets


X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2,
random_state=42)

# Train the Naive Bayes classifier


clf = GaussianNB()
clf.fit(X_train, y_train)

# Make predictions on the test set


y_pred = clf.predict(X_test)

# Print the accuracy score


print("Accuracy score:", accuracy_score(y_test, y_pred))

from sklearn.neighbors import KNeighborsClassifier


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset


iris = load_iris()

# Split the dataset into training and test sets


X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2,
random_state=42)

# Train the k-Nearest Neighbor classifier


clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Make predictions on the test set


y_pred = clf.predict(X_test)

# Print the accuracy score


print("Accuracy score:", accuracy_score(y_test, y_pred))

OUTPUT:

RESULT:
Thus the implementation of naïve bayes classification and k-nearest neighbor
classification is done.

EX NO : 10 COMPARE THE PERFORMANCE OF ID3, NAÏVE BAYES AND
K-NEAREST NEIGHBOR CLASSIFICATION ALGORITHMS
DATE :

AIM
To compare the Performance of ID3, Naïve-Bayes and K-Nearest Neighbor
Classification algorithms using python program.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library

ALGORITHM

1. Input:
➢ Obtain a labeled dataset consisting of examples with their corresponding
class labels.
➢ Preprocess the data as needed (e.g., handle missing values, normalize
features).
➢ Split the dataset into training and testing sets for evaluation.
2. Train the models:
➢ For each algorithm (ID3, Naïve Bayes, k-NN):
➢ Initialize the model with appropriate parameters.
➢ Train the model using the training dataset.
3. Test the models:
For each algorithm:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Compare the predicted class labels with the true class labels.
4. Evaluate performance metrics:
➢ Calculate evaluation metrics to compare the performance of the algorithms.
➢ Commonly used metrics include accuracy, precision, recall, F1 score, and
area under the ROC curve (AUC-ROC).
➢ Calculate and compare the performance metrics for each algorithm.
5. Repeat and validate:
➢ Perform steps 2-4 multiple times using different training and testing splits
(e.g., cross-validation) to validate the results and reduce bias.
6. Analyze and compare results:
➢ Compare the performance metrics obtained from each algorithm.
➢ Look for patterns or trends in the results to identify which algorithm performs
better overall or in specific scenarios.
➢ Consider factors such as accuracy, computational efficiency, interpretability,
and requirements of your specific problem.

7. Fine-tuning:
➢ Explore parameter tuning for each algorithm to improve their performance
further.
➢ Adjust algorithm-specific parameters to find the optimal settings for your
dataset.
8. Select the best algorithm:
➢ Based on the performance metrics, select the algorithm that provides the best
results for your particular classification task.

PROGRAM CODE:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset


iris = load_iris()

# Split the dataset into training and test sets


X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2,
random_state=42)

# Train the ID3 decision tree classifier


clf_id3 = DecisionTreeClassifier(criterion="entropy")
clf_id3.fit(X_train, y_train)

# Train the Naive Bayes classifier


clf_nb = GaussianNB()
clf_nb.fit(X_train, y_train)

# Train the k-Nearest Neighbor classifier


clf_knn = KNeighborsClassifier(n_neighbors=3)
clf_knn.fit(X_train, y_train)
# Make predictions on the test set using each classifier
y_pred_id3 = clf_id3.predict(X_test)
y_pred_nb = clf_nb.predict(X_test)
y_pred_knn = clf_knn.predict(X_test)

# Print the accuracy scores for each classifier


print("ID3 accuracy score:", accuracy_score(y_test, y_pred_id3))
print("Naive Bayes accuracy score:", accuracy_score(y_test, y_pred_nb))
print("k-Nearest Neighbor accuracy score:", accuracy_score(y_test, y_pred_knn))

OUTPUT:

RESULT:
Thus, the comparison of the ID3, Naïve Bayes and K-Nearest Neighbor classification
algorithms is performed using a Python program.

EX NO : 11 IMPLEMENTATION OF K-MEANS CLUSTERING ALGORITHM

DATE :

AIM
To write and implement a python program for k-means clustering algorithm.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library

CONCEPT
It is a type of unsupervised algorithm which solves the clustering problem. Its
procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters). Data points inside a cluster are homogeneous with
respect to one another and heterogeneous with respect to points in other clusters.

ALGORITHM

1. Input:
➢ Obtain a dataset consisting of data points that you want to cluster.
➢ Determine the value of K, the number of clusters you want to create.
2. Initialize centroids:
➢ Randomly select K data points from the dataset as initial centroid positions.
➢ These initial centroids will serve as the center points of the clusters.
3. Assign data points to clusters:
For each data point in the dataset:
➢ Calculate the distance between the data point and each centroid.
➢ Assign the data point to the cluster with the closest centroid.
4. Update centroids:
For each cluster:
➢ Calculate the new centroid position as the mean of all data points belonging
to that cluster.
➢ Update the centroid position accordingly.
5. Repeat steps 3-4:
➢ Iterate the assignment and centroid update steps until convergence or until a
maximum number of iterations is reached.
➢ Convergence is reached when the centroid positions stabilize, and the data
points no longer change their assigned clusters significantly.
6. Output:
➢ The output of the algorithm is the final set of K clusters, along with their
respective data points.

PROGRAM CODE:
import numpy as np
import random

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

class KMeans:
    def __init__(self, k, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations

    def fit(self, X):
        # Initialize centroids with k randomly chosen data points
        self.centroids = []
        for i in range(self.k):
            self.centroids.append(X[random.randint(0, len(X)-1)])

        for iteration in range(self.max_iterations):
            # Initialize empty clusters
            clusters = [[] for _ in range(self.k)]

            # Assign each data point to the nearest centroid
            for xi in X:
                distances = [np.linalg.norm(xi - c) for c in self.centroids]
                cluster_index = np.argmin(distances)
                clusters[cluster_index].append(xi)

            # Recalculate centroids as the mean of all data points in the cluster;
            # an empty cluster is re-seeded with one of the existing centroids
            new_centroids = []
            for cluster in clusters:
                if cluster:
                    new_centroids.append(np.mean(cluster, axis=0))
                else:
                    new_centroids.append(self.centroids[random.randint(0, len(self.centroids)-1)])

            # Check if the centroids have changed, and stop if they haven't
            if np.allclose(self.centroids, new_centroids):
                break
            else:
                self.centroids = new_centroids

        # Store centroids as an array so they can be indexed and plotted easily
        self.centroids = np.array(self.centroids)

    def predict(self, X):
        y_pred = []
        for xi in X:
            distances = [np.linalg.norm(xi - c) for c in self.centroids]
            cluster_index = np.argmin(distances)
            y_pred.append(cluster_index)
        return y_pred

# Generate sample data
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=42)

# Initialize and fit the model
model = KMeans(k=4)
model.fit(X)

# Predict cluster labels for the data
y_pred = model.predict(X)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(model.centroids[:, 0], model.centroids[:, 1], marker='x', s=200,
            linewidths=3, color='r')
plt.show()
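The program above fixes K = 4 in advance; a brief sketch of sanity-checking that choice with the elbow method, using scikit-learn's own KMeans class (aliased here to avoid clashing with the class defined above), might be:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans as SKLearnKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=42)

# Inertia (within-cluster sum of squares) for K = 1..8; the "elbow" suggests a good K
inertias = [SKLearnKMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()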

OUTPUT:

RESULT:
Thus the implementation of k means clustering is done.

39
EX NO : 12 EXPLORING VISUALIZATION FEATURES

DATE :

AIM
To apply and explore various plotting functions using matplotlib and seaborn
packages on the Adult UCI dataset.
a) Normal curves
b) Density and contour plots
c) Correlation and scatter plots
d) Histograms
e) Three-dimensional plotting

DATASET:
The adult UCI dataset is one of the popular datasets for practice. It is a Supervised
binary classification problem that predicts whether a person makes over 50k a year. The
dataset contains a mix of categorical and numeric type data.

CONCEPT

Categorical Attributes:
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
Without-pay, Never-worked.
Individual work category
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,
Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
Individual’s highest education degree
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
Married-spouse-absent, Married-AF-spouse.
Individual marital status
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspect, Adm-clerical, Farming-fishing,
Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Individual’s occupation
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Individual’s relation in a family
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
Race of Individual
sex: Female, Male.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras,
Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala,
Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinidad Tobago, Peru, Hong,
Holland-Netherlands.
Individual’s native country

Continuous Attributes:
age: continuous.
Age of an individual
fnlwgt: final weight, continuous.
The weights on the CPS files are controlled to independent estimates of the
civilian noninstitutional population of the US. These are prepared monthly for us by
Population Division here at the Census Bureau.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
Individual’s working hours per week

PROGRAM CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Adult UCI dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
# Add column names
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
'native-country', 'income']
# Drop missing values
df = df.dropna()
# Convert categorical variables to numerical
df['workclass'] = pd.factorize(df['workclass'])[0]
df['education'] = pd.factorize(df['education'])[0]
df['marital-status'] = pd.factorize(df['marital-status'])[0]
df['occupation'] = pd.factorize(df['occupation'])[0]
df['relationship'] = pd.factorize(df['relationship'])[0]
df['race'] = pd.factorize(df['race'])[0]
df['sex'] = pd.factorize(df['sex'])[0]
df['native-country'] = pd.factorize(df['native-country'])[0]
df['income'] = pd.factorize(df['income'])[0]
# Plot a histogram of age
sns.histplot(data=df, x='age')
plt.show()
# Plot a scatterplot of age and education
sns.scatterplot(data=df, x='age', y='education')
plt.show()
# Plot a boxplot of hours-per-week by income
sns.boxplot(data=df, x='income', y='hours-per-week')
plt.show()
# Plot a heatmap of the correlation matrix
corr = df.corr()
sns.heatmap(corr, annot=True)
plt.show()
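The aim also lists normal curves, density/contour plots, and three-dimensional plotting, which the program above does not cover; a hedged sketch of those (continuing from the dataframe df prepared above, and assuming SciPy, seaborn 0.11+ and matplotlib's mplot3d toolkit are available) could be:

import numpy as np
from scipy.stats import norm
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

# Normal curve fitted to the age column (using its mean and standard deviation)
ages = np.linspace(df['age'].min(), df['age'].max(), 200)
plt.plot(ages, norm.pdf(ages, df['age'].mean(), df['age'].std()))
plt.title('Normal curve for age')
plt.show()

# Density/contour-style plot of age against hours-per-week (sampled to keep the KDE fast)
sample = df.sample(2000, random_state=1)
sns.kdeplot(data=sample, x='age', y='hours-per-week', fill=True)
plt.show()

# Three-dimensional scatter of age, education-num and hours-per-week
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(sample['age'], sample['education-num'], sample['hours-per-week'], s=2)
ax.set_xlabel('age')
ax.set_ylabel('education-num')
ax.set_zlabel('hours-per-week')
plt.show()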

OUTPUT:

RESULT:
Thus, various plotting functions from the matplotlib and seaborn packages were applied
and explored on the Adult UCI dataset.

EX NO : 13 BUILDING LINEAR REGRESSION MODEL AND DERIVE
INTERESTING PATTERNS
DATE :

AIM
To write and implement a python program for linear regression algorithm.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

CONCEPT
Linear regression is used to estimate real values (cost of houses, number of calls, total
sales, etc.) based on continuous variable(s). Here, we establish a relationship between the
independent and dependent variables by fitting a best-fit line. This best-fit line is known as
the regression line and is represented by the linear equation Y = a*X + b.

ALGORITHM

1. Input:
➢ Obtain a dataset consisting of variables/features and corresponding target
values for regression analysis.
2. Data preprocessing:
➢ Clean the data by handling missing values, outliers, and inconsistencies.
➢ Split the dataset into training and testing sets for model evaluation.
3. Feature selection:
➢ Identify relevant features that may influence the target variable by analyzing
their correlations, domain knowledge, or feature importance techniques.
➢ Select a subset of features to use in the linear regression model.
4. Train the linear regression model:
➢ Initialize the model with appropriate parameters.
➢ Fit the model to the training dataset by estimating the coefficients and
intercept that best fit the data using techniques like ordinary least squares or
gradient descent.
5. Evaluate the model:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Calculate evaluation metrics such as mean squared error, mean absolute error,
or R-squared to assess the model's performance.
6. Derive interesting patterns:
➢ Analyze the learned coefficients of the linear regression model.
➢ Identify the features with significant positive or negative coefficients,
indicating their influence on the target variable.
➢ Look for patterns or relationships between the features and the target
variable.

➢ Consider domain-specific knowledge or hypotheses to interpret the patterns
and draw meaningful insights.
7. Refine the model:
➢ Based on the patterns and insights derived, you can iteratively refine the
linear regression model.
➢ Add or remove features, perform feature engineering, or try different
regression techniques to improve the model's performance and capture more
interesting patterns.
8. Output:
➢ The output of the algorithm is the trained linear regression model, evaluation
metrics, and the derived interesting patterns

PROGRAM CODE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate some sample data
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 4, 5, 4, 5])

# Create a linear regression object
model = LinearRegression()

# Train the model on the data
model.fit(x, y)

# Get the slope and intercept of the regression line
slope = model.coef_[0]
intercept = model.intercept_

# Print the equation of the regression line
print(f'y = {slope:.2f}x + {intercept:.2f}')

# Predict the values of y for new values of x
x_new = np.array([6, 7, 8]).reshape((-1, 1))
y_new = model.predict(x_new)

# Plot the data and the regression line
plt.scatter(x, y)
plt.plot(x, slope * x + intercept, color='red')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

OUTPUT:

RESULT:
Thus, a linear regression model was built and interesting patterns were derived
successfully.

EX NO : 14 BUILDING OF MULTI LINEAR REGRESSION MODEL

DATE :

AIM
To write and implement a Python program for a multi-linear regression model.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

CONCEPT
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes known
simply as multiple regression, and it is an extension of linear regression. The variable that
we want to predict is known as the dependent variable, while the variables we use to predict
the value of the dependent variable are known as independent or explanatory variables.
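In its general form, the fitted model can be written as Y = b0 + b1·X1 + b2·X2 + … + bn·Xn, where
b0 is the intercept and b1 … bn are the coefficients estimated for the n independent variables.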

ALGORITHM

1. Input:
➢ Obtain a dataset consisting of variables/features and corresponding target
values for regression analysis.
2. Data preprocessing:
➢ Clean the data by handling missing values, outliers, and inconsistencies.
➢ Split the dataset into training and testing sets for model evaluation.
3. Feature selection:
➢ Identify relevant independent variables/features that may influence the target
variable by analyzing their correlations, domain knowledge, or feature
importance techniques.
➢ Select a subset of features to use in the multiple linear regression model.
4. Train the multiple linear regression model:
➢ Initialize the model with appropriate parameters.
➢ Fit the model to the training dataset by estimating the coefficients and
intercept that best fit the data using techniques like ordinary least squares or
gradient descent.
5. Evaluate the model:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Calculate evaluation metrics such as mean squared error, mean absolute error,
or R-squared to assess the model's performance.
6. Assess feature significance:
➢ Examine the statistical significance of the estimated coefficients for each
independent variable in the multiple linear regression model.
➢ Utilize hypothesis testing techniques or p-values to determine the
significance of each variable's contribution.

7. Refine the model:
➢ Based on the statistical significance and performance evaluation, you can
iteratively refine the multiple linear regression model.
➢ Add or remove independent variables, perform feature engineering, or try
different regression techniques to improve the model's performance and
interpretability.
8. Output:
➢ The output of the algorithm is the trained multiple linear regression model,
evaluation metrics, and the assessment of feature significance.
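
As a minimal sketch of step 6 above (assessing feature significance) — an illustrative assumption
that uses the statsmodels package, which is not part of the recorded program below — ordinary
least squares can report a p-value for every coefficient:

import numpy as np
import statsmodels.api as sm

# Small synthetic dataset: y depends strongly on the first feature, weakly on the second
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=50)

X_const = sm.add_constant(X)          # add an intercept column
ols_model = sm.OLS(y, X_const).fit()  # ordinary least squares fit
print(ols_model.params)               # estimated intercept and coefficients
print(ols_model.pvalues)              # p-values for each coefficient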

PROGRAM CODE:
import numpy as np
from sklearn.linear_model import LinearRegression

# Define the input features and target variable
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])
y = np.array([4, 8, 12, 16])

# Initialize a linear regression model
model = LinearRegression()

# Fit the model using the input features and the target variable
model.fit(X, y)

# Print the coefficients and intercept
print("Coefficients: ", model.coef_)
print("Intercept: ", model.intercept_)

# Use the model to predict the target variable for new input features
new_X = np.array([[13, 14, 15],
                  [16, 17, 18]])
predicted_y = model.predict(new_X)
print("Predicted y: ", predicted_y)

OUTPUT:

RESULT:
Thus, the Python program for the multi-linear regression model was implemented
successfully.

EX NO : 15 CLASSIFICATION AND CLUSTERING OF A SAMPLE DATA SET

DATE :

AIM
To implement a Python program to classify and cluster a student dataset from
Kaggle.

REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library, seaborn, os, matplotlib
c) Dataset link:
https://www.kaggle.com/code/yoghurtpatil/clustering-and-classification
on-student-data/data

CONCEPT
K-Means Clustering: The most common methods used for identifying clusters or classes in
unlabelled data are:
1) K-Means Clustering
2) Hierarchical Clustering.

While both are used for the same purpose, their underlying techniques are different.

Comparison:
It is natural to wonder which method to choose when performing a clustering task.
There are several points of comparison between the two: while Hierarchical Clustering is
highly interpretable by looking at the dendrograms, it has a higher time complexity, O(n^2),
compared to K-Means Clustering, which has a linear time complexity, O(n). Even when K-Means
is iterated with different initial clusters, it is more efficient for clustering large amounts
of data. On the other hand, K-Means Clustering requires the data to be continuous, while
Hierarchical Clustering can be run on categorical data by defining a similarity metric rather
than a distance.
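
As a minimal sketch of the hierarchical side of this comparison — an illustrative assumption using
SciPy (listed in the requirements), not part of the recorded program below — a dendrogram can be
drawn from Ward-linkage merges:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method='ward')     # agglomerative merges with Ward linkage
dendrogram(Z, no_labels=True)     # the dendrogram makes the merge order interpretable
plt.title("Hierarchical clustering dendrogram")
plt.show()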

# Note: If one of the features has a range of values much larger than the others, clustering
will be completely dominated by that one feature. Hence, it is important to ensure that the
range of the variables is similar by normalizing the data before clustering.
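
A minimal sketch of this note (an illustrative assumption, not part of the recorded program below)
scales every feature to the same [0, 1] range before clustering:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature spans 1-3, the other 1000-3000; unscaled, the second would dominate
data = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
scaled = MinMaxScaler().fit_transform(data)
print(scaled)   # both columns now lie in [0, 1]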

Number of Clusters:
Sometimes we know exactly how many clusters are required for further analysis.
For example, while clustering data on people's physical features to group them into small,
medium and large sizes, we know that k is 3. However, in some cases we may not have decided
on the number of clusters in advance. In those cases, if we are using K-Means Clustering, we
may use the 'elbow method' to choose the optimal number of clusters, or use our judgement to
choose where to draw the line in the dendrograms obtained from Hierarchical Clustering.

In the elbow method, the optimal number of clusters is chosen as the point beyond
which the rate of decrease of the within-cluster sum of squares starts to fall significantly. In
some cases, we need not use the elbow method if we are certain about the number of
clusters required. For example, in this case, suppose that we wanted to form 3 clusters of
students' knowledge so that we can classify them into three different groups and potentially
use different strategies to help them improve their knowledge.
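
As a minimal sketch of the elbow method — an illustrative assumption on the iris data, not part of
the recorded program below — the within-cluster sum of squares (inertia) is plotted against k and
the 'elbow' is read off the curve:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)   # normalize before clustering
ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares

plt.plot(list(ks), inertias, marker='o')
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares")
plt.show()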

Next, we want to perform classification on unseen data using the new categorical
target values of the class. We can use multiclass classification methods in Machine Learning on
this data. The data appears to be well separated in space, as seen from the plots. First, we will
split the data into training and test sets. Then, we will train the Machine Learning models on
the training data and evaluate their performance on the test data. There are numerous ways
to evaluate the performance of a model. Here, we will use the simplest metric, accuracy, to
evaluate our models.

# The algorithms to be used for this multi-class classification task and the reason why they
were selected from the list of all algorithms are stated below:

KNN (K-Nearest Neighbors) -


KNN uses distance as the metric and the labels for the dataset were also obtained
using distance as the metric when we applied K-Means Clustering. Thus, KNN may
perform well on this dataset.

Decision Tree Classifier -


We almost always want to apply a few Machine Learning methods to any dataset
and compare them based on a suitable evaluation metric, rather than selecting one final
model based only on intuition. Although decision trees may not perform best on a small dataset
such as this one, they are highly interpretable.

Naive Bayes -
Based on the assumption that the variables are independent, and making a probabilistic
estimation using a maximum likelihood hypothesis, this algorithm is highly efficient
compared to other Machine Learning models.
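
As a minimal sketch that compares the three classifiers described above by accuracy — an
illustrative assumption on the iris data, not the recorded program below, which uses logistic
regression:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))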

PROGRAM CODE:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train a logistic regression model on the training set
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predict the labels of the testing set using the trained model
y_pred = clf.predict(X_test)

# Print the classification accuracy of the model
print("Classification accuracy:", np.mean(y_pred == y_test))

# Train a KMeans clustering model on the entire dataset
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Print the cluster centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)

OUTPUT:

RESULT:
Thus, classification and clustering of a sample data set were implemented successfully.

