EX NO : 01 CREATION OF TABLES AND DATA PREPROCESSING
DATE :
AIM
To implement a Python program for
a) creation of tables
b) preparation of data for machine learning using data preprocessing methods from
scikit-learn
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
CONCEPT
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.
ALGORITHM :
1. Define the table structure:
● Determine the columns or fields required for your table.
● Define the data types for each column (e.g., text, numeric, date).
4. Data cleaning:
● Identify and handle missing or null values.
● Remove or handle any outliers or inconsistencies in the data.
● Standardize or normalize data formats (e.g., converting dates to a consistent
format).
5. Data transformation:
● Apply any necessary transformations to the data based on your analysis
requirements (e.g., converting categorical variables to numeric, scaling
numeric features).
● Perform feature engineering if needed (e.g., creating new columns,
aggregating data).
6. Data integration:
● If you have multiple data sources, merge or join them based on common
columns or keys.
● Ensure data consistency and resolve any conflicts or duplicate entries.
PROGRAM CODE:
import pandas as pd

# Create a sample table as a pandas DataFrame
data = {'id': [1, 2, 3, 4, 5],
        'name': ['John', 'Mary', 'Bob', 'Alice', 'Jane'],
        'age': [25, 30, 40, 20, 35],
        'gender': ['M', 'F', 'M', 'F', 'F'],
        'income': [50000, 60000, 80000, 40000, 70000]}
df = pd.DataFrame(data)
print(df)
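The code above only builds the table. A minimal preprocessing sketch with scikit-learn is
added below; the injected missing value and the particular imputation, encoding and scaling
steps are illustrative assumptions, not taken from the original program.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Introduce a hypothetical missing value to illustrate imputation
df.loc[2, 'income'] = np.nan

# a) Handle missing values by replacing them with the column mean
imputer = SimpleImputer(strategy='mean')
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])

# b) Encode the categorical 'gender' column as numbers
df['gender'] = LabelEncoder().fit_transform(df['gender'])

# c) Scale the numeric columns to the [0, 1] range
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

print(df)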
OUTPUT:
RESULT:
Thus, the programs are implemented successfully for table creation and data
pre-processing.
EX NO : 02 NORMALIZING THE TABLES USING KNOWLEDGE FLOW
DATE :
AIM
To write and implement a Python program to normalize the attributes of a given
dataset so that all features are brought to a comparable scale.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
ALGORITHM :
6. Remove transitive dependencies:
● Identify any transitive dependencies, where an attribute depends on another
attribute that is not part of the primary key. If any exist, create separate tables
to eliminate these dependencies.
PROGRAM CODE:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create a sample dataset with continuous values
data = {'feature1': [1.5, 3.0, 2.0, 4.5],
        'feature2': [10, 20, 30, 40]}
df = pd.DataFrame(data)

# Instantiate the StandardScaler object
scaler = StandardScaler()

# Fit and transform the data
df_normalized = pd.DataFrame(scaler.fit_transform(df),
                             columns=df.columns)
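The snippet above standardizes every feature to zero mean and unit variance. The short
addition below prints the result and shows min-max scaling as an alternative normalization;
MinMaxScaler is an assumption added for comparison, not part of the original program.

from sklearn.preprocessing import MinMaxScaler

# Standardized values: each column now has mean 0 and unit variance,
# using z = (x - mean) / standard deviation
print(df_normalized)

# Alternative: min-max normalization rescales each column to [0, 1]
# using x' = (x - min) / (max - min)
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(df_minmax)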
OUTPUT:
RESULT:
Thus, the values are normalized using knowledge flow.
EX NO : 03 FINDING FREQUENT DATA ITEM SETS
DATE :
AIM
a) To write and implement a python program to find frequent data itemsets.
b) Also, a program to find frequent elements from a list of numbers.
REQUIREMENTS
1. Python 3.7.0
2. Install: pip installer, pandas, mlxtend library (for TransactionEncoder and apriori)
CONCEPT
Association mining searches for frequent itemsets in a dataset. Frequent itemset
mining uncovers interesting associations and correlations between itemsets in transactional
and relational databases. In short, frequent itemset mining shows which items tend to appear
together in a transaction or relation.
ALGORITHM:
1. Initialize support threshold:
● Determine the minimum support threshold, denoted as min_support, which
represents the minimum occurrence frequency for an itemset to be considered
frequent. This value is typically defined by the user.
6. Repeat steps 3-5:
● Continue generating candidate itemsets, scanning the dataset, and pruning
until no new frequent itemsets are found.
PROGRAM CODE:
import pandas as pd

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
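The transactions above are only defined, not mined. A sketch of the mining step using
mlxtend's TransactionEncoder and apriori follows; mlxtend and the 0.6 support threshold are
assumptions, since the extract does not show which apriori implementation was used.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Find itemsets that appear in at least 60% of the transactions
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)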
OUTPUT:
RESULT:
Thus, the implementation of finding frequent data itemsets is done.
EX NO : 04 DISCRETIZATION FILTERING AND RESAMPLE FILTERING ON
SAMPLE DATA
DATE :
AIM
To write and implement a Python program to perform transformation of data on a given
dataset using
a) Discretization (binning) of continuous attributes
b) Resample filtering of the data
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
ALGORITHM:
1. Input:
● Obtain the sample data that you want to apply discretization and resampling
filtering to.
2. Discretization Filtering:
● Choose the attribute(s) or feature(s) in the sample data that you want to
discretize.
● Determine the discretization method based on your requirements (e.g., equal
width, equal frequency, clustering).
● Apply the chosen discretization method to the selected attribute(s) and
transform the continuous values into discrete bins or intervals.
3. Resampling Filtering:
● Determine the resampling technique that you want to use (e.g.,
undersampling, oversampling, SMOTE).
● If the sample data is imbalanced (i.e., one class dominates the dataset),
consider using resampling techniques to address the class imbalance.
● Apply the selected resampling technique to the sample data, generating a
balanced or adjusted dataset.
4. Output:
● The output of the algorithm will be the filtered and transformed sample data,
where the selected attributes have been discretized and any class imbalance
has been addressed through resampling.
PROGRAM CODE:
import pandas as pd
# Create a sample dataset with continuous values
data = {'timestamp': pd.date_range('2022-01-01', periods=100, freq='H'),
        'temperature': [25.5, 26.0, 25.7, 24.9, 25.2, 25.6, 26.2, 26.5, 27.1, 27.6] * 10}
df = pd.DataFrame(data)
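The sample above stops after building the hourly time series. A sketch of the two filtering
steps is added below; the bin edges, the bin labels and the daily resampling frequency are
assumptions chosen for illustration.

# Discretization (binning): map the continuous temperature into labelled intervals
df['temp_level'] = pd.cut(df['temperature'],
                          bins=[0, 25, 26, 100],
                          labels=['low', 'medium', 'high'])
print(df.head())

# Resample filtering: aggregate the hourly readings to daily averages
df_daily = df.set_index('timestamp').resample('D')['temperature'].mean()
print(df_daily)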
OUTPUT:
RESULT:
Hence the implementation of discretization filtering and resample filtering on sample
data is done.
EX NO : 05 CONSTRUCTION OF DECISION TREE
DATE :
AIM
To write and implement a python program to classify using a decision tree model.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
c) Download a dataset from Kaggle or the UCI repository
CONCEPT
A decision tree is a type of supervised learning algorithm (having a predefined target
variable) that is mostly used in classification problems. It works for both categorical and
continuous input and output variables. In this technique, we split the population or sample
into two or more homogeneous sets (or sub-populations) based on the most significant
splitter/differentiator among the input variables.
ALGORITHM
1. Input:
● Obtain the training dataset consisting of labeled examples, where each
example is represented by a set of attributes and a corresponding class label.
4. Recursion:
● For each branch created in the previous step:
● If all examples in the subset belong to the same class, create a leaf
node with that class label.
● If the subset is empty, create a leaf node with the majority class label
from the parent node's examples.
● Otherwise, select the best attribute from the remaining attributes and
recursively repeat steps 3-4 for the subset.
5. Output:
● The decision tree is constructed once all the attributes have been exhausted or
a stopping condition is met (e.g., reaching a maximum depth, minimum
number of examples per leaf node).
PROGRAM CODE:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Loading the Iris dataset and preparing the feature table
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['Species'] = iris.target

# Replacing the categories of the target variable with the actual names of the species
target = np.unique(iris.target)
target_n = np.unique(iris.target_names)
target_dict = dict(zip(target, target_n))
data['Species'] = data['Species'].replace(target_dict)
# Creating an instance of the classifier class
dtc = DecisionTreeClassifier(max_depth = 3, random_state = 93)

# Splitting the data, training the classifier and predicting on the test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, data['Species'],
                                                    test_size = 0.3, random_state = 93)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

# Computing the confusion matrix on the test predictions
matrix = metrics.confusion_matrix(y_test, y_pred)
target_labels = list(target_n)
figure, axis = plt.subplots(figsize = (6, 5))

# Plotting heatmap
sns.heatmap(matrix, annot = True, fmt = "g", ax = axis, cmap = "magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize = 10)
axis.set_xticklabels(target_labels)
axis.set_ylabel("True Labels", fontsize = 10)
axis.set_yticklabels(target_labels, rotation = 0)
plt.show()
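The extract ends with the confusion-matrix plot. A short, hedged addition below reports the
test accuracy and draws the fitted tree; plot_tree and the figure size are additions not present
in the original program.

from sklearn.tree import plot_tree

# Report the accuracy of the fitted classifier on the held-out test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Visualize the decision rules learned by the tree
plt.figure(figsize=(12, 6))
plot_tree(dtc, feature_names=iris.feature_names,
          class_names=list(target_n), filled=True)
plt.show()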
OUTPUT:
RESULT:
Thus, a Python program to classify data using a decision tree model is implemented successfully.
EX NO : 06 IMPLEMENTATION OF APRIORI ALGORITHM WITH
DIFFERENT SUPPORT AND CONFIDENCE VALUES
DATE :
AIM
To write and implement a python program to perform Market Basket Analysis using
the Apriori algorithm with different support and confidence values.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, scikit-learn library
c) Dataset: Groceries dataset.csv
CONCEPT
The Apriori algorithm is the most popular algorithm for mining association rules. It
mines frequent itemsets and learns association rules between items over large datasets,
based on three important factors: support, confidence, and lift.
The algorithm identifies the frequent individual items in the database and extends
them to larger and larger itemsets as long as those itemsets appear sufficiently often in the
database.
ALGORITHM
PROGRAM CODE:
import itertools
def frequent_itemsets(transactions, support_threshold):
    """
    Finds frequent itemsets in the given transactions, using the Apriori algorithm.
    """
    items = sorted(set(item for transaction in transactions for item in transaction))
    freq_sets = []
    for k in range(1, len(items) + 1):
        candidate_sets = itertools.combinations(items, k)
        freq_sets_k = []
        for c in candidate_sets:
            support = sum(1 for transaction in transactions if set(c).issubset(transaction))
            if support >= support_threshold:
                freq_sets_k.append(c)
        freq_sets.extend(freq_sets_k)
    return freq_sets

transactions = [
    {'milk', 'bread', 'eggs'},
    {'bread', 'cheese'},
    {'milk', 'bread', 'cheese'},
    {'milk', 'bread', 'eggs', 'cheese'},
    {'milk', 'eggs'}
]
support_threshold = 3

freq_sets = frequent_itemsets(transactions, support_threshold)
print("Frequent itemsets:")
for freq_set in freq_sets:
    print(freq_set)
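The function above only lists the frequent itemsets. Since the exercise also varies confidence
values, a small rule-generation helper is sketched below; the helper and the two confidence
thresholds are illustrative additions, not part of the original program.

def association_rules_simple(transactions, freq_sets, min_confidence):
    """Derives rules A -> B from frequent itemsets and filters them by confidence."""
    rules = []
    for itemset in freq_sets:
        if len(itemset) < 2:
            continue
        for i in range(1, len(itemset)):
            for antecedent in itertools.combinations(itemset, i):
                consequent = tuple(item for item in itemset if item not in antecedent)
                support_a = sum(1 for t in transactions if set(antecedent).issubset(t))
                support_ab = sum(1 for t in transactions if set(itemset).issubset(t))
                confidence = support_ab / support_a
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

# Try two different confidence thresholds on the itemsets found above
for min_conf in (0.6, 0.8):
    print(f"Rules with confidence >= {min_conf}:")
    for rule in association_rules_simple(transactions, freq_sets, min_conf):
        print(rule)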
OUTPUT:
RESULT:
Thus, the Apriori algorithm is implemented with different support and confidence values.
EX NO : 07 DERIVING INTERESTING INSIGHTS AND OBSERVE THE
EFFECT OF DISCRETIZATION IN THE RULE GENERATED
USING APRIORI ALGORITHM
DATE :
AIM
To derive interesting insights and observe the effect of discretization in the rule
generated using Apriori algorithm.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
ALGORITHM
To derive interesting insights and observe the effect of discretization in the rules
generated using the Apriori algorithm, follow these steps:
1. Input:
➢ Obtain the dataset containing transactional or itemset data.
2. Discretization:
➢ Select the attribute(s) in the dataset that you want to discretize.
➢ Determine the appropriate discretization method based on the nature of the
attribute(s) and your analysis goals (e.g., equal width, equal frequency,
clustering).
➢ Apply the chosen discretization method to transform the continuous
attribute(s) into discrete bins or intervals.
3. Apply the Apriori algorithm:
➢ Use the discretized dataset as input for the Apriori algorithm to generate
frequent itemsets and association rules.
➢ Set the desired support threshold and confidence threshold for mining the
frequent itemsets and association rules.
4. Analyze generated rules:
➢ Examine the association rules generated by the Apriori algorithm, including
the antecedent, consequent, support, and confidence measures.
➢ Identify the interesting insights based on the support, confidence, and lift (or
other interestingness measures) of the rules.
➢ Explore patterns, dependencies, or relationships among the discrete
attribute(s) and other attributes in the rules.
➢ Observe the differences in the patterns, support, confidence, or other
interestingness measures between the two sets of rules.
➢ Evaluate the impact of discretization on the discovered insights, the quality of
the rules, and the interpretability of the results.
6. Iterate and refine:
➢ Depending on the observed effects and the objectives of your analysis,
iteratively refine the discretization process and the mining parameters (e.g.,
support threshold, confidence threshold) to uncover more meaningful insights
or explore different aspects of the data.
PROGRAM CODE:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# 'basket' is assumed to be an invoice x item quantity table prepared from the
# transaction data (the loading and pivoting steps are not shown in this extract)
def encode_units(x):
    # Encode purchase quantities as 0/1 flags for apriori
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units)

# Generate frequent itemsets with a minimum support of 0.01
frequent_itemsets_01 = apriori(basket_sets, min_support=0.01, use_colnames=True)
# Generate association rules using different confidence values
rules_50 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.5)
rules_60 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.6)
rules_70 = association_rules(frequent_itemsets_01, metric="confidence",
min_threshold=0.7)
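Since the aim is to observe the effect of discretization on the generated rules, a small,
self-contained illustration is added below. The mini purchase table, the 'quantity' bins and the
min_support value are all hypothetical; the idea is that, after binning, an item bought in bulk
becomes a distinct basket item, so the resulting rules can be compared with rules_50,
rules_60 and rules_70 above.

from mlxtend.preprocessing import TransactionEncoder

# Hypothetical mini-example (not the original dataset): bin a continuous quantity so
# that "item bought in bulk" becomes its own market-basket item, then mine again
purchases = pd.DataFrame({
    'invoice':  [1, 1, 2, 2, 3, 3, 4, 4],
    'item':     ['tea', 'cake', 'tea', 'cake', 'tea', 'milk', 'tea', 'cake'],
    'quantity': [1, 6, 2, 1, 8, 1, 1, 7],
})
purchases['qty_level'] = pd.cut(purchases['quantity'], bins=[0, 2, float('inf')],
                                labels=['few', 'bulk'])
purchases['token'] = purchases['item'] + '_' + purchases['qty_level'].astype(str)

# Rebuild transactions with the discretized tokens and mine rules again
transactions = purchases.groupby('invoice')['token'].apply(list).tolist()
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_binned = apriori(onehot, min_support=0.5, use_colnames=True)
rules_binned = association_rules(frequent_binned, metric="confidence", min_threshold=0.5)
print(rules_binned[['antecedents', 'consequents', 'support', 'confidence']])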
OUTPUT:
Association rules with minimum confidence 0.5: antecedents \
0 (60 TEATIME FAIRY CAKE CASES)
1 (ALARM CLOCK BAKELIKE CHOCOLATE)
2 (ALARM CLOCK BAKELIKE IVORY)
3 (ALARM CLOCK BAKELIKE PINK)
4 (ALARM CLOCK BAKELIKE RED )
.. ...
167 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE...
168 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY...
169 (ROSES REGENCY TEACUP AND SAUCER , REGENCY CAK...
170 (PINK REGENCY TEACUP AND SAUCER, REGENCY CAKES...
171 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC...
consequents antecedent support \
0 (PACK OF 72 RETROSPOT CAKE CASES) 0.029608
1 (ALARM CLOCK BAKELIKE RED ) 0.014511
2 (ALARM CLOCK BAKELIKE GREEN) 0.021091
3 (ALARM CLOCK BAKELIKE GREEN) 0.027625
4 (ALARM CLOCK BAKELIKE GREEN) 0.039522
.. ... ...
167 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC... 0.016854
168 (ROSES REGENCY TEACUP AND SAUCER , REGENCY CAK... 0.020730
169 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY... 0.018927
170 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC... 0.013925
171 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE... 0.019648
consequent support support confidence lift leverage conviction
0 0.046372 0.014872 0.502283 10.831547 0.013499 1.916004
1 0.039522 0.010185 0.701863 17.758663 0.009611 3.221602
2 0.035557 0.012393 0.587607 16.525977 0.011643 2.338650
3 0.035557 0.015457 0.559543 15.736710 0.014475 2.189644
4 0.035557 0.023885 0.604333 16.996386 0.022479 2.437513
.. ... ... ... ... ... ...
167 0.019648 0.010771 0.639037 32.523488 0.010439 2.715937
168 0.018927 0.010771 0.519565 27.450362 0.010378 2.042051
169 0.020730 0.010771 0.569048 27.450362 0.010378 2.272339
170 0.024380 0.010771 0.773463 31.724841 0.010431 4.306664
171 0.016854 0.010771 0.548165 32.523488 0.010439 2.175896
[172 rows x 9 columns]
Association rules with minimum confidence 0.6:
antecedents \
0 (ALARM CLOCK BAKELIKE CHOCOLATE)
1 (ALARM CLOCK BAKELIKE RED )
2 (ALARM CLOCK BAKELIKE GREEN)
3 (ALARM CLOCK BAKELIKE IVORY)
4 (ALARM CLOCK BAKELIKE ORANGE)
.. ...
79 (GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY...
80 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC...
81 (PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY...
82 (GREEN REGENCY TEACUP AND SAUCER, REGENCY CAKE...
83 (PINK REGENCY TEACUP AND SAUCER, REGENCY CAKES...
consequents antecedent support \
0 (ALARM CLOCK BAKELIKE RED ) 0.014511
1 (ALARM CLOCK BAKELIKE GREEN) 0.039522
2 (ALARM CLOCK BAKELIKE RED ) 0.035557
3 (ALARM CLOCK BAKELIKE RED ) 0.021091
4 (ALARM CLOCK BAKELIKE RED ) 0.016178
.. ... ...
79 (ROSES REGENCY TEACUP AND SAUCER ) 0.012213
80 (REGENCY CAKESTAND 3 TIER) 0.017575
81 (GREEN REGENCY TEACUP AND SAUCER) 0.011942
82 (ROSES REGENCY TEACUP AND SAUCER , PINK REGENC... 0.016854
83 (GREEN REGENCY TEACUP AND SAUCER, ROSES REGENC... 0.013925
consequents antecedent support \
0 (ALARM CLOCK BAKELIKE RED ) 0.014511
1 (ASSORTED COLOUR BIRD ORNAMENT) 0.015863
2 (BAKING SET 9 PIECE RETROSPOT ) 0.019108
3 (BATHROOM METAL SIGN) 0.013610
4 (BLUE HAPPY BIRTHDAY BUNTING) 0.016178
5 (PINK HAPPY BIRTHDAY BUNTING) 0.015998
6 (WHITE HANGING HEART T-LIGHT HOLDER) 0.015457
7 (GARDENERS KNEELING PAD KEEP CALM ) 0.028616
8 (GREEN REGENCY TEACUP AND SAUCER) 0.025056
9 (ROSES REGENCY TEACUP AND SAUCER ) 0.031140
10 (ROSES REGENCY TEACUP AND SAUCER ) 0.025056
11 (POPPY'S PLAYHOUSE BEDROOM ) 0.015593
12 (POPPY'S PLAYHOUSE KITCHEN) 0.014241
13 (REGENCY TEA PLATE ROSES ) 0.012168
14 (SET/20 RED RETROSPOT PAPER NAPKINS ) 0.014646
15 (SET/6 RED SPOTTY PAPER PLATES) 0.012844
16 (SET/6 RED SPOTTY PAPER CUPS) 0.014646
17 (SMALL DOLLY MIX DESIGN ORANGE BOWL) 0.012979
18 (WOODEN HEART CHRISTMAS SCANDINAVIAN) 0.019513
19 (WOODEN STAR CHRISTMAS SCANDINAVIAN) 0.012438
20 (ALARM CLOCK BAKELIKE RED ) 0.015457
RESULT:
Thus, interesting insights are derived and the effect of discretization on the rules
generated using the Apriori algorithm is observed.
EX NO : 08 IMPLEMENTATION OF ID3 CLASSIFICATION ALGORITHM
DATE :
AIM
To write and implement a python program to classify a dataset using ID3 algorithm.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
CONCEPT
The ID3 decision tree uses a greedy search approach to determine decision node
selection, meaning that it picks an ideal attribute once and does not reconsider or modify its
previous choices. The ID3 algorithm uses entropy and information gain to determine which
attributes best split the data. In other words, if an attribute perfectly classifies the set of data,
then ID3 training stops; otherwise, it recursively iterates over the subsets of data for that
attribute until each subset becomes pure. For this reason, the process of decision node
selection is fundamental in constructing an ID3 decision tree.
ALGORITHM
PROGRAM CODE:
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
criterion = "entropy", random_state = 100,
max_depth = 3, min_samples_leaf = 5)
# Performing training
clf_entropy.fit(X_train, y_train)
return clf_entropy
# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)
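The driver code above calls importdata(), splitdataset(), train_using_gini(), prediction() and
cal_accuracy(), but those helpers do not appear in this extract. A minimal sketch of them is
given below; the Iris dataset is used here as a stand-in because the original data source is not
shown.

from sklearn.datasets import load_iris

def importdata():
    # Stand-in data source: the Iris dataset arranged as features + target column
    iris = load_iris()
    data = pd.DataFrame(iris.data, columns=iris.feature_names)
    data['target'] = iris.target
    print("Dataset Length:", len(data))
    return data

def splitdataset(data):
    # Last column is the target, the rest are features
    X = data.values[:, :-1]
    Y = data.values[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

def train_using_gini(X_train, X_test, y_train):
    # Decision tree with the Gini index as the splitting criterion
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                      max_depth=3, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini

def prediction(X_test, clf_object):
    # Predict class labels for the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

def cal_accuracy(y_test, y_pred):
    # Report confusion matrix, accuracy and a full classification report
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
    print("Report:\n", classification_report(y_test, y_pred))

if __name__ == "__main__":
    main()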
OUTPUT:
RESULT:
Hence, the implementation of a Python program to classify a dataset using the ID3
algorithm is done.
EX NO : 09 NAIVE-BAYES CLASSIFICATION AND K-NEAREST NEIGHBOR
CLASSIFICATION
DATE :
AIM
To write and implement a Python program for Naïve Bayes classification and k-Nearest
Neighbor classification.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
c) Dataset: titanic dataset
CONCEPT
Naive Bayes is a classification technique based on Bayes' Theorem with the
assumption that all feature variables are independent. Thus, a Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to the presence of
any other feature. A Naive Bayes classifier combines three terms to compute the probability
of a class: the class probability in the dataset, multiplied by the probability of the example
feature variables occurring given the current class, divided by the probability of those
particular example feature variables occurring in general. To compute the probability of
particular feature variables occurring, there are three main optional techniques. One can
assume that the value of a particular variable is Gaussian distributed, which is a common
case, and thus this method is useful when the variables are real numbers. A multinomial
distribution works well for feature variables that are categorical, as it computes the
probability based on histogram bins. The final option is to use a Bernoulli probability model
when the data is binary. Naive Bayes is simplistic and easy to use, yet it can outperform
other, more complex classification algorithms. It has fast computation and is therefore well
suited for application on large datasets.
ALGORITHM
1. Input:
➢ Obtain a labeled training dataset consisting of examples with their
corresponding class labels.
➢ Preprocess the data as needed (e.g., handle missing values, normalize
features).
3. Calculate feature probabilities:
For each feature in the dataset:
➢ Calculate the likelihood of the feature given each class label using the
training examples.
➢ Compute the conditional probability of the feature given the class by
counting the frequency of the feature-value pairs within each class.
5. Make predictions:
➢ Assign the class label with the highest posterior probability as the predicted
class for each test example.
6. Output:
➢ The output of the algorithm will be the predicted class labels for the test
examples.
7. For getting the predicted class, iterate from 1 to total number of training data points
8. Calculate the distance between test data and each row of training data. Here we will
use Euclidean distance as our distance metric since it’s the most popular method. The
other metrics that can be used are Chebyshev, cosine, etc.
9. Sort the calculated distances in ascending order based on distance values
10. Get top k rows from the sorted array
11. Get the most frequent class of these rows
12. Return the predicted class
PROGRAM CODE:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
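Only the library imports appear in the extract. A minimal sketch that completes both
classifiers on the Iris data (which matches the imports shown; the choice of k = 5 and the
70/30 split are assumptions) follows.

from sklearn.neighbors import KNeighborsClassifier

# Load the data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Naive Bayes classification (Gaussian likelihoods for the continuous features)
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))

# k-Nearest Neighbor classification with k = 5 (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("k-NN accuracy:", accuracy_score(y_test, knn.predict(X_test)))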
OUTPUT:
RESULT:
Thus the implementation of naïve bayes classification and k-nearest neighbor
classification is done.
EX NO : 10 COMPARE THE PERFORMANCE OF ID3, NAÏVE BAYES AND
K-NEAREST NEIGHBOR CLASSIFICATION ALGORITHMS
DATE :
AIM
To compare the Performance of ID3, Naïve-Bayes and K-Nearest Neighbor
Classification algorithms using python program.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
ALGORITHM
1. Input:
➢ Obtain a labeled dataset consisting of examples with their corresponding
class labels.
➢ Preprocess the data as needed (e.g., handle missing values, normalize
features).
➢ Split the dataset into training and testing sets for evaluation.
2. Train the models:
➢ For each algorithm (ID3, Naïve Bayes, k-NN):
➢ Initialize the model with appropriate parameters.
➢ Train the model using the training dataset.
3. Test the models:
For each algorithm:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Compare the predicted class labels with the true class labels.
4. Evaluate performance metrics:
➢ Calculate evaluation metrics to compare the performance of the algorithms.
➢ Commonly used metrics include accuracy, precision, recall, F1 score, and
area under the ROC curve (AUC-ROC).
➢ Calculate and compare the performance metrics for each algorithm.
5. Repeat and validate:
➢ Perform steps 2-4 multiple times using different training and testing splits
(e.g., cross-validation) to validate the results and reduce bias.
6. Analyze and compare results:
➢ Compare the performance metrics obtained from each algorithm.
➢ Look for patterns or trends in the results to identify which algorithm performs
better overall or in specific scenarios.
➢ Consider factors such as accuracy, computational efficiency, interpretability,
and requirements of your specific problem.
7. Fine-tuning:
➢ Explore parameter tuning for each algorithm to improve their performance
further.
➢ Adjust algorithm-specific parameters to find the optimal settings for your
dataset.
8. Select the best algorithm:
➢ Based on the performance metrics, select the algorithm that provides the best
results for your particular classification task.
PROGRAM CODE:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
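Only the imports are shown in the extract; a minimal sketch that trains the three classifiers
on the Iris data and compares their test accuracy is added below. The entropy criterion
stands in for ID3, and k = 5 and the 70/30 split are assumptions.

# Load the data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# ID3-style decision tree (entropy criterion), Naive Bayes and k-NN (k = 5)
models = {
    "ID3 (entropy) decision tree": DecisionTreeClassifier(criterion="entropy",
                                                          random_state=42),
    "Naive Bayes": GaussianNB(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Train each model and compare accuracy on the same test set
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")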
OUTPUT:
RESULT:
Thus, the performance of the ID3, Naïve-Bayes and K-Nearest Neighbor classification
algorithms is compared using a Python program.
EX NO : 11 IMPLEMENTATION OF K-MEANS CLUSTERING ALGORITHM
DATE :
AIM
To write and implement a python program for k-means clustering algorithm.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
CONCEPT
K-means is a type of unsupervised algorithm which solves the clustering problem. Its
procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters). Data points inside a cluster are homogeneous, while
data points in different clusters are heterogeneous.
ALGORITHM
1. Input:
➢ Obtain a dataset consisting of data points that you want to cluster.
➢ Determine the value of K, the number of clusters you want to create.
2. Initialize centroids:
➢ Randomly select K data points from the dataset as initial centroid positions.
➢ These initial centroids will serve as the center points of the clusters.
3. Assign data points to clusters:
For each data point in the dataset:
➢ Calculate the distance between the data point and each centroid.
➢ Assign the data point to the cluster with the closest centroid.
4. Update centroids:
For each cluster:
➢ Calculate the new centroid position as the mean of all data points belonging
to that cluster.
➢ Update the centroid position accordingly.
5. Repeat steps 3-4:
➢ Iterate the assignment and centroid update steps until convergence or until a
maximum number of iterations is reached.
➢ Convergence is reached when the centroid positions stabilize, and the data
points no longer change their assigned clusters significantly.
6. Output:
➢ The output of the algorithm is the final set of K clusters, along with their
respective data points.
PROGRAM CODE:
import numpy as np
import random

class KMeans:
    def __init__(self, k, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations
        # Recalculate centroids as the mean of all data points in the cluster
        new_centroids = []
        for cluster in clusters:
            if cluster:
                new_centroids.append(np.mean(cluster, axis=0))
            else:
                new_centroids.append(
                    self.centroids[random.randint(0, len(self.centroids) - 1)])
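The class above is cut off at the page break, so a compact, self-contained sketch of the same
assignment/update loop is added below; the two-cluster synthetic data is hypothetical and is
used only to demonstrate the algorithm.

def kmeans(X, k, max_iterations=100, seed=0):
    """Minimal k-means: returns cluster labels and final centroids."""
    rng = np.random.RandomState(seed)
    # Initialize centroids with k randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Demonstration on hypothetical two-cluster data
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(data, k=2)
print("Centroids:\n", centroids)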
OUTPUT:
RESULT:
Thus, the implementation of the k-means clustering algorithm is done.
EX NO : 12 EXPLORING VISUALIZATION FEATURES
DATE :
AIM
To apply and explore various plotting functions using matplotlib and seaborn
packages on the Adult UCI dataset.
a) Normal curves
b) Density and contour plots
c) Correlation and scatter plots
d) Histograms
e) Three-dimensional plotting
DATASET:
The Adult UCI dataset is one of the popular datasets for practice. It poses a supervised
binary classification problem: predicting whether a person makes over 50K a year. The
dataset contains a mix of categorical and numeric attributes.
CONCEPT
Categorical Attributes:
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
Without-pay, Never-worked.
Individual work category
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,
Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
Individual’s highest education degree
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
Married-spouse-absent, Married-AF-spouse.
Individual marital status
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing,
Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Individual’s occupation
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Individual’s relation in a family
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
Race of Individual
sex: Female, Male.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras,
Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala,
Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinidad Tobago, Peru, Hong,
Holland-Netherlands.
Individual’s native country
Continuous Attributes:
age: continuous.
Age of an individual
fnlwgt: final weight, continuous.
The weights on the CPS files are controlled to independent estimates of the
civilian noninstitutional population of the US. These are prepared monthly for us by
Population Division here at the Census Bureau.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
Individual’s working hours per week
PROGRAM CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Adult UCI dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 header=None)
# Add column names
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
'native-country', 'income']
# Drop missing values
df = df.dropna()
# Convert categorical variables to numerical
df['workclass'] = pd.factorize(df['workclass'])[0]
df['education'] = pd.factorize(df['education'])[0]
df['marital-status'] = pd.factorize(df['marital-status'])[0]
df['occupation'] = pd.factorize(df['occupation'])[0]
df['relationship'] = pd.factorize(df['relationship'])[0]
df['race'] = pd.factorize(df['race'])[0]
df['sex'] = pd.factorize(df['sex'])[0]
df['native-country'] = pd.factorize(df['native-country'])[0]
df['income'] = pd.factorize(df['income'])[0]
# Plot a histogram of age
sns.histplot(data=df, x='age')
plt.show()
# Plot a scatterplot of age and education
sns.scatterplot(data=df, x='age', y='education')
plt.show()
# Plot a boxplot of hours-per-week by income
sns.boxplot(data=df, x='income', y='hours-per-week')
plt.show()
# Plot a heatmap of the correlation matrix
corr = df.corr()
sns.heatmap(corr, annot=True)
plt.show()
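The program above covers histograms, scatter plots, box plots and the correlation heatmap;
the aim also lists normal/density curves, contour plots and three-dimensional plotting, so a
hedged sketch of those remaining plots is added below (the choice of columns is an
assumption).

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib

# Normal (density) curve of age with a KDE overlay
sns.histplot(data=df, x='age', kde=True, stat='density')
plt.show()

# Density and contour plot of age vs hours-per-week
sns.kdeplot(data=df, x='age', y='hours-per-week', fill=True)
plt.show()

# Three-dimensional scatter plot of age, education-num and hours-per-week
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['education-num'], df['hours-per-week'],
           c=df['income'], cmap='viridis', s=5)
ax.set_xlabel('age')
ax.set_ylabel('education-num')
ax.set_zlabel('hours-per-week')
plt.show()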
OUTPUT:
RESULT:
Thus, various plotting functions from the matplotlib and seaborn packages were applied
and explored on the Adult UCI dataset.
EX NO : 13 BUILDING LINEAR REGRESSION MODEL AND DERIVE
INTERESTING PATTERNS
DATE :
AIM
To write and implement a python program for linear regression algorithm.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
CONCEPT
Linear regression is used to estimate real values (cost of houses, number of calls, total
sales, etc.) based on continuous variable(s). Here, we establish a relationship between the
independent and dependent variables by fitting a best line. This best-fit line is known as the
regression line and is represented by the linear equation Y = a*X + b.
ALGORITHM
1. Input:
➢ Obtain a dataset consisting of variables/features and corresponding target
values for regression analysis.
2. Data preprocessing:
➢ Clean the data by handling missing values, outliers, and inconsistencies.
➢ Split the dataset into training and testing sets for model evaluation.
3. Feature selection:
➢ Identify relevant features that may influence the target variable by analyzing
their correlations, domain knowledge, or feature importance techniques.
➢ Select a subset of features to use in the linear regression model.
4. Train the linear regression model:
➢ Initialize the model with appropriate parameters.
➢ Fit the model to the training dataset by estimating the coefficients and
intercept that best fit the data using techniques like ordinary least squares or
gradient descent.
5. Evaluate the model:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Calculate evaluation metrics such as mean squared error, mean absolute error,
or R-squared to assess the model's performance.
6. Derive interesting patterns:
➢ Analyze the learned coefficients of the linear regression model.
➢ Identify the features with significant positive or negative coefficients,
indicating their influence on the target variable.
➢ Look for patterns or relationships between the features and the target
variable.
➢ Consider domain-specific knowledge or hypotheses to interpret the patterns
and draw meaningful insights.
7. Refine the model:
➢ Based on the patterns and insights derived, you can iteratively refine the
linear regression model.
➢ Add or remove features, perform feature engineering, or try different
regression techniques to improve the model's performance and capture more
interesting patterns.
8. Output:
➢ The output of the algorithm is the trained linear regression model, evaluation
metrics, and the derived interesting patterns
PROGRAM CODE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
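Only the imports are shown in the extract; a minimal sketch that fits a simple linear
regression on synthetic data (the data follows a hypothetical relation y = 2x + 1 with noise)
and inspects the fitted line is added below.

# Hypothetical data following y = 2x + 1 with a little noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=1.0, size=50)

# Fit the regression line y = a*X + b
model = LinearRegression()
model.fit(X, y)
print("Slope (a):", model.coef_[0])
print("Intercept (b):", model.intercept_)
print("R-squared:", model.score(X, y))

# Plot the data and the fitted line to inspect the pattern
plt.scatter(X, y, s=10, label='data')
plt.plot(X, model.predict(X), color='red', label='fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()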
OUTPUT:
RESULT:
Thus, a linear regression model is built and interesting patterns are derived from it.
EX NO : 14 BUILDING OF MULTI LINEAR REGRESSION MODEL
DATE :
AIM
To write and implement a python program for multi-linear regression model.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library
CONCEPT
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes known
simply as multiple regression, and it is an extension of linear regression. The variable that
we want to predict is known as the dependent variable, while the variables we use to predict
the value of the dependent variable are known as independent or explanatory variables.
ALGORITHM
1. Input:
➢ Obtain a dataset consisting of variables/features and corresponding target
values for regression analysis.
2. Data preprocessing:
➢ Clean the data by handling missing values, outliers, and inconsistencies.
➢ Split the dataset into training and testing sets for model evaluation.
3. Feature selection:
➢ Identify relevant independent variables/features that may influence the target
variable by analyzing their correlations, domain knowledge, or feature
importance techniques.
➢ Select a subset of features to use in the multiple linear regression model.
4. Train the multiple linear regression model:
➢ Initialize the model with appropriate parameters.
➢ Fit the model to the training dataset by estimating the coefficients and
intercept that best fit the data using techniques like ordinary least squares or
gradient descent.
5. Evaluate the model:
➢ Apply the trained model to the testing dataset to make predictions.
➢ Calculate evaluation metrics such as mean squared error, mean absolute error,
or R-squared to assess the model's performance.
6. Assess feature significance:
➢ Examine the statistical significance of the estimated coefficients for each
independent variable in the multiple linear regression model.
➢ Utilize hypothesis testing techniques or p-values to determine the
significance of each variable's contribution.
7. Refine the model:
➢ Based on the statistical significance and performance evaluation, you can
iteratively refine the multiple linear regression model.
➢ Add or remove independent variables, perform feature engineering, or try
different regression techniques to improve the model's performance and
interpretability.
8. Output:
➢ The output of the algorithm is the trained multiple linear regression model,
evaluation metrics, and the assessment of feature significance.
PROGRAM CODE:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical training data (the original training step is not shown in this extract):
# three input features per row and one target value
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
y = np.array([6, 15, 24, 33])

# Fit the multiple linear regression model on the sample data
model = LinearRegression()
model.fit(X, y)

# Use the model to predict the target variable for new input features
new_X = np.array([[13, 14, 15],
                  [16, 17, 18]])
predicted_y = model.predict(new_X)
print("Predicted y: ", predicted_y)
OUTPUT:
RESULT:
Thus, the python program for multi-linear regression algorithm is implemented
successfully.
EX NO : 15 CLASSIFICATION AND CLUSTERING OF A SAMPLE DATA SET
DATE :
AIM
To implement a Python program to classify and cluster a student dataset from Kaggle.
REQUIREMENTS
a) Python 3.7.0
b) Install: pip installer, pandas, Scikit-learn library, seaborn, matplotlib
c) Dataset link:
https://www.kaggle.com/code/yoghurtpatil/clustering-and-classification-on-student-data/data
CONCEPT
K-Means Clustering: The most common methods used for identifying clusters or classes in
unlabelled data are:
1) K-Means Clustering
2) Hierarchical Clustering.
While both are used for the same purpose, their underlying techniques are different.
Comparison:
It is natural to wonder which method to choose when performing a clustering task.
There are several points of comparison between the two. While Hierarchical Clustering is
highly interpretable by looking at the dendrograms, it has a higher time complexity of O(n^2)
compared to K-Means Clustering, which has a linear time complexity of O(n). Even when
iterating K-Means for different initial clusters, it would be more efficient for clustering large
amounts of data. In contrast, K-Means Clustering requires the data to be continuous, while
Hierarchical Clustering can be run on categorical data by defining a similarity metric rather
than a distance.
# Note: If one of the features has a range of values much larger than the others, clustering
will be completely dominated by that one feature. Hence, it is important to ensure that the
range of the variables is similar by normalizing the data before clustering.
Number of Clusters:
Sometimes we might know exactly what is the number of clusters required for
further analysis. For example, while clustering the data for physical features of people for
clustering them into small, medium and large sized, we know that k is 3. However in some
cases we might not be pre-decided about the number of clusters. In those cases, if using
K-Means Clustering, we may use the 'elbow method' to choose the optimal number of
clusters or use our judgement to choose where to draw the line in the dendrograms obtained
from Hierarchical Clustering.
In the elbow method, the optimal number of clusters is chosen as the point beyond
which the rate of decrease of the within clusters sum of squares starts to fall significantly. In
some cases, we need not use the elbow method if we are certain about the number of
clusters required. For example, in this case, suppose that we wanted to form 3 clusters of
student's knowledge to be able to classify them in three different groups and potentially use
different strategies to help them better their knowledge.
Next, we want to perform classification on unseen data and the new categorical
target values of class. We can use multiclass classification methods in Machine Learning on
this data. The data appears to be well separated in space as seen from the plots. First we will
split the data into training and test sets. Then, we will train the Machine Learning models on
the training data and evaluate their performance on the test data. There are numerous ways
to evaluate the performance of a model. Here, we will use the simplest metric, accuracy, to
evaluate our models.
# The algorithms to be used for this multi-class classification task and the reason why they
were selected from the list of all algorithms are stated below:
Naive Bayes -
Based on the assumption that variables are independent and making a probabilistic
estimation using a maximum likelihood hypothesis, this algorithm is highly efficient as
compared to other Machine Learning models.
PROGRAM CODE:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
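Only the imports appear in the extract; a hedged sketch that completes the clustering and
classification steps is added below, using the Iris data as a stand-in for the student dataset,
which is not shown here.

# Load the data (stand-in for the student dataset from Kaggle)
X, y = load_iris(return_X_y=True)

# Clustering: group the observations into 3 clusters with k-means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))

# Classification: train a logistic regression model on the labelled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))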
OUTPUT:
RESULT:
Thus, the implementation of classification and clustering of a sample data set is done.