0% found this document useful (0 votes)

58 views19 pages

AIDS - DM Using Python - Lab Programs

The document discusses implementing various data preprocessing and modeling techniques using Python. It covers tasks like data loading, handling missing values, feature scaling, and splitting data into training and test sets. It also demonstrates similarity measures, linear regression modeling, decision tree modeling, and Naive Bayes classification on sample datasets.

Uploaded by

yelubandirenukavidyadhari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

58 views19 pages

AIDS - DM Using Python - Lab Programs

Uploaded by

yelubandirenukavidyadhari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Sagi Rama Krishnam Raju Engineering College(A): Bhimavaram

Department of Information Technology

DATA MINING USING PYTHON LAB
Program: AIDS, Class: ¾ B Tech 2nd Semester, Course Code: B20AD3210
Course Outcomes:
By the end of the course, students will be able to:
Serial Course Outcome Knowledge Level
Number
1 Apply pre-processing techniques on real world datasets. K3
2 Apply Apriori algorithm to generate frequent item sets. K3
3 Apply Classification and clustering algorithms on different K3
datasets.

1. Demonstrate the following data pre-processing tasks using python libraries. a) Loading the
dataset. b) Identifying the dependent and independent variables. c) Dealing with missing data.
Aim: To implement the following pre-processing tasks:
a) Loading the dataset b) Identifying the dependent and independent variables c) Dealing with missing
data
Description: Data pre-processing is an important preliminary step in the data mining process. Real
world data is messy and is often created for various business problems. As a result, the data may
contain missing values, duplicate data or some errors. The results of any data mining task depend
upon the quality of data fed to the mining algorithm. Hence, various forms of data pre-processing are
applied on data before it is used.
Dependent and independent variables are identified based on semantics of the data and finding the
correlation among various attributes. Missing data are handled using several approaches. One among
them is replacing the missing values with mean value of the remaining values of that column.
Description about the dataset:
The dataset contains 30 records and 3 columns. There are two independent variables namely
‘YearsExperience’ which describes the experience of an individual in his/her job in years , ‘Age’
which describes the age of an individual in years. The target variable is ‘Salary’ which describes the
salary of an individual in rupees.
Dataset:
Program:

a. Loading the dataset

#Loading the dataset

df = pd.read_csv('D:\Salary_dataset.csv')

display(df.head())

o/p :
b. Identifying the dependent and independent variables

#identifiying the independent and dependent variables

x = pd.DataFrame(df.iloc[:,:-1].values,columns=df.columns[:-1])

y = pd.DataFrame(df.iloc[:,-1].values,columns=['Salary'])

print("displaying the values of the independent features for the first five records")

display(x[:5])

print("displaying the values of the dependent feature for the first five records")

display(y[:5])

o/p:

c. Dealing with missing data

#checking for count of null values present in the data set

df.isnull().sum()

o/p :
YearsExperience 2
Age 2
Salary 2
dtype: int64

#Replacing the NaN values with their corresponding column mean values

for column in df.columns:

mean_value = df[column].mean()

null_indexes = df[df[column].isnull()].index.tolist()

print(f'indexes of the records having NaN values in the column {column} :',end=" ")
print(null_indexes)

if(column == 'Age'):

mean_value = np.ceil(mean_value)

else:

mean_value = round(mean_value,1)

df[column].fillna(value=mean_value,inplace=True)

print(f'TheNaN values in the column {column} are updated with',mean_value)

o/p :
indexes of the records having NaN values in the column
YearsExperience : [8, 18]
The NaN values in the column YearsExperience are updated with 5.4
indexes of the records having NaN values in the column Age : [13, 28]
The NaN values in the column Age are updated with 27.0
indexes of the records having NaN values in the column Salary : [7,
16]
The NaN values in the column Salary are updated with 77129.1

#checking for count of the null values in the dataset for verification

df.isnull().sum()

o/p:
YearsExperience 0
Age 0
Salary 0
dtype: int64

2. Demonstrate the following data pre-processing tasks using python libraries. a) Dealing with
categorical data b) Scaling the features c) Splitting dataset into Training and Testing Sets

Aim: To implement the following pre-processing tasks using python libraries.

a) Dealing with categorical data b) Scaling the features c) Splitting dataset into Training and Testing
Sets
Description: Categorical data can be converted to numeric data, one numeric value for each
categorical value. However, the order of the numeric values doesn’t make any sense if the attribute is
not ordinal. Scaling of some features is necessary to see that attributes with large range do not
dominate attributes with small range.
Program:
3 Demonstrate the following Similarity and Dissimilarity Measures using python a) Pearson’s
Correlation b) Cosine Similarity c) Jaccard Similarity d) Euclidean Distance e) Manhattan
Distance.

Aim: To implement a) Pearson’s Correlation b) Cosine Similarity c) Jaccard Similarity d)

Euclidean Distance e) Manhattan Distance using python.

Description:
The Pearson Correlation measures the strength of the linear relationship between two variables.
Cosine similarity measures the similarity between two vectors of an inner product space.
Jaccard Similarity is a common proximity measurement used to compute the similarity between
two objects, such as two text documents.
The Euclidean distance between two points in Euclidean space is the length of a line segment between
the two points.
The Manhattan distance between two points is sum of projected distances between the two points.
Program:

4 Build a model using linear regression algorithm on any dataset.

Aim: To Build a regression model using linear Regression algorithm
Description Linear regression is a statistical method used to model the relationship between a
dependent variable and one or more independent variables. There are two types of linear regression:
simple and multiple. Simple linear regression involves only one independent variable, while multiple
linear regression involves two or more independent variables. Here are the general steps for
performing linear regression:

 Data Preparation
 Choose the Independent and Dependent Variables
 Splitting the Data into Train and Test Datasets
 Fitting the data into Linear regression model
 Make Predictions using Testing dataset
 Interpreting the Results
Dataset Description: A salary dataset typically includes information about individuals and their
corresponding salaries. Some common variables that may be included in a salary dataset are:
Yearsexperience: The number of years of experience the individual has in their field.
Age: The age of the individual.
Salary: The salary earned by the individual.

Program:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import seaborn as sns
data=pd.read_csv(r"C:\Users\HP\Downloads\Salary_Data.csv")
data=data.dropna()
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
model = LinearRegression()
model.fit(x_train, y_train)
r_sq = model.score(x, y)
print(f"coefficient of determination for train dataset(r_square): {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")
y_pred = model.predict(x_test)
df={'Original':list(y_test),"Predicted":list(y_pred)}
df=pd.DataFrame(df)
print(df)
r_sq = model.score(x_test, y_test)
print(f"coefficient of determination for test data set(r_square): {r_sq}")
sns.regplot(y_test,y_pred)

OUTPUT:
coefficient of determination for train dataset(r_square): 0.9590250467694154
intercept: 5888.111721765032
slope: [7177.25453766 1185.80287577]
Original Predicted
0 91738.0 90267.528851
1 66029.0 73322.984634
2 98273.0 92420.705212
3 43525.0 46330.284064
4 116969.0 115575.130482
5 55794.0 63056.398891
coefficient of determination for test data set(r_square): 0.9613797085069627

5 Build a classification model using Decision Tree algorithm on iris dataset.

Aim: Build a classification model using Decision Tree algorithm on iris dataset

Description: Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome along that branch.
Data Set Description: The entire dataset has 6 columns namely
ID,sepal_length,sepal_width,petal_length,petal_width,species.Species column contains 3 values .They
are Iris-setosa , Iris-versicolor ,Iris-virginica.

Program:

Output:
6. Apply Naïve Bayes Classification algorithm on any dataset.

Aim: Apply Naïve Bayes’ Classification algorithm on any dataset.

Description: The Naïve Bayes’ classifier is a supervised machine learning algorithm, which is used
for classification tasks, like text classification. It is also part of a family of generative learning
algorithms, meaning that it seeks to model the distribution of inputs of a given class or category.

Data Set Description: The entire dataset has 6 columns namely

ID,sepal_length,sepal_width,petal_length,petal_width,species.Species column contains 3 values. They
are Iris-setosa, Iris-versicolor, Iris-virginica.

Program:

import pandas as pd
import numpy as np
df=pd.read_csv(r"C:\Users\dell\OneDrive\Desktop\New folder\sem 6\DWDM LAB\Iris.csv")
df.head(5)

X=df.iloc[:,:-1]
Y=df.iloc[:, -1]

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(X,Y, test_size=0.7, random_state=42)
y_train

from sklearn.naive_bayes import GaussianNB

classifier=GaussianNB()
classifier.fit(x_train,y_train)
y_pred=classifier.predict(x_test)
y_pred

from sklearn import metrics

print("accuracy:", metrics.accuracy_score(y_test,y_pred))
from sklearn.metrics import confusion_matrix, accuracy_score
cm=confusion_matrix(y_test,y_pred)
cm

Output:
accuracy: 0.9809523809523809
array([[40, 0, 0],
[ 0, 32, 1],
[ 0, 1, 31]], dtype=int64)
7) Generate frequent item sets using Apriori Algorithm in python and also generate association
rules for any market basket data.
Aim: Generating frequent item sets using Apriori Algorithm.
Description: Apriori Algorithm refers to an algorithm that is used in mining frequent products sets
and relevant association rules. It is a level wise algorithm. It is used to identify frequent patterns or
associations between items in a dataset.
The Apriori Algorithm has two main components:
• Support: The support of an item set is defined as the number of transactions containing the item set
divided by the total number of transactions.
• Confidence: The confidence of a rule is defined as the number of transactions containing both the
antecedent and consequent of the rule divided by the number of transactions containing only the
antecedent.
Dataset Description: Market Basket Optimisation dataset contains transactions from a fictional
grocery store. The dataset contains 22 transactions, where each transaction represents a customer
purchase and contains a list of items bought by the customer.
The dataset includes a total of 6 items, which are:

 Wine
 Chips
 Bread
 Butter
 Milk
 Apple
Program:
import numpy as np
from apyori import apriori
import pandas as pd
store_data=pd.read_csv(“Market_Basket_Optimisation.csv”)
records=[]
for i in range(0,22):
records.append([str(store_data.values[i,j]) for j in range(0,6)])
#Build the Apriori model
association_rules=apriori(records,min_support=0.50,
min_confidence=0.7,
min_lift=1.2,
min_length=2)
association_results=list(association_rules)
#Getting the numbers of rules
print("Numbers of rules:",len(association_results))
#glance at the first rule
print(association_results)
Output:
Numbers of rules: 1
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Butter'}), items_add=frozenset({'Bread',
'Milk'}), confidence=0.7333333333333334, lift=1.241025641025641),
OrderedStatistic(items_base=frozenset({'Bread', 'Milk'}), items_add=frozenset({'Butter'}),
confidence=0.8461538461538461, lift=1.241025641025641)])]

8) Apply K- Means clustering algorithm on any dataset.

Description: K-Means clustering is an unsupervised learning algorithm, which groups the unlabelled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters,
and so on. The Euclidean distance measure is used to find distance between two points.

Before implementation, let's understand what type of problem we will solve here. The data set used is
Mall_Customers, which is the data of customers who visit the mall and spend there.

In the given dataset, there are Customer_Id, Gender, Age, Annual Income ($), and Spending
Score (which is the calculated value of how much a customer has spent in the mall, the more the
value, the more he has spent).

Program:

import numpy as nm

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import pandas as pd

dataset = pd.read_csv(r"C:\Users\mmuni\Downloads\archive (1)\Mall_Customers.csv")

x = dataset.iloc[:, [3, 4]].values

#training the K-means model on a dataset

kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)

y_predict= kmeans.fit_predict(x)

plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first
cluster

plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for
second cluster
plt.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third
cluster

plt.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for
fourth cluster

plt.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for
fifth cluster

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label =

'Centroid')

plt.title('Clusters of customers')

plt.xlabel('Annual Income (k$)')

plt.ylabel('Spending Score (1-100)')

plt.legend()

plt.show()

output:
9) Apply Hierarchical Clustering algorithm on any dataset.
Program:
Output:

10) Apply DBSCAN clustering algorithm on any dataset.

Aim: To form clusters from a data set by applying DBSCAN clustering algorithm.
Description: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN algorithm is based on the notion of “clusters” and “noise”.
The key idea is that for each point of a cluster, the neighbourhood of a given radius must contain at
least a minimum number of points. Those points in the dense regions are grouped to form clusters.
In this algorithm, we have 3 types of data points.
Core Point, Border Point, and Noise point.
Dataset Description: The data set used is Mall_Customers which contain attributes: Customer_Id ,
Age, Gender, Annual Income (k$), Spending Score(1-100).
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
df = pd.read_csv(r"D:\Mall_Customers.csv")
X_train = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
clustering = DBSCAN(eps=12.5, min_samples=4).fit(X_train)
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_
DBSCAN_dataset.Cluster.value_counts().to_frame()
outliers = DBSCAN_dataset[DBSCAN_dataset['Cluster']==-1]
fig2, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot('Annual Income(k$)','Spending Score (1-
100)',data=DBSCAN_dataset[DBSCAN_dataset['Cluster']!=-1],hue='Cluster', ax=axes[0],
palette='Set2',legend='full',s=200)
sns.scatterplot('Age', 'Spending Score (1-100)',data=DBSCAN_dataset[DBSCAN_dataset['Cluster']!
=-1],hue='Cluster', palette='Set2', ax=axes[1], legend='full',s=200)
axes[0].scatter(outliers['Annual Income (k$)'], outliers['Spending Score (1-100)'], s=10,
label='outliers', c="k")
axes[1].scatter(outliers['Age'], outliers['Spending Score (1-100)'], s=10,label='outliers', c="k")
axes[0].legend()
axes[1].legend()
plt.setp(axes[0].get_legend().get_texts(), fontsize='12')
plt.setp(axes[1].get_legend().get_texts(), fontsize='12')
plt.show()

Output:

Data Science
No ratings yet
Data Science
18 pages
DataAnalytics Lab Manual
No ratings yet
DataAnalytics Lab Manual
35 pages
Monika Sree 11-07-2024
No ratings yet
Monika Sree 11-07-2024
36 pages
Da Program Upto 6
No ratings yet
Da Program Upto 6
20 pages
Bussiness Report PM
No ratings yet
Bussiness Report PM
44 pages
DA Programs
No ratings yet
DA Programs
44 pages
Python Practice Questions
No ratings yet
Python Practice Questions
5 pages
Machine Learning Lab Experiments Guide
No ratings yet
Machine Learning Lab Experiments Guide
47 pages
Lab Mannual of ML
No ratings yet
Lab Mannual of ML
43 pages
ML LAB Manual-1
No ratings yet
ML LAB Manual-1
33 pages
DA Lab
No ratings yet
DA Lab
27 pages
ML
No ratings yet
ML
21 pages
ML File Syllabus
No ratings yet
ML File Syllabus
43 pages
FYMCA IDSLab A6 Submission
No ratings yet
FYMCA IDSLab A6 Submission
9 pages
Project Paarth
No ratings yet
Project Paarth
21 pages
Python 1
No ratings yet
Python 1
3 pages
Python Data Preprocessing & Regression
No ratings yet
Python Data Preprocessing & Regression
68 pages
Vanshika Goyal Gec Practicals
No ratings yet
Vanshika Goyal Gec Practicals
31 pages
Linear Regression
No ratings yet
Linear Regression
11 pages
CSC 240 HW 2
No ratings yet
CSC 240 HW 2
5 pages
Da Rec
No ratings yet
Da Rec
29 pages
Data Mining with Python Lab Guide
No ratings yet
Data Mining with Python Lab Guide
39 pages
Dmdw-Lab Manual
No ratings yet
Dmdw-Lab Manual
61 pages
Simple Linear Regression
No ratings yet
Simple Linear Regression
30 pages
Linear Regression
No ratings yet
Linear Regression
20 pages
Data Science Experiment Guide
100% (2)
Data Science Experiment Guide
43 pages
Salary Prediction with Linear Regression
No ratings yet
Salary Prediction with Linear Regression
7 pages
Python Data Analysis Guide
No ratings yet
Python Data Analysis Guide
171 pages
Predictive Modelling Alternate Project Business Case
No ratings yet
Predictive Modelling Alternate Project Business Case
47 pages
Machine Learning
No ratings yet
Machine Learning
9 pages
ML Updated File
No ratings yet
ML Updated File
36 pages
IS5312 Mini Project-2
No ratings yet
IS5312 Mini Project-2
5 pages
Ad3411 - Student
No ratings yet
Ad3411 - Student
27 pages
House Price Prediction for Analysts
No ratings yet
House Price Prediction for Analysts
91 pages
Data Preprocessing Techniques in Python
No ratings yet
Data Preprocessing Techniques in Python
27 pages
Data - Analytics Lab - Manual JNTUH R22 Regulation
No ratings yet
Data - Analytics Lab - Manual JNTUH R22 Regulation
26 pages
ML Lab Manual 2025-2
No ratings yet
ML Lab Manual 2025-2
35 pages
TYCS Practical
No ratings yet
TYCS Practical
26 pages
Machine Learning
No ratings yet
Machine Learning
30 pages
Linear Regression 1
No ratings yet
Linear Regression 1
2 pages
ML Manoj
No ratings yet
ML Manoj
51 pages
EX2 - BIGDATA - San
No ratings yet
EX2 - BIGDATA - San
9 pages
Data Preprocessing 1
No ratings yet
Data Preprocessing 1
6 pages
Data Mining Lab Manual CSE VII Sem
No ratings yet
Data Mining Lab Manual CSE VII Sem
63 pages
Data Analysis for Beginners
No ratings yet
Data Analysis for Beginners
8 pages
Simple Linear Regression in Machine Learning
No ratings yet
Simple Linear Regression in Machine Learning
7 pages
Gec Practicals
No ratings yet
Gec Practicals
31 pages
ML WorkSheet Milan
No ratings yet
ML WorkSheet Milan
4 pages
ML 1-11
No ratings yet
ML 1-11
27 pages
Data Analytics Lab Manual
No ratings yet
Data Analytics Lab Manual
26 pages
Python Simple Linear Regression Guide
No ratings yet
Python Simple Linear Regression Guide
14 pages
DMV Unit 3 PPT - RSK - 250419 - 125620 Jfhuehiwhu
No ratings yet
DMV Unit 3 PPT - RSK - 250419 - 125620 Jfhuehiwhu
89 pages
Business Report PM Suchita Bhovar March 10 2024
No ratings yet
Business Report PM Suchita Bhovar March 10 2024
27 pages
Python For Data Sceince l1 Hands On
No ratings yet
Python For Data Sceince l1 Hands On
5 pages
Practical # 10
No ratings yet
Practical # 10
5 pages
Data Analytics Lab Manual - 250402 - 095326
No ratings yet
Data Analytics Lab Manual - 250402 - 095326
58 pages
External
No ratings yet
External
11 pages
Abhiml ML File
No ratings yet
Abhiml ML File
74 pages
Data Analysis and Visualization Course
No ratings yet
Data Analysis and Visualization Course
4 pages
Case Studies and Best Practices From Leading Companies For Monitoring API Endpoints
No ratings yet
Case Studies and Best Practices From Leading Companies For Monitoring API Endpoints
7 pages
C-SAT 2200HD User Manual Overview
No ratings yet
C-SAT 2200HD User Manual Overview
20 pages
Quectel BG77xA&BG95xA QuecOpen Quick Start Guide V1.0.0 Preliminary 20210723
No ratings yet
Quectel BG77xA&BG95xA QuecOpen Quick Start Guide V1.0.0 Preliminary 20210723
24 pages
Affiche Master
No ratings yet
Affiche Master
1 page
MCAMX4 RefGuide
No ratings yet
MCAMX4 RefGuide
0 pages
RATIONAL KFC Training Document 2017 V1
No ratings yet
RATIONAL KFC Training Document 2017 V1
30 pages
SpyderGuard
No ratings yet
SpyderGuard
1 page
Ebook CISSP Domain 08 Software Development Security
No ratings yet
Ebook CISSP Domain 08 Software Development Security
113 pages
Microsoft 365 Subscription Plans Overview
No ratings yet
Microsoft 365 Subscription Plans Overview
10 pages
Final Uace Subsidiary Ict NCDC Syllabus Highlights
No ratings yet
Final Uace Subsidiary Ict NCDC Syllabus Highlights
9 pages
1d. Advertisements MTOs 2 Discipline GIS Economics
No ratings yet
1d. Advertisements MTOs 2 Discipline GIS Economics
2 pages
Selected Candidates for Interviews
No ratings yet
Selected Candidates for Interviews
12 pages
Elizabeth Berlin Resume
No ratings yet
Elizabeth Berlin Resume
1 page
1.3.3 Constraining A Sketch
No ratings yet
1.3.3 Constraining A Sketch
4 pages
Star CCM+3
No ratings yet
Star CCM+3
3 pages
CHAPTER 6
No ratings yet
CHAPTER 6
100 pages
Cyber Solutions
No ratings yet
Cyber Solutions
12 pages
Fs 3 Ep 1
No ratings yet
Fs 3 Ep 1
4 pages
Arbol
No ratings yet
Arbol
1,790 pages
D26W931 D32W931 LC32W053 LC42H053+service+manual
No ratings yet
D26W931 D32W931 LC32W053 LC42H053+service+manual
188 pages
Supercharge Learning With The Geometer's Sketchpad, Version 5
No ratings yet
Supercharge Learning With The Geometer's Sketchpad, Version 5
64 pages
Digital Communication Exam Paper EC-502
No ratings yet
Digital Communication Exam Paper EC-502
3 pages
Twin Loop Treasure Seeker: Robert and David Crone
100% (1)
Twin Loop Treasure Seeker: Robert and David Crone
5 pages
Automatic Chocolate Vending Machine Design
No ratings yet
Automatic Chocolate Vending Machine Design
8 pages
MAP08
No ratings yet
MAP08
26 pages
Educators' Tech Venture Crisis
No ratings yet
Educators' Tech Venture Crisis
2 pages
0417 m15 QP 31
No ratings yet
0417 m15 QP 31
8 pages
Email Marketing - Wikipedia: History
No ratings yet
Email Marketing - Wikipedia: History
7 pages
Keyblanks Guide
No ratings yet
Keyblanks Guide
36 pages
Wild T1000 Theodolite PDF
No ratings yet
Wild T1000 Theodolite PDF
10 pages

AIDS - DM Using Python - Lab Programs

Uploaded by

AIDS - DM Using Python - Lab Programs

Uploaded by

Sagi Rama Krishnam Raju Engineering College(A): Bhimavaram

Department of Information Technology

a. Loading the dataset

#Loading the dataset

#identifiying the independent and dependent variables

c. Dealing with missing data

#checking for count of null values present in the data set

for column in df.columns:

print(f'TheNaN values in the column {column} are updated with',mean_value)

Aim: To implement the following pre-processing tasks using python libraries.

Aim: To implement a) Pearson’s Correlation b) Cosine Similarity c) Jaccard Similarity d)

4 Build a model using linear regression algorithm on any dataset.

5 Build a classification model using Decision Tree algorithm on iris dataset.

Aim: Apply Naïve Bayes’ Classification algorithm on any dataset.

Data Set Description: The entire dataset has 6 columns namely

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn import metrics

8) Apply K- Means clustering algorithm on any dataset.

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

dataset = pd.read_csv(r"C:\Users\mmuni\Downloads\archive (1)\Mall_Customers.csv")

x = dataset.iloc[:, [3, 4]].values

#training the K-means model on a dataset

kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label =

plt.xlabel('Annual Income (k$)')

plt.ylabel('Spending Score (1-100)')

10) Apply DBSCAN clustering algorithm on any dataset.

You might also like