AIDS - DM Using Python - Lab Programs
1. Demonstrate the following data pre-processing tasks using python libraries. a) Loading the
dataset. b) Identifying the dependent and independent variables. c) Dealing with missing data.
Aim: To implement the following pre-processing tasks:
a) Loading the dataset b) Identifying the dependent and independent variables c) Dealing with missing
data
Description: Data pre-processing is an important preliminary step in the data mining process. Real-world data is messy: it is usually collected for other business purposes, so it may contain missing values, duplicate records, or errors. The results of any data mining task depend on the quality of the data fed to the mining algorithm. Hence, various forms of data pre-processing are applied to the data before it is used.
Dependent and independent variables are identified based on the semantics of the data and on the correlation among the attributes. Missing data can be handled in several ways; one common approach is to replace each missing value with the mean of the remaining values in that column.
Description about the dataset:
The dataset contains 30 records and 3 columns. There are two independent variables: ‘YearsExperience’, which describes an individual's job experience in years, and ‘Age’, which describes the individual's age in years. The target variable is ‘Salary’, which describes the individual's salary in rupees.
Dataset:
Program:
a. Loading the dataset
import pandas as pd
import numpy as np

#Loading the dataset from a CSV file (a raw string avoids '\' escape problems in the path)
df = pd.read_csv(r'D:\Salary_dataset.csv')
display(df.head())
o/p :
b. Identifying the dependent and independent variables
#Independent variables: all columns except the last
x = pd.DataFrame(df.iloc[:,:-1].values, columns=df.columns[:-1])
#Dependent (target) variable: the last column
y = pd.DataFrame(df.iloc[:,-1].values, columns=['Salary'])
print("displaying the values of the independent features for the first five records")
display(x[:5])
print("displaying the values of the dependent feature for the first five records")
display(y[:5])
o/p:
c. Dealing with missing data
#Counting the null values in each column
df.isnull().sum()
o/p :
YearsExperience 2
Age 2
Salary 2
dtype: int64
#Replacing the NaN values with their corresponding column mean values
for column in df.columns:
    mean_value = df[column].mean()
    null_indexes = df[df[column].isnull()].index.tolist()
    print(f'indexes of the records having NaN values in the column {column} :', null_indexes)
    if column == 'Age':
        mean_value = np.ceil(mean_value)      #Age is kept as a whole number
    else:
        mean_value = round(mean_value, 1)
    df[column] = df[column].fillna(mean_value)
    print(f'The NaN values in the column {column} are updated with {mean_value}')
o/p :
indexes of the records having NaN values in the column YearsExperience : [8, 18]
The NaN values in the column YearsExperience are updated with 5.4
indexes of the records having NaN values in the column Age : [13, 28]
The NaN values in the column Age are updated with 27.0
indexes of the records having NaN values in the column Salary : [7, 16]
The NaN values in the column Salary are updated with 77129.1
#Checking the count of null values again for verification
df.isnull().sum()
o/p:
YearsExperience 0
Age 0
Salary 0
dtype: int64
2. Demonstrate the following data pre-processing tasks using python libraries. a) Dealing with
categorical data b) Scaling the features c) Splitting dataset into Training and Testing Sets
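No program listing for this experiment appears in this copy of the manual, so the following is a minimal sketch of the three tasks on a small made-up dataset (all column names, values and parameters here are illustrative assumptions, not from the original):
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Illustrative dataset: one categorical column and two numeric columns
df = pd.DataFrame({'Country': ['India', 'Spain', 'India', 'France'],
                   'Age': [34, 27, 41, 30],
                   'Salary': [52000, 48000, 61000, 55000],
                   'Purchased': ['Yes', 'No', 'Yes', 'No']})

#a) Dealing with categorical data: one-hot encode the 'Country' column
x = pd.get_dummies(df[['Country', 'Age', 'Salary']], columns=['Country'])
y = df['Purchased']

#b) Scaling the features: standardise each column to zero mean and unit variance
x_scaled = StandardScaler().fit_transform(x)

#c) Splitting the dataset into training and testing sets (80/20 split)
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)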
3. Demonstrate the following similarity and dissimilarity measures using python. a) Pearson’s Correlation b) Cosine Similarity c) Jaccard Similarity d) Euclidean Distance e) Manhattan Distance
Description:
The Pearson Correlation measures the strength of the linear relationship between two variables.
Cosine similarity measures the similarity between two vectors of an inner product space.
Jaccard Similarity is a common proximity measurement used to compute the similarity between
two objects, such as two text documents.
The Euclidean distance between two points in Euclidean space is the length of a line segment between
the two points.
The Manhattan distance between two points is the sum of the absolute differences of their coordinates.
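As a quick illustration, the following sketch computes the five measures with NumPy and SciPy on two made-up vectors (the vectors and their values are illustrative only):
import numpy as np
from scipy import stats
from scipy.spatial import distance

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])

#Pearson correlation: strength of the linear relationship between a and b
print("Pearson correlation:", stats.pearsonr(a, b)[0])
#Cosine similarity: 1 minus the cosine distance between the vectors
print("Cosine similarity:", 1 - distance.cosine(a, b))
#Jaccard similarity on the boolean vectors: |intersection| / |union|
print("Jaccard similarity:", 1 - distance.jaccard(a, b))
#Euclidean distance: length of the line segment between the points
print("Euclidean distance:", distance.euclidean(a, b))
#Manhattan (city-block) distance: sum of absolute coordinate differences
print("Manhattan distance:", distance.cityblock(a, b))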
4. Build a model using Linear Regression algorithm on any dataset.
Steps:
a) Data preparation
b) Choosing the independent and dependent variables
c) Splitting the data into train and test datasets
d) Fitting the data to a linear regression model
e) Making predictions using the testing dataset
f) Interpreting the results
Dataset Description: A salary dataset typically includes information about individuals and their corresponding salaries. Common variables in such a dataset are:
YearsExperience: The number of years of experience the individual has in their field.
Age: The age of the individual.
Salary: The salary earned by the individual.
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv(r"C:\Users\HP\Downloads\Salary_Data.csv")
data = data.dropna()                       #drop records with missing values

x = data.iloc[:,:-1].values                #independent variables: YearsExperience, Age
y = data.iloc[:,-1].values                 #dependent variable: Salary

#Splitting the data into train and test datasets (80/20 split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

#Fitting the data to a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

r_sq = model.score(x, y)
print(f"coefficient of determination for the full dataset (r_square): {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

#Making predictions using the testing dataset
y_pred = model.predict(x_test)
df = pd.DataFrame({'Original': list(y_test), 'Predicted': list(y_pred)})
print(df)

r_sq = model.score(x_test, y_test)
print(f"coefficient of determination for the test dataset (r_square): {r_sq}")
sns.regplot(x=y_test, y=y_pred)            #newer seaborn versions require keyword arguments
plt.show()
OUTPUT:
coefficient of determination for the full dataset (r_square): 0.9590250467694154
intercept: 5888.111721765032
slope: [7177.25453766 1185.80287577]
Original Predicted
0 91738.0 90267.528851
1 66029.0 73322.984634
2 98273.0 92420.705212
3 43525.0 46330.284064
4 116969.0 115575.130482
5 55794.0 63056.398891
coefficient of determination for the test dataset (r_square): 0.9613797085069627
5. Build a classification model using Decision Tree algorithm on iris dataset.
Aim: To build a classification model using the Decision Tree algorithm on the iris dataset.
Description: Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome along that branch.
Data Set Description: The dataset has 6 columns, namely Id, sepal_length, sepal_width, petal_length, petal_width and species. The species column contains 3 values: Iris-setosa, Iris-versicolor and Iris-virginica.
Program:
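The original program listing is missing here; the following is a minimal sketch, assuming the same Iris.csv file used in the Naïve Bayes program below (the file path, split parameters and tree criterion are illustrative assumptions):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("Iris.csv")              #assumed file location
X = df.iloc[:, 1:-1]                      #drop the Id column; keep the four measurements
Y = df.iloc[:, -1]                        #species is the target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

#Fitting a decision tree classifier and evaluating it on the test set
model = DecisionTreeClassifier(criterion='entropy', random_state=1)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))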
Output:
6. Apply Naïve Bayes Classification algorithm on any dataset.
Description: The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification. It belongs to the family of generative learning algorithms, meaning that it models the distribution of the inputs for a given class or category.
Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB        #Gaussian variant, assumed here since the iris features are continuous
from sklearn.metrics import accuracy_score, confusion_matrix
df = pd.read_csv(r"C:\Users\dell\OneDrive\Desktop\New folder\sem 6\DWDM LAB\Iris.csv")
X = df.iloc[:, 1:-1]          #drop the Id column; keep the four measurements
Y = df.iloc[:, -1]            #Species is the target
#70% of the records are held out for testing (the confusion matrix below sums to 105 records)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.7, random_state=1)
model = GaussianNB().fit(X_train, Y_train)
print("accuracy:", accuracy_score(Y_test, model.predict(X_test)))
confusion_matrix(Y_test, model.predict(X_test))
Output:
accuracy: 0.9809523809523809
array([[40, 0, 0],
[ 0, 32, 1],
[ 0, 1, 31]], dtype=int64)
7) Generate frequent item sets using Apriori Algorithm in python and also generate association
rules for any market basket data.
Aim: Generating frequent item sets using Apriori Algorithm.
Description: The Apriori algorithm is used to mine frequent item sets and the association rules derived from them. It is a level-wise algorithm, used to identify frequent patterns or associations between items in a dataset.
The Apriori Algorithm has two main components:
• Support: The support of an item set is defined as the number of transactions containing the item set
divided by the total number of transactions.
• Confidence: The confidence of a rule is defined as the number of transactions containing both the antecedent and the consequent of the rule divided by the number of transactions containing the antecedent (a small worked example follows this list).
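A tiny worked example of the two measures on made-up transactions (the baskets below are illustrative, not from the Market Basket dataset):
transactions = [{'Bread', 'Milk'}, {'Bread', 'Butter'}, {'Bread', 'Milk', 'Butter'}, {'Milk'}]

#Support of {Bread, Milk}: transactions containing both items / all transactions = 2/4 = 0.5
support = sum({'Bread', 'Milk'} <= t for t in transactions) / len(transactions)

#Confidence of Bread -> Milk: transactions with both / transactions with Bread = 2/3
confidence = sum({'Bread', 'Milk'} <= t for t in transactions) / sum('Bread' in t for t in transactions)
print(support, confidence)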
Dataset Description: Market Basket Optimisation dataset contains transactions from a fictional
grocery store. The dataset contains 22 transactions, where each transaction represents a customer
purchase and contains a list of items bought by the customer.
The dataset includes a total of 6 items, which are:
Wine
Chips
Bread
Butter
Milk
Apple
Program:
import numpy as np
import pandas as pd
from apyori import apriori

#header=None keeps the first transaction from being read as column names
store_data = pd.read_csv("Market_Basket_Optimisation.csv", header=None)
records = []
for i in range(0, 22):
    records.append([str(store_data.values[i, j]) for j in range(0, 6)])

#Build the Apriori model (apyori reads min_support, min_confidence and min_lift;
#min_length is kept from the original listing although apyori ignores it)
association_rules = apriori(records,
                            min_support=0.50,
                            min_confidence=0.7,
                            min_lift=1.2,
                            min_length=2)
association_results = list(association_rules)
#Getting the number of rules
print("Number of rules:", len(association_results))
#Displaying the discovered rule(s)
print(association_results)
Output:
Number of rules: 1
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Butter'}), items_add=frozenset({'Bread',
'Milk'}), confidence=0.7333333333333334, lift=1.241025641025641),
OrderedStatistic(items_base=frozenset({'Bread', 'Milk'}), items_add=frozenset({'Butter'}),
confidence=0.8461538461538461, lift=1.241025641025641)])]
8) Apply K-Means clustering algorithm on any dataset.
Description: K-Means clustering is an unsupervised learning algorithm which groups an unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2 there will be two clusters, for K=3 three clusters, and so on. The Euclidean distance measure is used to find the distance between two points.
Before implementation, let's understand what type of problem we will solve here. The dataset used is Mall_Customers, which contains data about customers who visit a mall and spend there. The dataset has the columns Customer_Id, Gender, Age, Annual Income ($) and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more the customer has spent).
Program:
import numpy as nm
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

#Loading the Mall_Customers data; the file name and the column positions are assumed
#from the description above (columns 3 and 4 = Annual Income, Spending Score)
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, [3, 4]].values

#Training the K-Means model with K=5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

#Visualising the five clusters
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
plt.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
plt.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.legend()
plt.show()
Output:
9) Apply Hierarchical Clustering algorithm on any dataset.
Program:
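The original listing is missing here; the following is a minimal sketch, assuming the same Mall_Customers data and columns used in the K-Means program above (the file name and column indices are assumptions):
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

dataset = pd.read_csv('Mall_Customers.csv')          #assumed file name
x = dataset.iloc[:, [3, 4]].values                   #Annual Income and Spending Score

#Dendrogram to help choose the number of clusters
shc.dendrogram(shc.linkage(x, method='ward'))
plt.title('Dendrogram')
plt.show()

#Agglomerative (bottom-up) hierarchical clustering with 5 clusters
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_predict = hc.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y_predict, s=100)
plt.title('Clusters of customers')
plt.show()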
Output: