CL-I Lab Manual
(Sem-I) 2024-25
Department of
Artificial Intelligence and Data Science
LAB MANUAL
Computer Laboratory-I
BE SEM-1
Course Objectives:
● Apply regression, classification and clustering algorithms for the creation of ML models
● Introduce and integrate models in the form of advanced ensembles
● Conceptualize the representation of data objects
● Create associations between different data objects and the rules that relate them
● Organize data descriptions, data semantics, and consistency constraints of data
Course Outcomes:
After completion of the course, learners should be able to-
Table of Contents
A. To use PCA Algorithm for dimensionality reduction. You have a dataset that includes
measurements for different variables on wine (alcohol, ash, magnesium, and so on). Apply
PCA algorithm & transform this data so that most variations in the measurements of the
variables are captured by a small number of principal components so that it is easier to
distinguish between red and white wine by inspecting these principal components. Dataset
Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv
B. Apply LDA Algorithm on Iris Dataset and classify which species a given flower belongs
to. Dataset Link:https://www.kaggle.com/datasets/uciml/iris
A. Predict the price of the Uber ride from a given pickup point to the agreed drop-off
location. Perform following tasks: 1. Pre-process the dataset.
2. Identify outliers. 3. Check the correlation. 4. Implement linear regression and ridge,
Lasso regression models. 5. Evaluate the models and compare their respective scores
like R2, RMSE, etc. Dataset link:
https://www.kaggle.com/datasets/yasserh/uber-fares-dataset
B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing
the following: a. Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis b. Bivariate analysis: Linear and logistic
regression modeling c. Multiple Regression analysis d. Also compare the results of the
above analysis for the two data sets Dataset
link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
3 A3 Classification Analysis (Any one) 39
A. Implement Random Forest Classifier model to predict the safety of the car. Dataset
link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set
B. Use different voting mechanism and Apply AdaBoost (Adaptive Boosting), Gradient
Tree Boosting (GBM), XGBoost classification on Iris dataset and compare the
performance of three models using different evaluation measures. Dataset Link:
https://www.kaggle.com/datasets/uciml/iris
6 A6 Reinforcement Learning (Any one) 70
B. Solve the Taxi problem using reinforcement learning where the agent acts as a taxi driver
to pick up a passenger at one location and then drop the passenger off at their destination.
4. Clean and preprocess the retrieved data, handling missing values or inconsistent
formats.
5. Perform data modeling to analyze weather patterns, such as calculating average
temperature, maximum/minimum values, or trends over time.
6. Visualize the weather data using appropriate plots, such as line charts, bar plots, or
scatter plots, to represent temperature changes, precipitation levels, or wind speed
variations.
7. Apply data aggregation techniques to summarize weather statistics by specific time
periods (e.g., daily, monthly, seasonal).
8. Incorporate geographical information, if available, to create maps or geospatial
visualizations representing weather patterns across different locations.
9. Explore and visualize relationships between weather attributes, such as temperature and
humidity, using correlation plots or heatmaps.
8 B3 Data Cleaning and Preparation 84
Dataset: "Telecom_Customer_Churn.csv"
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and
standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting
customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.
9 B4 Data Wrangling 93
Dataset: "RealEstate_Prices.csv"
Description: The dataset contains information about housing prices in a specific real
estate market. It includes various attributes such as property characteristics, location, sale
prices, and other relevant features. The goal is to perform data wrangling to gain insights
into the factors influencing housing prices and prepare the dataset for further analysis or
modeling.
Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available (e.g.,
neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding or
label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average
sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis or
modeling process
10 B5 Data Visualization using matplotlib 111
Dataset: "City_Air_Quality.csv"
Tasks to Perform:
time periods.
7. Create box plots or violin plots to analyze the distribution of AQI values for different
pollutant categories.
8. Use scatter plots or bubble charts to explore the relationship between AQI values and
pollutant levels.
9. Customize the visualizations by adding labels, titles, legends, and appropriate color
schemes
11 B6 Data Aggregation 126
Problem Statement: Analyzing Sales Performance by Region in a Retail Company
Dataset: "Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail
company. It includes attributes such as transaction date, product category, quantity sold,
and sales amount. The goal is to perform data aggregation to analyze the sales
performance by region and identify the top-performing regions.
Tasks to Perform:
◻ Aim: To use PCA Algorithm for dimensionality reduction. You have a dataset that includes measurements
for different variables on wine (alcohol, ash, magnesium, and so on). Apply PCA algorithm & transform
this data so that most variations in the measurements of the variables are captured by a small number of
principal components so that it is easier to distinguish between red and white wine by inspecting these
principal components
◻ Outcome: At the end of this experiment, the student will be able to apply PCA for dimensionality reduction and interpret the principal components of a dataset.
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
PCA is an unsupervised machine learning algorithm. PCA is mainly used for dimensionality reduction in a dataset
consisting of many variables that are highly correlated or lightly correlated with each other while retaining the
variation present in the dataset up to a maximum extent.
It is also a great tool for exploratory data analysis for making predictive models.
PCA performs a linear transformation on the data so that most of the variance or information in your high-
dimensional dataset is captured by the first few principal components. The first principal component will capture
the most variance, followed by the second principal component, and so on.
Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, there is no redundant information. The total variance in the data is the sum of the variances of the individual components, so the number of principal components to retain is chosen according to the cumulative variance explained by them.
Implementation:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/HP/Dropbox/PC/Downloads/Wine.csv")
df.keys()
# print(df['DESCR'])  # DESCR exists only on the scikit-learn Bunch object, not on this CSV DataFrame
df.head(5)
Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols \
0 14.23 1.71 2.43 15.6 127 2.80
1 13.20 1.78 2.14 11.2 100 2.65
2 13.16 2.36 2.67 18.6 101 2.80
3 14.37 1.95 2.50 16.8 113 3.85
4 13.24 2.59 2.87 21.0 118 2.80
df.Customer_Segment.unique()
array([1, 2, 3], dtype=int64)
print(df.isnull().sum()) #checking is null
Alcohol 0
Malic_Acid 0
Ash 0
Ash_Alcanity 0
Magnesium 0
Total_Phenols 0
Flavanoids 0
Nonflavanoid_Phenols 0
Proanthocyanins 0
Color_Intensity 0
Hue 0
OD280 0
Proline 0
Customer_Segment 0
dtype: int64
X = df.drop('Customer_Segment', axis=1) # Features
y = df['Customer_Segment'] # Target variable
for col in X.columns:
    sc = StandardScaler()                # Standardize features by removing the mean and scaling to unit variance: z = (x - u) / s, so mean=0, std=1
    X[col] = sc.fit_transform(X[[col]])  # Fit to data, then transform it; computes the mean and std used for later scaling
X.head(5)
Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols \
0 1.518613 -0.562250 0.232053 -1.169593 1.913905 0.808997
1 0.246290 -0.499413 -0.827996 -2.490847 0.018145 0.568648
2 0.196879 0.021231 1.109334 -0.268738 0.088358 0.808997
3 1.691550 -0.346811 0.487926 -0.809251 0.930918 2.491446
4 0.295700 0.227694 1.840403 0.451946 1.281985 0.808997
n_components = 12  # Choose the desired number of principal components to reduce the dimension to
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
X_pca.shape
X.shape
red_indices = y[y == 1].index
white_indices = y[y == 2].index
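The listing stops before inspecting the fitted components. A brief continuation sketch (not part of the original listing), assuming the pca, X_pca, red_indices and white_indices objects created above, with the two Customer_Segment classes standing in for the red/white distinction mentioned in the Aim:

import numpy as np

# How much variance each principal component captures, and the running total
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

# Scatter the first two principal components for the two selected classes
plt.scatter(X_pca[red_indices, 0], X_pca[red_indices, 1], c='red', label='Segment 1')
plt.scatter(X_pca[white_indices, 0], X_pca[white_indices, 1], c='gold', label='Segment 2')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()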
# Conclusion: Here we have reduced the dimensionality; we can now apply any algorithm, such as classification or regression, to the transformed data.
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 1 B
◻ Aim: Apply LDA Algorithm on Iris Dataset and classify which species a given flower belongs to. Dataset
Link:https://www.kaggle.com/datasets/uciml/iris
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine
learning to solve more than two-class classification problems. It is also known as Normal Discriminant Analysis
(NDA) or Discriminant Function Analysis (DFA).
LDA can be used to project the features of a higher-dimensional space into a lower-dimensional space in order to reduce resources and dimensional costs. This section discusses the LDA algorithm for classification predictive modeling problems, the limitations of logistic regression, the representation of the linear discriminant analysis model, how to make a prediction using LDA, and how to prepare data for LDA.
While the logistic regression algorithm is limited to two-class problems, linear discriminant analysis is applicable to classification problems with more than two classes.
Linear discriminant analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also used as a pre-processing step in machine learning and in pattern classification applications.
Whenever there is a requirement to separate two or more classes having multiple features efficiently, linear discriminant analysis is the most commonly used technique. For example, if we have two classes with multiple features and try to classify them using a single feature, the classes may show overlapping.
To overcome the overlapping issue in the classification process, we would have to keep increasing the number of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below.
It may be impossible to draw a single straight line in the 2-D plane that separates these data points efficiently, but using linear discriminant analysis we can reduce the 2-D plane to a 1-D line. Using this technique, we can also maximize the separability between multiple classes.
Implementation:
import pandas as pd

df = pd.read_csv("Iris.csv")   # assumed path to the downloaded Kaggle Iris dataset
print(df)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Scale the new flower using the same scaler used for training
new_flower_scaled = scaler.transform(new_flower)
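The listing above is fragmentary: the data split, scaler, LDA model and new_flower sample are not shown. A minimal end-to-end sketch of the intended workflow, assuming the Kaggle Iris CSV with Id and Species columns (the file name and the new_flower measurements are illustrative, not the author's exact code):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("Iris.csv")                      # assumed file name
X = df.drop(['Id', 'Species'], axis=1)
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lda = LinearDiscriminantAnalysis(n_components=2)  # project onto two discriminant axes
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
X_test_lda = lda.transform(X_test_scaled)

clf = LogisticRegression().fit(X_train_lda, y_train)   # classifier on the LDA projection
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test_lda)))

# Classify a new flower (sepal length/width, petal length/width in cm; values are illustrative)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
new_flower_scaled = scaler.transform(new_flower)
print("Predicted species:", clf.predict(lda.transform(new_flower_scaled))[0])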
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 2 A
◻ Aim: Predict the price of the Uber ride from a given pickup point to the agreed drop-off location. Perform
following
tasks: 1. Pre-process the dataset. 2. Identify outliers. 3. Check the correlation. 4. Implement linear regression and
ridge, Lasso regression models. 5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
Dataset link: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core all regression methods analyze the effect of the independent variables on the dependent variable. Some important types of regression are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
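No implementation is included for this experiment in the manual. The following is a minimal sketch of the tasks listed in the Aim, assuming the usual column names of the Kaggle Uber fares dataset (fare_amount, pickup/dropoff coordinates, passenger_count) and a deliberately crude outlier filter; file name and model hyperparameters are illustrative:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("uber.csv")                                   # assumed file name
df = df.dropna()                                               # pre-process: drop missing values
df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 100)]   # crude outlier removal

features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
X = df[features]
y = df['fare_amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and compare linear, ridge and lasso regression using R2 and RMSE
for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R2 = {r2_score(y_test, pred):.3f}, RMSE = {rmse:.3f}")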
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 2 B
◻ Aim: Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: a.
Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis b.
Bivariate analysis: Linear and logistic regression modeling c. Multiple Regression analysis d. Also compare the
results of the above analysis for the two data sets. Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a
representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures
of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and
mode, while measures of variability include standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
Types of Descriptive Statistics
All descriptive statistics are either measures of central tendency or measures of variability, also known as measures of
dispersion.
Central Tendency
Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus
on the dispersion of data. These two measures use graphs, tables and general discussions to help people understand the
meaning of the analyzed data.
Measures of central tendency describe the center position of a distribution for a data set. A person analyzes the
frequency of each data point in the distribution and describes it using the mean, median, or mode, which measures the
most common patterns of the analyzed data set.
Measures of Variability
Measures of variability (or the measures of spread) aid in analyzing how dispersed the distribution is for a set of data.
For example, while the measures of central tendency may give a person the average of a data set, it does not describe
how the data is distributed within the set.
So while the average of the data may be 65 out of 100, there can still be data points at both 1 and 100. Measures of
variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute
deviation, and variance are all examples of measures of variability.
Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting
the lowest number (5) in the data set from the highest (100).
Distribution
Distribution (or frequency distribution) refers to the number of times a data point occurs, or, alternatively, to how often a data point fails to occur. Consider a data set: male, male, female, female, female, other. The distribution of this data can be described by the frequency of each value (male: 2, female: 3, other: 1).
For example, imagine a room full of high school students. Say you wanted to gather the average age of the individuals
in the room. This univariate data is only dependent on one factor: each person's age. By gathering this one piece of
information from each person and dividing by the total number of people, you can determine the average age.
Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two types of data are
collected, and the relationship between the two pieces of information is analyzed together. Because multiple variables
are analyzed, this approach may also be referred to as multivariate.
Imagine another example where a company sells hot sauce. The company gathers data such as the count of sales,
average quantity purchased per transaction, and average sale per day of the week. All of this information is descriptive,
as it tells a story of what actually happened in the past. In this case, it is not being used beyond being informational.
Let's say the same company wants to roll out a new hot sauce. It gathers the same sales data above, but it crafts the
information to make predictions about what the sales of the new hot sauce will be. The act of using descriptive statistics
and applying characteristics to a different data set makes the data set inferential statistics. We are no longer simply
summarizing data; we are using it to predict what will happen regarding an entirely different body of data (the new hot
sauce product).
Implementation:
import numpy as np
import pandas as pd
df = pd.read_csv("C:/Users/HP/Dropbox/PC/Downloads/diabetes.csv")
df.shape
(768, 9)
df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
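The loop that produced the per-column statistics printed below is not shown in the manual; a sketch using the pandas built-ins for frequency, mean, median, mode, variance, standard deviation, skewness and kurtosis would be:

for col in df.columns:
    print(f"Column: {col}")
    print("Frequency:")
    print(df[col].value_counts())
    print("Mean:", df[col].mean())
    print("Median:", df[col].median())
    print("Mode:")
    print(df[col].mode())
    print("Variance:", df[col].var())
    print("Standard Deviation:", df[col].std())
    print("Skewness:", df[col].skew())
    print("Kurtosis:", df[col].kurt())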
Mean: 3.8450520833333335
Median: 3.0
Mode:
0 1
Name: Pregnancies, dtype: int64
Variance: 11.35405632062142
Standard Deviation: 3.3695780626988623
Skewness: 0.9016739791518588
Kurtosis: 0.15921977754746486
Column: Glucose
Frequency:
99 17
100 17
111 14
129 14
125 14
..
191 1
177 1
44 1
62 1
190 1
Name: Glucose, Length: 136, dtype: int64
Mean: 120.89453125
Median: 117.0
Mode:
0 99
1 100
Name: Glucose, dtype: int64
Variance: 1022.2483142519557
Standard Deviation: 31.97261819513622
Skewness: 0.17375350179188992
Kurtosis: 0.6407798203735053
Column: BloodPressure
Frequency:
70 57
74 52
78 45
68 45
72 44
64 43
80 40
76 39
60 37
0 35
62 34
66 30
82 30
88 25
84 23
90 22
86 21
58 21
50 13
56 12
52 11
54 11
75 8
92 8
65 7
85 6
94 6
48 5
96 4
44 4
100 3
106 3
98 3
110 3
55 2
108 2
104 2
46 2
30 2
122 1
95 1
102 1
61 1
24 1
38 1
40 1
114 1
Name: BloodPressure, dtype: int64
Mean: 69.10546875
Median: 72.0
Mode:
0 70
Name: BloodPressure, dtype: int64
Variance: 374.6472712271838
Standard Deviation: 19.355807170644777
Skewness: -1.8436079833551302
Kurtosis: 5.180156560082496
Column: SkinThickness
Frequency:
0     227
32     31
30     27
27     23
23     22
33     20
28     20
18     20
31     19
19     18
39     18
29     17
40     16
25     16
26     16
22     16
37     16
41     15
35     15
36     14
15     14
17     14
20     13
24     12
42     11
13     11
21     10
46      8
34      8
12      7
38      7
11      6
43      6
16      6
45      6
14      6
44      5
10      5
48      4
47      4
49      3
50      3
8       2
7       2
52      2
54      2
63      1
60      1
56      1
51      1
99      1
Mean: 20.536458333333332
Median: 23.0
Mode:
0 0
Name: SkinThickness, dtype: int64
Variance: 254.47324532811953
Standard Deviation: 15.952217567727677
Skewness: 0.10937249648187608
Kurtosis: -0.520071866153013
Column: Insulin
Frequency:
0 374
105 11
130 9
140 9
120 8
..
73 1
171 1
255 1
52 1
112 1
Name: Insulin, Length: 186, dtype: int64
Mean: 79.79947916666667
Median: 30.5
Mode:
0 0
Name: Insulin, dtype: int64
Variance: 13281.180077955281
Standard Deviation: 115.24400235133837
Skewness: 2.272250858431574
Kurtosis: 7.2142595543487715
Column: BMI
Frequency:
32.0 13
31.6 12
31.2 12
0.0 11
32.4 10
..
36.7 1
41.8 1
42.6 1
42.8 1
46.3 1
Name: BMI, Length: 248, dtype: int64
Mean: 31.992578124999998
Median: 32.0
Mode:
0 32.0
Name: BMI, dtype: float64
Variance: 62.15998395738257
Standard Deviation: 7.8841603203754405
Skewness: -0.42898158845356543
Kurtosis: 3.290442900816981
Column: DiabetesPedigreeFunction
Frequency:
0.258 6
0.254 6
0.268 5
0.207 5
0.261 5
..
1.353 1
0.655 1
0.092 1
0.926 1
0.171 1
Name: DiabetesPedigreeFunction, Length: 517, dtype: int64
Mean: 0.47187630208333325
Median: 0.3725
Mode:
0 0.254
1 0.258
Name: DiabetesPedigreeFunction, dtype: float64
Variance: 0.10977863787313938
Standard Deviation: 0.33132859501277484
Skewness: 1.919911066307204
Kurtosis: 5.5949535279830584
Column: Age
Frequency:
22    72
21    63
25    48
24    46
23    38
28    35
26    33
27    32
29    29
31    24
41    22
30    21
37    19
42    18
33    17
38    16
36    16
32    16
45    15
34    14
46    13
43    13
40    13
39    12
35    10
50     8
51     8
52     8
44     8
58     7
47     6
54     6
49     5
48     5
57     5
53     5
60     5
66     4
63     4
62     4
55     4
67     3
56     3
59     3
65     3
69     2
61     2
72     1
81     1
64     1
70     1
68     1
Mean: 33.240885416666664
Median: 29.0
Mode:
0 22
Name: Age, dtype: int64
Variance: 138.30304589037365
Standard Deviation: 11.76023154067868
Skewness: 1.1295967011444805
Kurtosis: 0.6431588885398942
Column: Outcome
Frequency:
0 500
1 268
Name: Outcome, dtype: int64
Mean: 0.3489583333333333
Median: 0.0
Mode:
0 0
Name: Outcome, dtype: int64
Variance: 0.22748261625380098
Standard Deviation: 0.4769513772427971
Skewness: 0.635016643444986
Kurtosis: -1.600929755156027
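The cells that prepare and fit the linear regression model used below are not shown. The following is a sketch consistent with the coefficient listing that follows, assuming Outcome is the target of the regression analysis:

from sklearn.linear_model import LinearRegression

X_linear = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
               'BMI', 'DiabetesPedigreeFunction', 'Age']]
y_linear = df['Outcome']
model_linear = LinearRegression().fit(X_linear, y_linear)

print("Linear Regression Coefficients:")
for name, coef in zip(X_linear.columns, model_linear.coef_):
    print(f"{name}: {coef}")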
# Make predictions
predictions_linear = model_linear.predict(X_linear)
Linear Regression Coefficients:
Glucose: 0.005932504680360896
BloodPressure: -0.00227883712542089
SkinThickness: 0.00016697889986787442
Insulin: -0.0002096169514137912
BMI: 0.013310837289280066
DiabetesPedigreeFunction: 0.1376781570786881
Age: 0.005800684345071733
# Prepare the data
X_logistic = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y_logistic = df['Outcome']
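The fitting step for the logistic regression model used below is not shown; a minimal sketch (max_iter is raised here only so the solver converges on this data):

from sklearn.linear_model import LogisticRegression

model_logistic = LogisticRegression(max_iter=1000)
model_logistic.fit(X_logistic, y_logistic)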
# Make predictions
predictions_logistic = model_logistic.predict(X_logistic)
Logistic Regression Coefficients:
Glucose: 0.03454477124790582
BloodPressure: -0.01220824032665116
SkinThickness: 0.0010051963882454211
Insulin: -0.0013499454083243116
BMI: 0.08780751006435426
DiabetesPedigreeFunction: 0.8191678019528903
Age: 0.032699759788267134
# Split the dataset into the independent variables (X) and the dependent variable (y)
X = df.drop('Outcome', axis=1) # Independent variables
y = df['Outcome'] # Dependent variable
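The statsmodels call that produces the OLS summary below is not shown; a sketch, assuming the X and y defined above:

import statsmodels.api as sm

X_const = sm.add_constant(X)          # add the intercept term
ols_model = sm.OLS(y, X_const).fit()
print(ols_model.summary())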
OLS Regression Results
==============================================================================
Dep. Variable: Outcome R-squared: 0.303
Model: OLS Adj. R-squared: 0.296
Method: Least Squares F-statistic: 41.29
Date: Sat, 08 Jul 2023 Prob (F-statistic): 7.36e-55
Time: 15:59:17 Log-Likelihood: -381.91
No. Observations: 768 AIC: 781.8
Df Residuals: 759 BIC: 823.6
Df Model: 8
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
from matplotlib import pyplot
corr = df.corr()                 # correlation matrix of all columns
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names = df.columns
# Rotate x-tick labels by 90 degrees
ax.set_xticklabels(names,rotation=90)
ax.set_yticklabels(names)
pyplot.show()
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 3 A
◻ Aim: Implementation of Support Vector Machines (SVM) for classifying images of handwritten digits into their
respective numerical classes (0 to 9).
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
Classification analysis is used to group or classify objects according to shared characteristics. Moreover, this analysis
can be used in many applications, from segmenting customers for marketing campaigns to forecasting stock market
trends.
Classification Analysis Example
● Classifying images
One example of a classification analysis is the use of supervised learning algorithms to classify images. In this case,
the algorithm is provided with an image dataset (the training set) that contains labelled images.
The algorithm uses labels to learn how to distinguish between different types of objects in the picture. Once trained, it
can then be used to classify new images as belonging to one category or another.
● Customer Segmentation
Another example of classification analysis would be customer segmentation for marketing campaigns. Classification
algorithms group customers into segments based on their characteristics and behaviours.
This helps marketers target specific groups with tailored content, offers, and promotions that are more likely to
appeal to them.
● Stock Market Prediction
Finally, classification analysis can also be used for stock market prediction. Classification algorithms can identify
patterns between past stock prices and other economic indicators, such as interest rates or unemployment figures. By
understanding these correlations, analysts can better predict future market trends and make more informed investment
decisions.
These are just some examples of how classification analysis can be applied to various scenarios. Unquestionably,
classification algorithms can be used to analyse datasets in any domain, from healthcare and finance to agriculture
and logistics.
Classification Analysis Techniques
This analysis is a powerful technique used in data science to analyse and categorise data. Classification techniques
are used in many areas, from predicting customer behaviours to finding patterns and trends in large datasets.
This analysis can help businesses make informed decisions about marketing strategies, product development, and
more. So, let's delve into the various techniques.
1. Supervised Learning
Supervised learning algorithms require labelled data. This means the algorithm is provided with a dataset that has
already been categorised or labelled with class labels. The algorithm then uses this label to learn how to distinguish
between different class objects in the data. Once trained, it can use its predictive power to classify new datasets.
2. Unsupervised Learning
Unsupervised learning algorithms do not require labelled data. Instead, they use clustering and dimensionality
reduction techniques to identify patterns in the dataset without any external guidance. These algorithms help segment
customers or identify outlier items in a dataset.
3. Deep Learning
Deep learning is a subset/division of machine learning technologies that use artificial neural networks. These
algorithms are capable of learning from large datasets and making complex decisions. Deep learning can be used for
tasks such as image classification, natural language processing, and predictive analytics.
Classification algorithms can help uncover patterns in the data that could not be detected using traditional methods.
By using classification analysis, businesses can gain valuable insights into their customers' behaviours and
preferences, helping them make more informed decisions.
Implementation:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
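The cell that loads the data is not shown; the outputs below are consistent with scikit-learn's built-in handwritten digits dataset, so the loading step was presumably something like:

from sklearn.datasets import load_digits

digits = load_digits()
digits.keys()        # inspect the Bunch object ('data', 'images', 'target', ...)
digits['data']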
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 6., ..., 0., 0., 0.]],
...,
digits['data'][0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
digits['images'][1]
array([[ 0., 0., 0., 12., 13., 5., 0., 0.],
[ 0., 0., 0., 11., 16., 9., 0., 0.],
[ 0., 0., 3., 15., 16., 6., 0., 0.],
[ 0., 7., 15., 16., 16., 2., 0., 0.],
[ 0., 0., 1., 16., 16., 3., 0., 0.],
[ 0., 0., 1., 16., 16., 6., 0., 0.],
[ 0., 0., 1., 16., 16., 6., 0., 0.],
[ 0., 0., 0., 11., 16., 10., 0., 0.]])
digits['target'][0:9]
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
digits['target'][0]
0
digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
# Each digit is represented in digits.images as an 8x8 matrix of 64 pixels. Each of the 64 values
# represents a greyscale intensity; imshow then plots the greyscale values on the right scale.
iso = Isomap(n_components=2)
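The Isomap import, the 2-D projection plot, and the definitions of main_data, targets and svc used by the cells below are not shown in the manual. The following is a sketch of how they are presumably set up (the SVC hyperparameters are an assumption):

from sklearn.manifold import Isomap
from sklearn.svm import SVC

main_data = digits['data']
targets = digits['target']

# Project the 64-dimensional digit vectors to 2-D and colour points by digit label
proj = iso.fit_transform(main_data)
plt.scatter(proj[:, 0], proj[:, 1], c=targets, cmap='tab10', s=10)
plt.colorbar(label='digit')
plt.show()

# Support Vector Machine classifier, trained below on the first 1500 samples
svc = SVC(gamma=0.001)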
svc.fit(main_data[:1500] , targets[:1500])
predictions = svc.predict(main_data[1501:])
list(zip(predictions , targets[1501:]))
[(7, 7), (8, 8), (6, 6),
(4, 4), (4, 4), (1, 1),
(6, 6), (3, 3), (7, 7),
(3, 3), (1, 1), (5, 5),
(1, 1), (4, 4), (4, 4),
(3, 3), (0, 0), (4, 4),
(9, 9), (5, 5), (7, 7),
(1, 1), (3, 3), (2, 2),
(7, 7), (6, 6), (8, 8),
(6, 6), (9, 9), (2, 2),
(2, 2), (0, 0), (7, 7),
(5, 5), (9, 9), (9, 4),
(7, 7), (8, 8), (6, 6),
(9, 9), (9, 9), (3, 3),
(5, 5), (8, 8), (1, 1),
(4, 4), (4, 4), (3, 3),
(8, 8), (1, 1), (9, 9),
(8, 8), (7, 7), (1, 1),
(4, 4), (7, 7), (7, 7),
(9, 9), (3, 3), (6, 6),
(0, 0), (5, 5), (8, 8),
(8, 8), (1, 1), (4, 4),
(9, 9), (0, 0), (3, 3),
(8, 8), (0, 0), (1, 1),
(0, 0), (2, 2), (4, 4),
(1, 1), (2, 2), (0, 0),
(2, 2), (7, 7), (5, 5),
(3, 3), (8, 8), (3, 3),
(4, 4), (2, 2), (6, 6),
(5, 5), (0, 0), (9, 9),
(6, 6), (1, 1), (6, 6),
(7, 7), (2, 2), (1, 1),
(1, 8), (6, 6), (7, 7),
(9, 9), (8, 3), (5, 5),
(0, 0), (3, 3), (4, 4),
(1, 1), (7, 7), (4, 4),
(2, 2), (3, 3), (7, 7),
(3, 3), (3, 3), (2, 2),
(4, 4), (4, 4), (2, 2),
(5, 5), (6, 6), (5, 5),
(6, 6), (6, 6), (7, 7),
(9, 9), (6, 6), (8, 9),
(0, 0), (9, 4), (5, 5),
(1, 1), (9, 9), (9, 4),
(2, 2), (1, 1), (4, 4),
(3, 3), (5, 5), (5, 9),
(4, 4), (0, 0), (0, 0),
(5, 5), (9, 9), (8, 8),
(6, 6), (5, 5), (9, 9),
(7, 7), (2, 2), (8, 8),
(1, 8), (8, 8), (0, 0),
(9, 9), (0, 0), (1, 1),
(4, 0), (1, 1), (2, 2),
(9, 9), (7, 7), (3, 3),
(5, 5), (6, 6), (4, 4),
(5, 5), (3, 3), (5, 5),
(6, 6), (2, 2), (6, 6),
(5, 5), (1, 1), (7, 7),
(8, 8), (0, 0), (1, 1),
(9, 9), (2, 2), (3, 3),
(0, 0), (2, 2), (9, 9),
(1, 1), (7, 7), (1, 1),
(2, 2), (8, 8), (7, 7),
(3, 3), (2, 2), (6, 6),
(4, 4), (0, 0), (8, 8),
(5, 5), (1, 1), (4, 4),
(6, 6), (2, 2), (5, 3),
(7, 7), (6, 6), (1, 1),
(8, 8), (8, 3), (4, 4),
(9, 9), (8, 3), (0, 0),
(0, 0), (7, 7), (5, 5),
(1, 1), (5, 3), (3, 3),
(2, 2), (3, 3), (6, 6),
(8, 3), (4, 4), (9, 9),
(4, 4), (6, 6), (6, 6),
(5, 5), (6, 6), (1, 1),
(6, 6), (6, 6), (7, 7),
(7, 7), (4, 4), (5, 5),
(8, 8), (9, 9), (4, 4),
(9, 9), (1, 1), (4, 4),
(0, 0), (5, 5), (7, 7),
(9, 9), (0, 0), (2, 2),
(5, 5), (9, 9), (8, 8),
(5, 5), (5, 5), (2, 2),
(6, 6), (2, 2), (2, 2),
(5, 5), (8, 8), (5, 5),
(0, 0), (2, 2), (7, 7),
(9, 9), (0, 0), (9, 9),
(8, 8), (0, 0), (5, 5),
(9, 9), (1, 1), (4, 4),
(8, 8), (7, 7), (8, 8),
(4, 4), (6, 6), (8, 8),
(1, 1), (3, 3), (4, 4),
(7, 7), (2, 2), (9, 9),
(7, 7), (1, 1), (0, 0),
(3, 3), (7, 7), (8, 8),
(5, 5), (4, 4), (9, 9),
(1, 1), (6, 6), (8, 8), (0, 0), (3, 3)]
Create the Confusion Matrix for Performance Evaluation
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(predictions, targets[1501:])
plt.figure(figsize=(8, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap="YlGnBu");
cm
array([[26, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 30, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 27, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 24, 0, 0, 0, 0, 0, 0],
[ 1, 0, 0, 0, 30, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 30, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 30, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 30, 0, 0],
[ 0, 0, 0, 4, 0, 0, 0, 0, 26, 1],
[ 0, 0, 0, 0, 3, 0, 0, 0, 0, 29]])
from sklearn.metrics import classification_report
print(classification_report(predictions, targets[1501:]))
precision recall f1-score support
macro avg 0.95 0.96 0.95 296
weighted avg 0.96 0.95 0.95 296
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 4 A
◻ Aim: Implement K-Means clustering on Iris.csv dataset. Determine the number of clusters
using the elbow method. Dataset Link: https://www.kaggle.com/datasets/uciml/iris
◻ Hardware Requirement:
◻ Software Requirement:
Jupyter Notebook/Ubuntu
◻ Theory:
The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.
In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroids is minimized. It is to be understood that less variation within a cluster leads to more similar data points within the same cluster.
We can understand the working of the K-Means clustering algorithm with the help of the following steps −
Step 1 − First, specify the number of clusters, K, that need to be generated by this algorithm.
Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple words, classify the data based on the number of data points.
Step 3 − Now it will compute the cluster centroids.
Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of data points to the clusters no longer changes −
4.1 − First, the sum of squared distances between data points and centroids is computed.
4.2 − Now, assign each data point to the cluster it is closest to (the nearest centroid).
4.3 − At last compute the centroids for the clusters by taking the average of all data points of that cluster.
K-means follows Expectation-Maximization approach to solve the problem. The Expectation-step is used for
assigning the data points to the closest cluster and the Maximization-step is used for computing the centroid of
each cluster.
While working with K-means algorithm we need to take care of the following things −
Implementation:
First we read the data from the dataset using read_csv from the pandas library.
import pandas as pd
import numpy as np

data = pd.read_csv('data/iris.csv')
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Checking the sample size of data - how many samples are there in the dataset using len().
len(data)
150
Checking the dimensions/shape of the dataset using shape.
data.shape
(150, 6)
Viewing Column names of the dataset using columns
data.columns
Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
'Species'],
dtype='object')
for i,col in enumerate(data.columns):
    print(f'Column number {1+i} is {col}')
Column number 1 is Id
Column number 2 is SepalLengthCm
Column number 3 is SepalWidthCm
Column number 4 is PetalLengthCm
Column number 5 is PetalWidthCm
Column number 6 is Species
So, our dataset has 6 columns named:
• Id
• SepalLengthCm
• SepalWidthCm
• PetalLengthCm
• PetalWidthCm
• Species.
View datatypes of each column in the dataset using dtype.
data.dtypes
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
Gathering Further information about the dataset using info()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
Describing the data as basic statistics using describe()
data.describe()
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000
Checking the data for inconsistencies and further cleaning the data if needed.
Checking data for missing values using isnull().
data.isnull()
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
.. ... ... ... ... ... ...
145 False False False False False False
146 False False False False False False
147 False False False False False False
148 False False False False False False
149 False False False False False False
Modelling
K - Means Clustering
K-means clustering is a clustering algorithm that aims to partition n observations into k clusters.
Initialisation – K initial "means" (centroids) are generated at random. Assignment – K clusters are created by associating each observation with the nearest centroid. Update – the centroid of each cluster becomes the new mean. Assignment and Update are repeated iteratively until convergence. The end result is that the
sum of squared errors is minimised between points and their respective centroids. We will use KMeans
Clustering. At first we will find the optimal clusters based on inertia and using elbow method. The
distance between the centroids and the data points should be less.
First we need to check the data for any missing values as it can ruin our model.
data.isna().sum()
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
We conclude that we don't have any missing values therefore we can go forward and start the clustering
procedure.
We will now view and select the data that we need for clustering.
data.head()
Checking the value count of the target column i.e. 'Species' using value_counts()
data['Species'].value_counts()
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
Target Data
target_data = data.iloc[:, 5]   # the 'Species' column
target_data.head()
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Name: Species, dtype: object
Training data
clustering_data = data.iloc[:, [1, 2, 3, 4]]   # the four measurement columns (excluding Id and Species)
clustering_data.head()
Now, we need to visualize the data which we are going to use for the clustering. This will give us a fair
idea about the data we're working on.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
ax = sns.scatterplot(x=data['SepalLengthCm'], y=data['SepalWidthCm'], s=70, color='#f73434',
edgecolor='#f73434', linewidth=0.3)
ax.set_ylabel('Sepal Width (in cm)')
ax.set_xlabel('Sepal Length (in cm)')
plt.title('Sepal Length vs Width', fontsize = 20)
plt.show()
This gives us a fair Idea and patterns about some of the data.
The Elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and
then for each value of k computes an average score for all clusters. By default, the distortion score is
computed, the sum of square distances from each point to its assigned center.
When these overall metrics for each model are plotted, it is possible to visually determine the best value
for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best
value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good
indication that the underlying model fits best at that point.
We use the Elbow Method, which plots the Within Cluster Sum of Squares (WCSS) against the number of clusters (K value) to figure out the optimal number of clusters. WCSS measures the sum of squared distances of observations from their cluster centroids, given by

WCSS = Σ_i ||X_i − Y_i||²

where Y_i is the centroid for observation X_i. The main goal is to maximize the number of clusters; in the limiting case each data point becomes its own cluster centroid.
With this simple loop we collect all the inertia values, i.e. the within-cluster sum of squares for each K.
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(i)
    km.fit(clustering_data)
    wcss.append(km.inertia_)
np.array(wcss)
Now, we visualize the Elbow Method so that we can determine the number of optimal clusters for our
dataset.
fig, ax = plt.subplots(figsize=(15,7))
ax = plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.axvline(x=3, ls='--')
plt.ylabel('WCSS')
plt.xlabel('No. of Clusters (k)')
plt.title('The Elbow Method', fontsize = 20)
plt.show()
It is clear, that the optimal number of clusters for our data are 3, as the slope of the curve is not steep
enough after it. When we observe this curve, we see that last elbow comes at k = 3, it would be difficult to
visualize the elbow if we choose the higher range.
Clustering
Now we will build the model for creating clusters from the dataset. We will use n_clusters = 3 i.e. 3
clusters as we have determined by the elbow method, which would be optimal for our dataset.
Our data set is for unsupervised learning, therefore we will use fit_predict(). (If we were working with a supervised learning data set we would use fit_transform().)
kms = KMeans(n_clusters=3)
Now that we have the clusters created, we will enter them into a different column
clusters = clustering_data.copy()
clusters['Cluster_Prediction'] = kms.fit_predict(clustering_data)
clusters.head()
Cluster_Prediction
0 1
1 1
2 1
3 1
4 1
We can also get the centroids of the clusters by the cluster_centers_ attribute of KMeans algorithm.
kms.cluster_centers_
Now we have all the data we need, we just need to plot the data. We will plot the data using scatterplot
which will allow us to observe different clusters in different colours.
fig, ax = plt.subplots(figsize=(15,7))
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 0]['SepalLengthCm'],
y=clusters[clusters['Cluster_Prediction'] == 0]['SepalWidthCm'],
s=70,edgecolor='teal', linewidth=0.3, c='teal', label='Iris-versicolor')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 1]['SepalLengthCm'],
y=clusters[clusters['Cluster_Prediction'] == 1]['SepalWidthCm'],
s=70,edgecolor='lime', linewidth=0.3, c='lime', label='Iris-setosa')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 2]['SepalLengthCm'],
y=clusters[clusters['Cluster_Prediction'] == 2]['SepalWidthCm'],
s=70,edgecolor='magenta', linewidth=0.3, c='magenta', label='Iris-virginica')
EXPERIMENT NO. 5 B
Aim: Use different voting mechanism and Apply AdaBoost (Adaptive Boosting), Gradient Tree
Boosting (GBM), XGBoost classification on Iris dataset and compare the performance of three
models using different evaluation measures. Dataset Link
https://www.kaggle.com/datasets/uciml/iris
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Theory:
Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide
their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would
vote to arrive at a final decision.
In a random forest classification, multiple decision trees are created using different random subsets of the data and
features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made
by calculating the prediction for each decision tree, then taking the most popular result. (For regression, predictions
use an averaging technique instead.)
In the diagram below, we have a random forest with n decision trees, and we've shown the first 5, along with their
predictions (either “Dog” or “Cat”). Each tree is exposed to a different number of features and a different sample
of the original dataset, and as such, every tree can be different. Each tree makes a prediction. Looking at the first 5
trees, we can see that 4/5 predicted the sample was a Cat. The green circles indicate a hypothetical path the tree
took to reach its decision. The random forest would count the number of predictions from decision trees for Cat and
for Dog, and choose the most popular prediction.
Implementation:
import pandas as pd
from sklearn.datasets import load_digits
digits = load_digits()
dir(digits)
%matplotlib inline
import matplotlib.pyplot as plt
plt.gray()
for i in range(4):
plt.matshow(digits.images[i])
df = pd.DataFrame(digits.data)
df.head()
0 1 2 3 4 5 6 7 8 9 ... 54 55 56 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
57 58 59 60 61 62 63
0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 2.0 16.0 4.0 0.0 0.0
[5 rows x 64 columns]
df['target'] = digits.target
df[0:12]
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
5 0.0 0.0 12.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
6 0.0 0.0 0.0 12.0 13.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
7 0.0 0.0 7.0 8.0 13.0 16.0 15.0 1.0 0.0 0.0 ... 0.0 0.0 0.0
8 0.0 0.0 9.0 14.0 8.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
9 0.0 0.0 11.0 12.0 0.0 0.0 0.0 0.0 0.0 2.0 ... 0.0 0.0 0.0
10 0.0 0.0 1.0 9.0 15.0 11.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
11 0.0 0.0 0.0 0.0 14.0 13.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
58 59 60 61 62 63 target
0 6.0 13.0 10.0 0.0 0.0 0.0 0
1 0.0 11.0 16.0 10.0 0.0 0.0 1
2 0.0 3.0 11.0 16.0 9.0 0.0 2
3 7.0 13.0 13.0 9.0 0.0 0.0 3
4 0.0 2.0 16.0 4.0 0.0 0.0 4
5 9.0 16.0 16.0 10.0 0.0 0.0 5
6 1.0 9.0 15.0 11.0 3.0 0.0 6
7 13.0 5.0 0.0 0.0 0.0 0.0 7
8 11.0 16.0 15.0 11.0 1.0 0.0 8
9 9.0 12.0 13.0 3.0 0.0 0.0 9
10 1.0 10.0 13.0 3.0 0.0 0.0 0
11 0.0 1.0 13.0 16.0 1.0 0.0 1
X = df.drop('target',axis='columns')
y = df.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.9805555555555555
y_predicted = model.predict(X_test)
Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)
cm
array([[32, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 30, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 32, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 37, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 35, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 41, 1, 0, 0, 1],
[ 0, 0, 0, 0, 1, 0, 35, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 52, 0, 2],
[ 1, 0, 0, 0, 0, 0, 0, 0, 32, 0],
[ 0, 0, 0, 0, 1, 0, 0, 0, 0, 27]])
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
plt.figure(figsize=(10,7))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
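The listing above demonstrates a Random Forest on the digits data rather than the boosting models named in the Aim. The following is a minimal sketch of the AdaBoost / Gradient Boosting / XGBoost comparison on the Iris dataset, assuming the Kaggle "Iris.csv" file with Id and Species columns and that the third-party xgboost package is installed; hyperparameters are illustrative:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier

iris = pd.read_csv('Iris.csv')
X = iris.drop(['Id', 'Species'], axis=1)
y = LabelEncoder().fit_transform(iris['Species'])   # encode species names as 0/1/2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'AdaBoost': AdaBoostClassifier(n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100, eval_metric='mlogloss'),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name, 'accuracy:', accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))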
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 6 C
Aim: Build a Tic-Tac-Toe game using reinforcement learning in Python by using following tasks
a. Setting up the environment
b. Defining the Tic-Tac-Toe game
c. Building the reinforcement learning model
d. Training the model
e. Testing the model
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Theory:
In reinforcement learning, developers devise a method of rewarding desired behaviors and punishing negative
behaviors. This method assigns positive values to the desired actions to encourage the agent and negative values to
undesired behaviors. This programs the agent to seek long-term and maximum overall reward to achieve an optimal
solution.
These long-term goals help prevent the agent from stalling on lesser goals. With time, the agent learns to avoid the
negative and seek the positive. This learning method has been adopted in artificial intelligence (AI) as a way of
directing unsupervised machine learning through rewards and penalties.
Common reinforcement learning algorithms
Rather than referring to a specific algorithm, the field of reinforcement learning is made up of several algorithms
that take somewhat different approaches. The differences are mainly due to their strategies for exploring their
environments.
● Policy-based methods start by giving the agent what's known as a policy. The policy is essentially a probability that tells it the odds of certain actions resulting in rewards, or beneficial states.
● Q-learning. This approach to reinforcement learning takes the opposite approach: the agent receives no initial policy, so its exploration of the environment is more self-directed and it learns the value of actions from the rewards it receives.
Implementation:
import numpy as np

class TicTacToeEnvironment:
    def __init__(self):
        self.state = [0] * 9
        self.is_terminal = False

    def reset(self):
        self.state = [0] * 9
        self.is_terminal = False

    def get_available_moves(self):
        return [i for i, mark in enumerate(self.state) if mark == 0]

    def is_draw(self):
        return 0 not in self.state

class QLearningAgent:
    def __init__(self, learning_rate=0.9, discount_factor=0.9, exploration_rate=0.3):
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.q_table = np.zeros((3**9, 9))
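    # The methods below are referenced by the training loop and by the explanation later in this
    # write-up, but are not shown in the original listing. This is a minimal sketch, assuming an
    # epsilon-greedy action choice and the standard Q-learning update; names and details are
    # illustrative, not the author's exact code.
    def get_state_index(self, state):
        # Encode the board (marks -1/0/1) as a base-3 number in the range 0..3**9 - 1
        return sum((mark + 1) * (3 ** i) for i, mark in enumerate(state))

    def choose_action(self, state, available_moves):
        if np.random.rand() < self.exploration_rate:
            return int(np.random.choice(available_moves))           # explore
        q_values = self.q_table[self.get_state_index(state)]
        return max(available_moves, key=lambda a: q_values[a])      # exploit

    def update_q_table(self, state, action, next_state, reward):
        s = self.get_state_index(state)
        s_next = self.get_state_index(next_state)
        best_next = np.max(self.q_table[s_next])
        self.q_table[s, action] += self.learning_rate * (
            reward + self.discount_factor * best_next - self.q_table[s, action])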
for _ in range(num_episodes):
    environment.reset()
    current_agent = agent1
    while not environment.is_terminal:
        available_moves = environment.get_available_moves()
        current_state = environment.state.copy()
        action = current_agent.choose_action(current_state, available_moves)
        environment.make_move(action, 1 if current_agent == agent1 else -1)
        next_state = environment.state.copy()
        reward = 0
        if environment.check_win(1 if current_agent == agent1 else -1):
            reward = -10
        current_agent.update_q_table(current_state, action, next_state, reward)
# Create agents
agent1 = QLearningAgent()
agent2 = QLearningAgent()
# Evaluate agents
agent1_wins, agent2_wins, draws = evaluate_agents(agent1, agent2)
# Print results
print(f"Agent 1 wins: {agent1_wins}")
print(f"Agent 2 wins: {agent2_wins}")
print(f"Draws: {draws}")
TicTacToeEnvironment:
This class represents the Tic-Tac-Toe game environment. It maintains the current state of the game,
checks for a win or draw, and provides methods to reset the game and make moves.
The __init__ method initializes the game state and sets the terminal flag to False.
The reset method resets the game state and the terminal flag.
The get_available_moves method returns a list of indices representing the available moves in the current
game state.
The make_move method updates the game state by placing a player's mark at the specified move index.
The check_win method checks if a player has won the game by examining the current state.
The is_draw method checks if the game has ended in a draw.
QLearningAgent:
This class represents the Q-learning agent. It learns to play Tic-Tac-Toe by updating a Q-table based on
the rewards received during gameplay.
The __init__ method initializes the learning rate, discount factor, exploration rate, and the Q-table.
The get_state_index method converts the current game state into a unique index for indexing the Q-table.
The choose_action method selects the action (move) to be taken based on the current game state and the
exploration-exploitation tradeoff using the epsilon-greedy policy.
The update_q_table method updates the Q-table based on the current state, action, next state, and the
reward received.
evaluate_agents:
This function performs the evaluation of two Q-learning agents by playing multiple episodes of Tic-Tac-Toe
games.
It takes the two agents and the number of episodes to play as input.
In each episode, the environment is reset, and the agents take turns making moves until the game is over
(either a win or a draw).
The agents update their Q-tables based on the rewards received during the episode.
The function keeps track of the wins and draws for each agent and returns the counts.
Main code:
The main code creates two Q-learning agents, agent1 and agent2, using the QLearningAgent class.
The evaluate_agents function is called to evaluate the agents by playing a specified number of episodes.
The results (number of wins and draws) for each agent are printed.
The agents choose their moves based on the current game state and the exploration-exploitation policy.
The environment updates the game state based on the chosen moves.
The environment checks if the game has ended (win or draw).
The agents update their Q-tables based on the rewards received.
The agents continue playing until the specified number of episodes is completed.
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Implementation:
import requests
import pandas as pd
import datetime
# Set the location for which you want to retrieve weather data
lat = 18.184135
lon = 74.610764
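The request that fills weather_data is not shown in the listing. A minimal sketch is given below; the 5-day/3-hour forecast endpoint is inferred from the 40-entry response shown next, and the API key is a placeholder.
api_key = "YOUR_API_KEY"   # placeholder - replace with a valid OpenWeatherMap key
url = f"https://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&appid={api_key}"
weather_data = requests.get(url).json()   # dictionary whose 'list' key holds 40 forecast entries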
weather_data.keys()
weather_data['list'][0]
{'dt': 1690189200,
'main': {'temp': 298.21,
'feels_like': 298.81,
'temp_min': 298.1,
'temp_max': 298.21,
'pressure': 1006,
'sea_level': 1006,
'grnd_level': 942,
'humidity': 78,
'temp_kf': 0.11},
'weather': [{'id': 804,
'main': 'Clouds',
'description': 'overcast clouds',
'icon': '04d'}],
'clouds': {'all': 100},
'wind': {'speed': 6.85, 'deg': 258, 'gust': 12.9},
'visibility': 10000,
'pop': 0.59,
'sys': {'pod': 'd'},
'dt_txt': '2023-07-24 09:00:00'}
len(weather_data['list'])
40
weather_data['list'][0]['weather'][0]['description']
{"type":"string"}
# Getting the data from the dictionary and putting it into separate variables
# Extract relevant weather attributes using list comprehensions
temperatures = [item['main']['temp'] for item in weather_data['list']]   # extracts all 40 temperature values into one list
timestamps = [pd.to_datetime(item['dt'], unit='s') for item in weather_data['list']]
temperature = [item['main']['temp'] for item in weather_data['list']]
humidity = [item['main']['humidity'] for item in weather_data['list']]
wind_speed = [item['wind']['speed'] for item in weather_data['list']]
weather_description = [item['weather'][0]['description'] for item in weather_data['list']]
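The construction of weather_df used in the cells below is not shown; a hedged sketch that builds it from the lists extracted above and indexes it by timestamp (so the daily resampling later works) is:
import matplotlib.pyplot as plt   # needed for the plots that follow

weather_df = pd.DataFrame({
    'Temperature': temperature,
    'humidity': humidity,
    'wind_speed': wind_speed,
    'weather_description': weather_description
}, index=pd.DatetimeIndex(timestamps, name='Timestamp'))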
max_temp = weather_df['Temperature'].max()
max_temp
298.9
min_temp = weather_df['Temperature'].min()
min_temp
294.92
weather_df['Temperature'] = weather_df['Temperature'].apply(lambda x: x - 273.15 if x > 100 else x)   # Convert temperature from Kelvin to Celsius
daily_mean_temp = weather_df['Temperature'].resample('D').mean()
daily_mean_humidity = weather_df['humidity'].resample('D').mean()
daily_mean_wind_speed = weather_df['wind_speed'].resample('D').mean()
# Plot the relationship between temperature and wind speed (Scatter plot)
plt.figure(figsize=(10, 6))
plt.scatter(weather_df['Temperature'], weather_df['wind_speed'], color='green')
plt.title('Temperature vs. Wind Speed')
plt.xlabel('Temperature (°C)')
plt.ylabel('Wind Speed (m/s)')
plt.grid(True)
plt.show()
### Scatter Plot: Temperature vs Humidity
# Create a scatter plot to visualize the relationship between temperature and humidity
plt.scatter(weather_df['Temperature'], weather_df['humidity'])
plt.xlabel('Temperature (°C)')
plt.ylabel('Humidity (%)')
plt.title('Temperature vs Humidity Scatter Plot')
plt.show()
###Geospatial Map
import requests
import pandas as pd
import geopandas as gpd
import folium
# Specify the locations for which you want to retrieve weather data
locations = ['London', 'Paris', 'New York']
weather_df = pd.DataFrame()
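The loop that fills weather_df for each city and the world_map GeoDataFrame merged below are not shown in the listing. A hedged sketch follows; it uses pd.concat instead of the deprecated DataFrame.append that produced the FutureWarning shown next, the current-weather endpoint and API key are assumptions, and the naturalearth_lowres dataset name depends on the geopandas version.
api_key = "YOUR_API_KEY"   # placeholder
frames = []
for location in locations:
    url = f"https://api.openweathermap.org/data/2.5/weather?q={location}&appid={api_key}"
    item = requests.get(url).json()
    frames.append(pd.DataFrame([{'Location': location,
                                 'Latitude': item['coord']['lat'],
                                 'Longitude': item['coord']['lon'],
                                 'Temperature': item['main']['temp'],
                                 'Humidity': item['main']['humidity']}]))
weather_df = pd.concat(frames, ignore_index=True)

# world polygons bundled with geopandas, used for the merge below
world_map = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))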
<ipython-input-17-68826faaad0a>:41: FutureWarning: The frame.append method is deprecated and will
be removed from pandas in a future version. Use pandas.concat instead.
weather_df = weather_df.append(location_df, ignore_index=True)
weather_df
# Rename the column used for merging in the world map DataFrame
world_map.rename(columns={'name': 'Location'}, inplace=True)
# Merge the weather data with the world map based on location
weather_map = world_map.merge(weather_df, on='Location')
# Create a folium map centered around the mean latitude and longitude of all locations
map_center = [weather_df['Latitude'].mean(), weather_df['Longitude'].mean()]
weather_map_folium = folium.Map(location=map_center, zoom_start=2)
<folium.folium.Map at 0x7f242a56f430>
type(weather_map_folium)
folium.folium.Map
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations,
and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Implementation:
Load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("Telecom_Customer_Churn.csv")
print(data.index)
print(data)
7042 Yes Yes Yes Yes Two year
Churn
0 No
1 No
2 Yes
3 No
4 Yes
... ...
7038 No
7039 No
7040 No
7041 Yes
7042 No
print(data.columns)
data.shape
(7043, 21)
print(data.head())
3 7795-CFOCW Male 0 No No 45 No
4 9237-HQITU Female 0 No No 2 Yes
[5 rows x 21 columns]
print(data.tail())
Churn
7038 No
7039 No
7040 No
7041 Yes
7042 No
[5 rows x 21 columns]
data.nunique()   # number of unique values in each column
customerID 7043
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 73
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1585
TotalCharges 6531
Churn 2
dtype: int64
# data.isna().sum() is used to count the number of missing values (NaN values) in each column of a
# pandas DataFrame called data.
data.isna().sum()
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
# isna() and isnull() are essentially the same method in pandas; both return a boolean mask of the
# same shape as the input object, indicating where missing values (NaN or None) are present.
data.isnull().sum()
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
data.describe()
unique, counts = np.unique(data['tenure'], return_counts=True)
print(unique, counts)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72] [ 11 613 238 200 176 133 110 131 123 119 116 99 117 109 76 99 80 87
97 73 71 63 90 85 94 79 79 72 57 72 72 65 69 64 65 88
50 65 59 56 64 70 65 65 51 61 74 68 64 66 68 68 80 70
68 64 80 65 67 60 76 76 70 72 80 76 89 98 100 95 119 170
362]
unique, counts = np.unique(data['TotalCharges'], return_counts=True)
print(unique, counts)
[' ' '100.2' '100.25' ... '999.45' '999.8' '999.9'] [11 1 1 ... 1 1 1]
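The unique values above show that TotalCharges is stored as text and contains 11 blank strings. The conversion step (task 6) is not shown in the listing; a minimal sketch of one reasonable approach is:
# convert TotalCharges to numeric; blank strings become NaN and are filled with the column median
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())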
# sns.pairplot(data) creates a grid of pairwise plots of the variables in a dataset, which can help you
# quickly visualize the relationships between different pairs of variables.
import seaborn as sns #Seaborn library for data visualization
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7fb9cc97a680>
plt.boxplot(data['MonthlyCharges'])
plt.show()
X = data.drop("Churn", axis=1)
y = data["Churn"]
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape
(5634, 20)
y_train.shape
(5634,)
X_test.shape
(1409, 20)
y_test.shape
(1409,)
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
Software Requirement:
Jupyter Notebook/Ubuntu
Implementation:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)
Data Wrangling is the process of gathering, collecting, and transforming raw data into another format for
better understanding, decision-making, accessing, and analysis in less time. Data Wrangling is also known
as Data Munging.
df1 = pd.read_csv("/content/Bengaluru_House_Data.csv")
df1.head()
df1.shape
(13320, 9)
df1.columns
df1['area_type']
df1['area_type'].unique()
df1['area_type'].value_counts()
df2 = df1.drop(['area_type','society','balcony','availability'],axis='columns')
df2.shape
(13320, 5)
df2.isnull().sum()
location 1
size 16
total_sqft 0
bath 73
price 0
dtype: int64
df2.shape
(13320, 5)
df3 = df2.dropna()
df3.isnull().sum()
location 0
size 0
total_sqft 0
bath 0
price 0
dtype: int64
df3.shape
(13246, 5)
df3['size'].unique()
array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
'1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
'7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
'9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
'10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
'12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
<ipython-input-15-4c4c73fbe7f4>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
df3.head()
df3.bhk.unique()
df3[df3.bhk>20]
df3.total_sqft.unique()
Explore total_sqft feature
def is_float(x):
    try:
        float(x)
    except:
        return False
    return True
df3[~df3['total_sqft'].apply(is_float)].head(10)
def convert_sqft_to_num(x):
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except:
        return None
convert_sqft_to_num('2100 - 2850')
2475.0
convert_sqft_to_num('34.46Sq. Meter')
df4 = df3.copy()
df4.total_sqft = df4.total_sqft.apply(convert_sqft_to_num)
df4
df4 = df4[df4.total_sqft.notnull()]
df4
13315 Whitefield 5 Bedroom 3453.0 4.0 231.00 5
13316 Richards Town 4 BHK 3600.0 5.0 400.00 4
13317 Raja Rajeshwari Nagar 2 BHK 1141.0 2.0 60.00 2
13318 Padmanabhanagar 4 BHK 4689.0 4.0 488.00 4
13319 Doddathoguru 1 BHK 550.0 1.0 17.00 1
For the row below, total_sqft is 2475.0, which is the average of the range 2100-2850.
df4.loc[30]
location Yelahanka
size 4 BHK
total_sqft 2475.0
bath 4.0
price 186.0
bhk 4
Name: 30, dtype: object
(2100 + 2850)/2
2475.0
df5 = df4.copy()
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
df5.head()
3 Lingadheeranahalli 3 BHK 1521.0 3.0 95.00 3
4 Kothanur 2 BHK 1200.0 2.0 51.00 2
price_per_sqft
0 3699.810606
1 4615.384615
2 4305.555556
3 6245.890861
4 4250.000000
df5_stats = df5['price_per_sqft'].describe()
df5_stats
count 1.320000e+04
mean 7.920759e+03
std 1.067272e+05
min 2.678298e+02
25% 4.267701e+03
50% 5.438331e+03
75% 7.317073e+03
max 1.200000e+07
Name: price_per_sqft, dtype: float64
df5.to_csv("bhp.csv",index=False)
Examine the location column, which is a categorical variable. We need to apply a dimensionality reduction technique
here to reduce the number of locations.
len(df5.location.unique())
1298
location_stats = df5['location'].value_counts()   # data points per location (defining cell not shown in the listing)
location_stats
Whitefield 533
Sarjapur Road 392
Electronic City 304
Kanakpura Road 264
Thanisandra 235
...
Rajanna Layout 1
Subramanyanagar 1
Lakshmipura Vidyaanyapura 1
Malur Hosur Road 1
Abshot Layout 1
Name: location, Length: 1287, dtype: int64
len(location_stats[location_stats>10])
240
len(location_stats)
1287
len(location_stats[location_stats<=10])
1047
Any location having fewer than 10 data points should be tagged as "other". This way the number of
categories can be reduced by a huge amount. Later on, when we do one-hot encoding, it will leave us with
fewer dummy columns.
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10
Gunjur Palya 10
Nagappa Reddy Layout 10
Sector 1 HSR Layout 10
Thyagaraja Nagar 10
..
Rajanna Layout 1
Subramanyanagar 1
Lakshmipura Vidyaanyapura 1
Malur Hosur Road 1
Abshot Layout 1
Name: location, Length: 1047, dtype: int64
len(df5.location.unique())
1287
df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(df5.location.unique())
241
df5.head(10)
price_per_sqft
0 3699.810606
1 4615.384615
2 4305.555556
3 6245.890861
4 4250.000000
5 3247.863248
6 7467.057101
7 18181.818182
8 4828.244275
9 36274.509804
Normally, the square footage per bedroom is about 300 (i.e. a 2 BHK apartment is a minimum of 600 sqft).
df5[df5.total_sqft/df5.bhk<300].head()
price_per_sqft
9 36274.509804
45 33333.333333
58 10660.980810
68 6296.296296
70 20000.000000
Check the above data points. We have a 6 BHK apartment with 1020 sqft; another is 8 BHK with a total of
600 sqft. These are clear data errors and can be removed safely.
df5.shape
(13200, 7)
df6 = df5[~(df5.total_sqft/df5.bhk<300)]
df6.shape
(12456, 7)
df6.columns
plt.boxplot(df6['total_sqft'])
plt.show()
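Only the tail of each outlier-removal cell survives in the listing (the df6.drop(bad_indices, ...) lines and the SettingWithCopyWarning they raise). A hedged sketch of the IQR-based logic those cells appear to use, reusing the names ul, ll and bad_indices visible in the fragments, is:
# hypothetical helper: drop rows whose value falls outside 1.5*IQR for the given column
def drop_iqr_outliers(df, column):
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    ll, ul = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # lower and upper limits
    upper_outliers = df[df[column] > ul].index.tolist()
    lower_outliers = df[df[column] < ll].index.tolist()
    bad_indices = list(set(upper_outliers + lower_outliers))
    df.drop(bad_indices, inplace=True, errors='ignore')
# applied in turn to total_sqft, bath, price, bhk and price_per_sqft in the cells below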
<ipython-input-51-c46bdd7d51e2>:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df6.drop(bad_indices, inplace = True, errors = 'ignore')
plt.boxplot(df6['bath'])
plt.show()
<ipython-input-54-cdb575bb4e89>:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
plt.boxplot(df6['price'])
plt.show()
upper_outliers = df6[df6['price'] > ul].index.tolist()
lower_outliers = df6[df6['price'] < ll].index.tolist()
bad_indices = list(set(upper_outliers + lower_outliers))
drop = True
if drop:
    df6.drop(bad_indices, inplace = True, errors = 'ignore')
<ipython-input-56-e0f097c1f625>:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
plt.boxplot(df6['bhk'])
plt.show()
<ipython-input-58-c12c1120f543>:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df6.drop(bad_indices, inplace = True, errors = 'ignore')
plt.boxplot(df6['price_per_sqft'])
plt.show()
<ipython-input-60-d349eb2f1f03>:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
df6.shape
(10090, 7)
X = df6.drop(['price'],axis='columns')
X.head(3)
2 Uttarahalli 3 BHK 1440.0 2.0 3 4305.555556
3 Lingadheeranahalli 3 BHK 1521.0 3.0 3 6245.890861
X.shape
(10090, 6)
y = df6.price
y.head(3)
0 39.07
2 62.00
3 95.00
Name: price, dtype: float64
len(y)
10090
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)   # split that produces the shapes below
X_train.shape
(8072, 6)
y_train.shape
(8072,)
X_test.shape
(2018, 6)
y_test.shape
(2018,)
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
EXPERIMENT NO. 10 (Group B)
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Implementation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
%matplotlib inline
data = pd.read_csv("data.csv")
print(data.index)
RangeIndex(start=0, stop=49005, step=1)
sns.set(style="ticks", rc = {'figure.figsize':(20,15)})
# Suppressing update warnings
import warnings
warnings.filterwarnings('ignore')
Checking the dataset
We can see that there are quite a number of NaNs in the dataset. To proceed with the EDA, we must handle these
NaNs by either removing them or filling them. I will be doing both.
# checking the original dataset
print(data.isnull().sum())
print(data.shape)
data.info()
stn_code 15764
sampling_date 0
state 0
location 0
agency 16355
type 994
so2 1312
no2 858
rspm 2696
spm 28659
location_monitoring_station 2537
pm2_5 49005
date 1
dtype: int64
(49005, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49005 entries, 0 to 49004
Data columns (total 13 columns):
# Column Non-Null Count Dtype
Since date also has missing values, we will drop the rows containing these values as they're of little use as well.
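The cell that parses the date column and drops the rows with missing dates is not shown in the listing; a minimal sketch, which also keeps a copy named aqi because the plot_for_state function further below uses that name, might be:
# hypothetical reconstruction: parse dates, drop the row with a missing date,
# and keep a copy named `aqi` for the plotting functions that follow
data['date'] = pd.to_datetime(data['date'], errors='coerce')
data = data.dropna(subset=['date'])
aqi = data.copy()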
Cleaning values
Since the geographical nomenclature has changed over time, we update it here as well so that the insights
correspond to the current names.
The type column
Currently, the type column has several names for the same type and therefore, it is better to clean it up and make
it more uniform.
# Cleaning up the data
data.state = data.state.replace({'Uttaranchal':'Uttarakhand'})
data.state[data.location == "Jamshedpur"] = data.state[data.location ==
'Jamshedpur'].replace({"Bihar":"Jharkhand"})
types = {
"Residential": "R",
"Residential and others": "RO",
"Residential, Rural and other Areas": "RRO",
"Industrial Area": "I",
"Industrial Areas": "I",
"Industrial": "I",
"Sensitive Area": "S",
"Sensitive Areas": "S",
"Sensitive": "S",
np.nan: "RRO"
}
data.type = data.type.replace(types)
data.head()
state location type so2 no2 rspm spm pm2_5 date
0 Andhra Pradesh Hyderabad RRO 4.8 17.4 NaN NaN NaN 1990-02-01
1 Andhra Pradesh Hyderabad I 3.1 7.0 NaN NaN NaN 1990-02-01
2 Andhra Pradesh Hyderabad RRO 6.2 28.5 NaN NaN NaN 1990-02-01
3 Andhra Pradesh Hyderabad RRO 6.3 14.7 NaN NaN NaN 1990-03-01
4 Andhra Pradesh Hyderabad I 4.7 7.5 NaN NaN NaN 1990-03-01
# defining columns of importance, which shall be used regularly
VALUE_COLS = ['so2', 'no2', 'rspm', 'spm', 'pm2_5']
Filling NaNs
Since our pollutant columns contain a lot of NaNs, we must fill them to have consistent data. If we drop the rows containing NaNs, we will be left with nothing.
I use SimpleImputer from sklearn.impute (v0.20.2+) to fill the missing values in every column with the mean.
# invoking SimpleImputer to fill missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
data[VALUE_COLS] = imputer.fit_transform(data[VALUE_COLS])
(truncated pandas traceback fragment from the assignment above; omitted)
date
49000 2005-03-23
49001 2005-03-25
49002 2005-03-28
49003 2005-03-30
49004 NaN
Plotting pollutant levels as yearly averages for states
# defining a function that plots SO2, NO2, RSPM and SPM yearly average levels for a given state
# since data is available monthly, it was resampled to a year and averaged to obtain yearly averages
# years for which no data was collected have not been imputed
def plot_for_state(state):
    fig, ax = plt.subplots(2, 2, figsize=(20, 12))
    fig.suptitle(state, size=20)
    state = aqi[aqi.state == state]
    state = state.reset_index().set_index('date')[VALUE_COLS].resample('Y').mean()
    state.so2.plot(legend=True, ax=ax[0][0], title="so2")
    ax[0][0].set_ylabel("so2 (µg/m3)")
    ax[0][0].set_xlabel("Year")
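    # The remaining three panels are cut off by the page break; a hedged sketch mirroring the so2 panel:
    state.no2.plot(legend=True, ax=ax[0][1], title="no2")
    ax[0][1].set_ylabel("no2 (µg/m3)")
    state.rspm.plot(legend=True, ax=ax[1][0], title="rspm")
    ax[1][0].set_ylabel("rspm (µg/m3)")
    state.spm.plot(legend=True, ax=ax[1][1], title="spm")
    ax[1][1].set_ylabel("spm (µg/m3)")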
plot_for_state("Uttar Pradesh")
Plotting Uttar Pradesh, we see that SO2 levels have fallen in the state while NO2 levels have risen. Conclusions
about RSPM and SPM can't be drawn since a lot of data is missing.
Plotting highest and lowest ranking states
# defining a function to find and plot the top 10 and bottom 10 states for a given indicator (defaults to SO2)
def top_and_bottom_10_states(indicator="so2"):
fig, ax = plt.subplots(2,1, figsize=(20, 12))
plt.xticks(rotation=90)
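The bodies of top_and_bottom_10_states and of the highest_levels_recorded function called below are lost to the page break. A hedged sketch of highest_levels_recorded, consistent with the description that follows (a bar plot of the highest level ever recorded per state), might look like:
# hypothetical reconstruction: plots the maximum recorded level of the given indicator for each state
def highest_levels_recorded(indicator="so2"):
    plt.figure(figsize=(20, 10))
    state_max = data[['state', indicator]].groupby('state').max().sort_values(by=indicator, ascending=False)
    sns.barplot(x=state_max.index, y=state_max[indicator])
    plt.title("Highest {} level ever recorded, by state".format(indicator))
    plt.xticks(rotation=90)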
highest_levels_recorded("no2")
highest_levels_recorded("rspm")
Plotting for NO2, we can see that Rajasthan recorded the highest ever NO2 level. Plotting for RSPM, we can see
that Uttar Pradesh recorded the highest ever RSPM level.
Plotting yearly trends
# defining a function to plot the yearly trend values for a given indicator (defaults to SO2) and state
# (defaults to overall)
def yearly_trend(state="", indicator="so2"):
    plt.figure(figsize=(20, 12))
    data['year'] = data.date.dt.year
    if state == "":
        year_wise = data[[indicator, 'year', 'state']].groupby('year', as_index=False).median()
        trend = sns.pointplot(x='year', y=indicator, data=year_wise)
        trend.set_title('Yearly trend of {}'.format(indicator))
    else:
        year_wise = data[[indicator, 'year', 'state']].groupby(['state', 'year']).median().loc[state].reset_index()
        trend = sns.pointplot(x='year', y=indicator, data=year_wise)
        trend.set_title('Yearly trend of {} for {}'.format(indicator, state))
yearly_trend()
yearly_trend("Bihar", "no2")
(truncated traceback fragment from the yearly_trend() calls above; omitted)
# defining a function to plot a state-by-year heatmap of annual median levels for a given indicator
def indicator_by_state_and_year(indicator):
    plt.figure(figsize=(20, 20))
    hmap = sns.heatmap(
        data=data.pivot_table(values=indicator, index='state', columns='year', aggfunc='median', margins=True),
        annot=True, linewidths=.5, cbar=True, square=True, cmap='inferno', cbar_kws={'label': "Annual Average"})
# Calling this function raised KeyError: 'year' (traceback truncated), indicating the 'year' column
# had not been added to data before pivot_table ran.
<Figure size 2000x2000 with 0 Axes>
Plotting pollutant average by type
# defining a function to plot pollutant averages by type for a given indicator
def type_avg(indicator=""):
    type_avg = data[VALUE_COLS + ['type', 'date']].groupby("type").mean()
    if indicator != "":
        t = type_avg[indicator].plot(kind='bar')
        plt.xticks(rotation=0)
        plt.title("Pollutant average by type for {}".format(indicator))
    else:
        t = type_avg.plot(kind='bar')
        plt.xticks(rotation=0)
        plt.title("Pollutant average by type")
type_avg('so2')
Plotting pollutant averages by locations/state
# defining a function to plot pollutant averages for a given indicator (defaults to SO2) by locations in a given state
def location_avgs(state, indicator="so2"):
    locs = data[VALUE_COLS + ['state', 'location', 'date']].groupby(['state', 'location']).mean()
    state_avgs = locs.loc[state].reset_index()
    sns.barplot(x='location', y=indicator, data=state_avgs)
    plt.title("Location-wise average for {} in {}".format(indicator, state))
    plt.xticks(rotation=90)

location_avgs("Bihar", "no2")
Roll No.
Class BE
Date of Completion
Subject Computer Laboratory-I
Assessment Marks
Assessor's Sign
Hardware Requirement:
Software Requirement:
Jupyter Notebook/Ubuntu
Implementation:
import pandas as pd
import matplotlib.pyplot as plt
Data Aggregation is important for deriving granular insights about individual customers and for better
understanding their perception and expectations regarding the product.
Regardless of its size and type, every business organization needs valuable data and insights to combat the
day-to-day challenges of the competitive market. If a business wants to thrive in the market, then it must understand its
target audience and customer preferences, and in this, big data plays a vital role.
About Dataset
The dataset contains shopping information from 10 different shopping malls between 2021 and 2023. We have gathered
data from various age groups and genders to provide a comprehensive view of shopping habits in Istanbul. The
dataset includes essential information such as invoice numbers, customer IDs, age, gender, payment methods,
product categories, quantity, price, order dates, and shopping mall locations.
Attribute Information:
invoice_no: Invoice number. Nominal. A combination of the letter 'I' and a 6-digit integer uniquely assigned to each
operation.
customer_id: Customer number. Nominal. A combination of the letter 'C' and a 6-digit integer uniquely assigned
to each operation.
price: Unit price. Numeric. Product price per unit in Turkish Liras (TL).
payment_method: String variable of the payment method (cash, credit card or debit card) used for the transaction.
shopping_mall: String variable of the name of the shopping mall where the transaction was made.
# dataset source: https://www.kaggle.com/datasets/mehmettahiraslan/customer-shopping-dataset
#df = pd.read_csv("/content/customer_shopping_data.csv")
df= pd.read_csv("/content/customer_shopping_data.csv")
df.head()
df.groupby("shopping_mall").count()
Metrocity 4193 4193 4193 4193 4193 4193
Metropol AVM 2856 2856 2856 2856 2856 2856
Viaport Outlet 1389 1389 1389 1389 1389 1389
Zorlu Center 1392 1392 1392 1392 1392 1392
df.groupby("category").count()
Toys 2819 2819 2819 2819 2819 2819
branch_sales = df.groupby("shopping_mall").sum()
category_sales = df.groupby("category").sum()
In the above two cells, the sum method returns sums for all numeric columns. For some attributes, such as age,
this sum is not meaningful; a more targeted aggregation is sketched after the output below.
Souvenir 216922 14871 174436.83
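The cell that defines combined_branch_category_sales (displayed next) is not shown in the listing. A hedged sketch, which also restricts the sums to the columns where a total is meaningful (quantity and price, per the attribute description above), might be:
# totals per mall and per category, restricted to meaningful numeric columns
branch_sales = df.groupby("shopping_mall")[["quantity", "price"]].sum()
category_sales = df.groupby("category")[["quantity", "price"]].sum()

# combined view: totals broken down by both shopping mall and product category
combined_branch_category_sales = df.groupby(["shopping_mall", "category"])[["quantity", "price"]].sum()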
combined_branch_category_sales