
Bharatiya Vidya Bhavan’s

SARDAR PATEL INSTITUTE OF TECHNOLOGY


(Autonomous Institute Affiliated to University of Mumbai)
Munshi Nagar, Andheri (W), Mumbai – 400 058.
Department of Computer Engineering
Experiment 1

Aim: Part 1 - H/W & S/W Requirement; Part 2 - Dataset Analysis
Name: Piyanshu Gehani
UID: 2022600012
Subject: ML
Class: CSE(AIML)

Output:

Part 1: Laboratory Setup of ML

I successfully downloaded and set up Anaconda along with Python, launched and updated Anaconda, installed the CUDA Toolkit and cuDNN, and created a separate Anaconda environment on my local machine.

Part 2: Exploration and Analysis of the Dataset

Dataset Link:
https://www.kaggle.com/code/atifaliak/eda-on-icc-cricket-world-cup-2023/input

Theory:

Exploratory Data Analysis (EDA) on the ICC Cricket World Cup 2023 explores
batting dynamics through a detailed ball-by-ball dataset. This analysis delves
into performance metrics, trends in player strategies, scoring patterns, and
team batting strengths across different match conditions, offering insights into
key players and moments that defined the tournament.

Description: We collected, explored, and imported the deliveries.csv dataset, which contains ball-by-ball details of cricket matches with 26,119 entries and 22 columns. The dataset was loaded using pd.read_csv() and examined using df.head() and df.info(), revealing its structure and missing values in columns like wides, noballs, and penalty. We utilized Python libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn for data handling, visualization, and preprocessing, including missing value imputation with SimpleImputer. This provided a foundational understanding of dataset preparation for further analysis.
Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the ball-by-ball dataset
df = pd.read_csv('deliveries.csv')

# Print the first 5 rows of the dataset
df.head()
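
The description above also mentions examining the data with df.info() and checking for missing values; a minimal sketch of that inspection (the affected column names, such as wides, noballs, and penalty, come from the dataset description):

# Summarize column dtypes and non-null counts
df.info()

# Show only the columns that contain missing values
missing = df.isnull().sum()
print(missing[missing > 0])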

Scale: Scaling is a crucial preprocessing step that standardizes numerical data to ensure uniformity and improve model performance. In this experiment, we identified numerical columns using df.select_dtypes(include=['int64', 'float64']) and applied StandardScaler from Scikit-learn.

Standardization transforms the data to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating others. This step is essential for machine learning algorithms that are sensitive to feature scale, such as distance-based clustering, gradient-based regression, and neural networks.

Code:

# Select the numerical columns to scale
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Standardize each numerical column to zero mean and unit variance
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df.head()
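
As a quick sanity check (a small sketch, not part of the original run), the scaled columns should now show a mean of approximately 0 and a standard deviation of approximately 1:

# Verify standardization: means ~0 and standard deviations ~1
print(df[numerical_cols].mean().round(3))
print(df[numerical_cols].std().round(3))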

Appearance of Dataset:
Cleaning:

I used various methods for cleaning, as the dataset contained missing values in multiple columns (implemented in the sketch after this list):

1] Identified missing values using df.isnull().sum() and filtered the columns with missing data.

2] Used SimpleImputer with the 'most_frequent' strategy to fill missing values in wicket_type and other_player_dismissed.

3] Reshaped the categorical data using .values.reshape(-1,1) and flattened it back to 1D for proper integration.

4] Verified the imputation process by rechecking missing values with df.isnull().sum().
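
A minimal sketch of the four steps above (the report describes them but does not show the code; the column names follow the dataset description):

# Step 1: identify columns with missing values
missing = df.isnull().sum()
print(missing[missing > 0])

# Steps 2 and 3: impute categorical columns with the most frequent value,
# reshaping to 2D for SimpleImputer and flattening back to 1D
imputer = SimpleImputer(strategy='most_frequent')
for col in ['wicket_type', 'other_player_dismissed']:
    df[col] = imputer.fit_transform(df[col].values.reshape(-1, 1)).ravel()

# Step 4: verify that the imputed columns no longer have missing values
print(df.isnull().sum())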
Analysis:

# Distribution of runs off bat
sns.histplot(df['runs_off_bat'], kde=True, bins=20)
plt.title('Distribution of Runs Off Bat')
plt.xlabel('Runs Off Bat')
plt.ylabel('Frequency')
plt.show()

The distribution of runs off the bat is right-skewed, with dot balls (0 runs) being the most
frequent, followed by singles (1 run). Boundaries (4s and 6s) occur less frequently but
contribute significantly to the scoring. Running between the wickets for 2s and 3s is rare,
indicating a preference for strike rotation and boundary-hitting. This trend highlights
bowling effectiveness in restricting runs and batting strategies focused on minimizing dot
balls and maximizing scoring opportunities.
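
These claims can be quantified directly (a small sketch; it assumes the counts are taken from the raw, unscaled runs_off_bat values, e.g. from a fresh copy of the CSV loaded before scaling):

# Frequency of each outcome off the bat (0s, 1s, 2s, 3s, 4s, 6s)
raw = pd.read_csv('deliveries.csv')
print(raw['runs_off_bat'].value_counts().sort_index())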
# Total runs scored by each batting team
team_runs = df.groupby('batting_team')['runs_off_bat'].sum().sort_values(ascending=False)

# Plot the total runs scored by each team
plt.figure(figsize=(12, 6))
sns.barplot(x=team_runs.index, y=team_runs.values)
plt.title('Total Runs by Batting Team')
plt.xlabel('Batting Team')
plt.ylabel('Total Runs')
plt.xticks(rotation=90)
plt.show()
India and Australia lead in total runs scored, indicating strong batting
performances. South Africa and New Zealand follow closely, showing consistency
in run accumulation. Pakistan and England have moderate totals, suggesting a
balanced mix of aggressive and defensive play. Afghanistan, Bangladesh, and Sri
Lanka have comparable run totals, while the Netherlands has the lowest,
reflecting a possible gap in batting strength against top teams.

# Boxplot for the relationship between batting team and runs off bat
plt.figure(figsize=(12, 6))
sns.boxplot(x='batting_team', y='runs_off_bat', data=df)
plt.title('Runs Off Bat by Batting Team')
plt.xlabel('Batting Team')
plt.ylabel('Runs Off Bat')
plt.xticks(rotation=90)
plt.show()
Variable Relationship:

1] Strong Correlation: Features like extras, wides, noballs, and byes show a high correlation with each other, indicating that these extra runs are often recorded together.

2] Negative Correlation: The feature ball has a negative correlation with byes (-0.37), suggesting that as the number of deliveries increases, the occurrence of byes might decrease.

3] Weak Correlation: Most features exhibit weak correlations with each other, implying that they are largely independent and contribute uniquely to the dataset.

4] Redundant Features: Highly correlated features, such as extras and wides, may be considered for dimensionality reduction to avoid redundancy in machine learning models.

5] Feature Relationships: Match-related attributes like innings and ball show only slight correlations with scoring-related features, meaning external factors might play a larger role in influencing runs.

# Correlation matrix
# Convert 'season' column to numeric if it's not already;
# invalid parsing will be set as NaN
df['season'] = pd.to_numeric(df['season'], errors='coerce')

# Drop non-numeric columns before calculating correlation
numeric_df = df.select_dtypes(include=np.number)
corr = numeric_df.corr()

# Plot the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
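
Building on point 4 above, the redundant pairs can also be listed programmatically (a small sketch; the 0.8 cutoff is an assumed threshold, not from the original report):

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.8  # assumed cutoff for "highly correlated"; tune as needed
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
pairs = upper.stack()  # drops NaNs, leaving one entry per feature pair
print(pairs[pairs.abs() > threshold].sort_values(ascending=False))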


Conclusion: In this analysis, I successfully cleaned the cricket match dataset by handling missing values in critical columns such as wicket_type and other_player_dismissed using SimpleImputer. I then explored the data, identifying key patterns such as strong correlations among extras, wides, and noballs, and weak correlations of match-related attributes like innings and ball with scoring features. The heatmap analysis further revealed redundant features that could be candidates for dimensionality reduction. These insights leave the dataset well preprocessed and ready for further analysis and model development.
