
Bharatiya Vidya Bhavan’s

SARDAR PATEL INSTITUTE OF TECHNOLOGY


(Autonomous Institute Affiliated to University of Mumbai)
Munshi Nagar, Andheri (W), Mumbai – 400 058.
Department of Computer Engineering
Experiment 1

Aim: Part 1 - H/W & S/W Requirement; Part 2 - Dataset Analysis
Name: Piyanshu Gehani
UID: 2022600012
Subject: ML
Class: CSE(AIML)

Output:

Part 1: Laboratory Setup of ML

I successfully downloaded and set up Anaconda along with Python, launched and updated Anaconda, installed the CUDA Toolkit and cuDNN, and created a separate Anaconda environment on my local machine.

Part 2: Exploration and Analysis of the Dataset

Dataset Link:
https://www.kaggle.com/code/atifaliak/eda-on-icc-cricket-world-cup-2023/input

Theory:

Exploratory Data Analysis (EDA) on the ICC Cricket World Cup 2023 explores
batting dynamics through a detailed ball-by-ball dataset. This analysis delves
into performance metrics, trends in player strategies, scoring patterns, and
team batting strengths across different match conditions, offering insights into
key players and moments that defined the tournament.

Description: We collected, explored, and imported the deliveries.csv dataset, which contains ball-by-ball details of cricket matches with 26,119 entries and 22 columns. The dataset was loaded using pd.read_csv() and examined using df.head() and df.info(), revealing its structure and missing values in columns like wides, noballs, and penalty. We utilized Python libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn for data handling, visualization, and preprocessing, including missing value imputation with SimpleImputer. This provided a foundational understanding of dataset preparation for further analysis.
Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the ball-by-ball dataset
df = pd.read_csv('deliveries.csv')

# Print the first 5 rows of the dataset
df.head()
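
The description above also mentions examining the data with df.info() and checking for missing values; a minimal sketch of that inspection (the affected column names, such as wides, noballs, and penalty, come from the dataset description):

# Summarize column dtypes and non-null counts
df.info()

# Show only the columns that contain missing values
missing = df.isnull().sum()
print(missing[missing > 0])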

Scale: Scaling is a crucial preprocessing step that standardizes numerical data to ensure uniformity and improve model performance. In this experiment, we identified numerical columns using df.select_dtypes(include=['int64', 'float64']) and applied StandardScaler from Scikit-learn.

Standardization transforms the data to have a mean of 0 and a standard deviation of 1, preventing features with larger scales from dominating others. This step is essential for machine learning algorithms that are sensitive to feature scale, such as distance-based clustering, gradient-based regression, and neural networks.

Code:

# Select the numerical columns to scale
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Standardize each numerical column to zero mean and unit variance
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df.head()
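
As a quick sanity check (a small sketch, not part of the original run), the scaled columns should now show a mean of approximately 0 and a standard deviation of approximately 1:

# Verify standardization: means ~0 and standard deviations ~1
print(df[numerical_cols].mean().round(3))
print(df[numerical_cols].std().round(3))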

Appearance of Dataset:
Cleaning:

I used various methods for cleaning, as the dataset contained missing values in multiple columns (implemented in the sketch after this list):

1] Identified missing values using df.isnull().sum() and filtered the columns with missing data.

2] Used SimpleImputer with the 'most_frequent' strategy to fill missing values in wicket_type and other_player_dismissed.

3] Reshaped the categorical data using .values.reshape(-1,1) and flattened it back to 1D for proper integration.

4] Verified the imputation process by rechecking missing values with df.isnull().sum().
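
A minimal sketch of the four steps above (the report describes them but does not show the code; the column names follow the dataset description):

# Step 1: identify columns with missing values
missing = df.isnull().sum()
print(missing[missing > 0])

# Steps 2 and 3: impute categorical columns with the most frequent value,
# reshaping to 2D for SimpleImputer and flattening back to 1D
imputer = SimpleImputer(strategy='most_frequent')
for col in ['wicket_type', 'other_player_dismissed']:
    df[col] = imputer.fit_transform(df[col].values.reshape(-1, 1)).ravel()

# Step 4: verify that the imputed columns no longer have missing values
print(df.isnull().sum())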
Analysis:

# Distribution of runs off bat
sns.histplot(df['runs_off_bat'], kde=True, bins=20)
plt.title('Distribution of Runs Off Bat')
plt.xlabel('Runs Off Bat')
plt.ylabel('Frequency')
plt.show()

The distribution of runs off the bat is right-skewed, with dot balls (0 runs) being the most
frequent, followed by singles (1 run). Boundaries (4s and 6s) occur less frequently but
contribute significantly to the scoring. Running between the wickets for 2s and 3s is rare,
indicating a preference for strike rotation and boundary-hitting. This trend highlights
bowling effectiveness in restricting runs and batting strategies focused on minimizing dot
balls and maximizing scoring opportunities.
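
These claims can be quantified directly (a small sketch; it assumes the counts are taken from the raw, unscaled runs_off_bat values, e.g. from a fresh copy of the CSV loaded before scaling):

# Frequency of each outcome off the bat (0s, 1s, 2s, 3s, 4s, 6s)
raw = pd.read_csv('deliveries.csv')
print(raw['runs_off_bat'].value_counts().sort_index())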
# Total runs scored by each batting team
team_runs = df.groupby('batting_team')['runs_off_bat'].sum().sort_values(ascending=False)

# Plot the total runs scored by each team
plt.figure(figsize=(12, 6))
sns.barplot(x=team_runs.index, y=team_runs.values)
plt.title('Total Runs by Batting Team')
plt.xlabel('Batting Team')
plt.ylabel('Total Runs')
plt.xticks(rotation=90)
plt.show()
India and Australia lead in total runs scored, indicating strong batting
performances. South Africa and New Zealand follow closely, showing consistency
in run accumulation. Pakistan and England have moderate totals, suggesting a
balanced mix of aggressive and defensive play. Afghanistan, Bangladesh, and Sri
Lanka have comparable run totals, while the Netherlands has the lowest,
reflecting a possible gap in batting strength against top teams.

# Boxplot for the relationship between batting team and runs off bat
plt.figure(figsize=(12, 6))
sns.boxplot(x='batting_team', y='runs_off_bat', data=df)
plt.title('Runs Off Bat by Batting Team')
plt.xlabel('Batting Team')
plt.ylabel('Runs Off Bat')
plt.xticks(rotation=90)
plt.show()
Variable Relationship:

1] Strong Correlation: Features like extras, wides, noballs, and byes show a high correlation with each other, indicating that these extra runs are often recorded together.

2] Negative Correlation: The feature ball has a negative correlation with byes (-0.37), suggesting that as the number of deliveries increases, the occurrence of byes might decrease.

3] Weak Correlation: Most features exhibit weak correlations with each other, implying that they are largely independent and contribute uniquely to the dataset.

4] Redundant Features: Highly correlated features, such as extras and wides, may be considered for dimensionality reduction to avoid redundancy in machine learning models.

5] Feature Relationships: Match-related attributes like innings and ball show only slight correlations with scoring-related features, meaning external factors might play a larger role in influencing runs.

# Correlation matrix
# Convert 'season' column to numeric if it's not already;
# invalid parsing will be set as NaN
df['season'] = pd.to_numeric(df['season'], errors='coerce')

# Drop non-numeric columns before calculating correlation
numeric_df = df.select_dtypes(include=np.number)
corr = numeric_df.corr()

# Plot the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
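
Building on point 4 above, the redundant pairs can also be listed programmatically (a small sketch; the 0.8 cutoff is an assumed threshold, not from the original report):

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.8  # assumed cutoff for "highly correlated"; tune as needed
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
pairs = upper.stack()  # drops NaNs, leaving one entry per feature pair
print(pairs[pairs.abs() > threshold].sort_values(ascending=False))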


Conclusion: In this analysis, I successfully cleaned the cricket match dataset by handling missing values in critical columns such as wicket_type and other_player_dismissed using SimpleImputer. I then explored the data, identifying key patterns such as strong correlations among extras, wides, and noballs, and weak correlations of match-related attributes like innings and ball with scoring features. The heatmap analysis further revealed redundant features that could be candidates for dimensionality reduction. These insights leave the dataset well preprocessed and ready for further analysis and model development.
