ML 1
Subject ML
Class CSE(AIML)
Output: Part 1 : Laboratory set up of ML
Dataset Link:
https://www.kaggle.com/code/atifaliak/eda-on-icc-cricket-world-cup-2023/input
Theory:
Exploratory Data Analysis (EDA) on the ICC Cricket World Cup 2023 explores
batting dynamics through a detailed ball-by-ball dataset. This analysis delves
into performance metrics, trends in player strategies, scoring patterns, and
team batting strengths across different match conditions, offering insights into
key players and moments that defined the tournament.
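Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the ball-by-ball dataset and preview the first rows
df = pd.read_csv('deliveries.csv')
df.head()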
Appearance of Dataset:
Cleaning:
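A minimal sketch of one possible cleaning pass, assuming the extras columns (wides, noballs, byes, legbyes) are left empty when a delivery carries no such extra. The helper name clean_deliveries, the column list, and the small sample frame are illustrative assumptions, not the contents of deliveries.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for deliveries.csv (not real match data)
sample = pd.DataFrame({
    'batting_team': ['India', 'India', 'Australia', 'Australia'],
    'runs_off_bat': [0, 4, 1, 0],
    'extras': [0, 0, 0, 1],
    'wides': [np.nan, np.nan, np.nan, 1.0],
    'noballs': [np.nan, np.nan, np.nan, np.nan],
})


def clean_deliveries(frame, extras_cols=('wides', 'noballs', 'byes', 'legbyes')):
    """Return a copy in which missing extras are treated as 0 runs."""
    cleaned = frame.copy()
    # Only touch the extras columns that actually exist in this frame
    present = [col for col in extras_cols if col in cleaned.columns]
    if present:
        cleaned[present] = cleaned[present].fillna(0)
    return cleaned


missing_before = sample.isna().sum()   # missing values per column, before cleaning
cleaned = clean_deliveries(sample)
missing_after = cleaned.isna().sum()   # missing values per column, after cleaning
print(missing_before)
print(missing_after)
```

On the real data the same check would be run as clean_deliveries(df); whether 0 is the right fill value depends on how the dataset encodes deliveries without extras.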
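# Distribution of runs off the bat (plot call reconstructed; assumes the runs_off_bat column)
sns.histplot(df['runs_off_bat'], discrete=True)
plt.ylabel('Frequency')
plt.show()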
The distribution of runs off the bat is right-skewed, with dot balls (0 runs) being the most
frequent, followed by singles (1 run). Boundaries (4s and 6s) occur less frequently but
contribute significantly to the scoring. Running between the wickets for 2s and 3s is rare,
indicating a preference for strike rotation and boundary-hitting. This trend highlights
bowling effectiveness in restricting runs and batting strategies focused on minimizing dot
balls and maximizing scoring opportunities.
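The claims above about skew, dot balls, and the contribution of boundaries can also be checked numerically rather than read off the plot. The sketch below uses a small made-up series in place of df['runs_off_bat']; the values are illustrative only.

```python
import pandas as pd

# Hypothetical runs_off_bat values, not real tournament data
runs = pd.Series([0, 0, 0, 1, 1, 4, 0, 2, 6, 1], name='runs_off_bat')

# Number of deliveries for each run value, ordered 0, 1, 2, ...
counts = runs.value_counts().sort_index()

# Fraction of all deliveries that produced each run value
share = counts / counts.sum()

# Total runs contributed by each run value (run value times its frequency)
runs_contributed = counts.index.to_numpy() * counts

# Positive skewness indicates a right-skewed distribution
skewness = runs.skew()

print(counts)
print(share.round(2))
print(runs_contributed)
print(skewness)
```

Comparing share with runs_contributed shows how a run value can be rare in frequency yet account for a large part of the total.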
# Total runs scored by each batting team
team_runs = df.groupby('batting_team')['runs_off_bat'].sum().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=team_runs.index, y=team_runs.values)
plt.xlabel('Batting Team')
plt.ylabel('Total Runs')
plt.xticks(rotation=90)
plt.show()
India and Australia lead in total runs scored, indicating strong batting
performances. South Africa and New Zealand follow closely, showing consistency
in run accumulation. Pakistan and England have moderate totals, suggesting a
balanced mix of aggressive and defensive play. Afghanistan, Bangladesh, and Sri
Lanka have comparable run totals, while the Netherlands has the lowest,
reflecting a possible gap in batting strength against top teams.
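Total runs depend on how many deliveries each team faced, so teams that played more matches will tend to rank higher. A minimal sketch of normalising the totals, assuming one row per delivery and using made-up values in place of df:

```python
import pandas as pd

# Hypothetical sample standing in for deliveries.csv (not real match data)
sample = pd.DataFrame({
    'batting_team': ['India', 'India', 'India', 'India',
                     'Netherlands', 'Netherlands'],
    'runs_off_bat': [4, 1, 0, 6, 1, 2],
})

# Total runs and number of recorded deliveries per team.
# Note: the row count includes wides and no-balls, so it slightly
# overstates the number of legal deliveries on real data.
summary = (
    sample.groupby('batting_team')['runs_off_bat']
    .agg(total_runs='sum', deliveries='count')
)

# Runs per recorded delivery makes teams with different workloads comparable
summary['runs_per_delivery'] = summary['total_runs'] / summary['deliveries']
summary = summary.sort_values('total_runs', ascending=False)
print(summary)
```

If the ranking by runs_per_delivery differs from the ranking by total_runs, the difference in totals is at least partly explained by deliveries faced rather than batting strength.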
# Boxplot for the relationship between batting team and runs off bat
plt.figure(figsize=(12, 6))
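# Plot call and title reconstructed; assumes the batting_team and runs_off_bat columns used above
sns.boxplot(x='batting_team', y='runs_off_bat', data=df)
plt.title('Runs off Bat by Batting Team')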
plt.xlabel('Batting Team')
plt.xticks(rotation=90)
plt.show()
Variable Relationship:
1] Strong Correlation: Features like extras, wides, noballs, and byes show a high
correlation with each other, indicating that these extra runs are often recorded
together.
2] Negative Correlation: The feature ball has a negative correlation with byes
(-0.37), suggesting that as the number of deliveries increases, the occurrence of
byes might decrease.
5] Feature Relationships: Match-related attributes like innings and ball show only
slight correlations with scoring-related features, meaning external factors might
play a larger role in influencing runs.
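The pairwise relationships listed above can be ranked directly from the correlation matrix instead of being read off a heatmap. The sketch below is one way to do this, using a small made-up frame in place of df; the values and the team column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for deliveries.csv (not real match data)
sample = pd.DataFrame({
    'runs_off_bat': [0, 4, 1, 0, 0, 6],
    'extras': [0, 0, 0, 1, 2, 0],
    'wides': [0, 0, 0, 1, 2, 0],
    'team': list('AABBCC'),   # non-numeric column, excluded below
})

# Keep only numeric columns, then compute Pearson correlations
numeric = sample.select_dtypes(include=np.number)
corr = numeric.corr()

# Keep the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)   # strongest relationships first
)
print(pairs)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which makes pairs such as ball and byes easy to spot.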
# Correlation matrix
# Keep only the numeric columns before computing correlations
numeric_df = df.select_dtypes(include=np.number)
corr = numeric_df.corr()
plt.figure(figsize=(12, 8))