Final Report 2024
Final Report 2024
ON
DATA ANALYTICS
AT
GREAT LEARNING ACADEMY
Submitted By:
I (Aman Gupta) hereby declare that I have undertaken six week industrial training project during
a period from JULY 02 2024 to AUGUST 17 2024 in partial fulfillment of requirements for the
award of degree of Bachelor of Technology in Computer Science & Engineering at School of
Engineering and Emerging Technology, BUEST, Baddi. The work which is being presented in
the industrial project report is an authentic record of our work carried out under the guidance of
Mr. Bhavin Akelle. We have not submitted this work elsewhere for any other degree or
diploma.
Signature of Examiner
ii
ACKNOWLEDGEMENT
Behind every successful effort, there lie contributions from numerous sources
irrespective of their magnitude. Hardwork and dedication are not the only thing
required for the completion of a Project, but equally important is proper guidance and
inspiration. Our project is no exception and we take this opportunity to thank all those
who are lending a helping hand.
We take this opportunity to express our deep and sincere gratitude to most esteemed
Head of Department Ms. Agrimaa Singh Thakur and Project Guide Mr. Bhavin Akelle,
as who have been kind enough to spare their valuable time, on which we have no
claim. Their guidance and motivation conceived a direction in us, and are helping us to
make this project a grand success.
Last but not the least we shall remain thankful to all our classmates, at present as well
as in future who are cooperating with us in making this project happening.
iii
COMPANY PROFILE
“POWER AHEAD”
With more than 9.2 Million learners in 170+ countries, Great Learning is a leading global ed-tech
company for professional and higher education offering industry-relevant programs in blended,
classroom, and purely online modes across technology, data, and business domains. These programs
are developed in collaboration with top academic institutions of the world.
Great Learning is an ed-tech company owned by BYJU’S and Founded by Mohan Lakhamraju in
2013. It offers comprehensive industry-relevant online programs in software engineering, business
management, business analytics, data science, AI ML, cloud computing, cyber security, digital
marketing, and design thinking among others. Great Learning’s programs are developed in
collaboration with popular universities like Stanford University, MIT, The University of Texas at
Austin, National University Singapore, IIT Madras, IIT Bombay, IIT Roorkee, and Great Lakes
Institute of Management.
As we all know, Great Learning is the Indian first Online Learning Platform for professional
learning course providers. They provide a fully online course with an expert mentor and experienced
facilities. Great Learning makes users unable to learn a course from popular universities like
Stanford University, Texas McCombs, and Great Lakes from Home. and The programs follow a
learn-by-doing approach to make professionals job-ready. All the faculty members are very good
and supportive. and They also offer mock interviews to prepare for your dream job.
GREAT LEARNING believe in constant learning to become more powerful & stronger . To feature
their core philosophy of “guided growth” for us learners , they’ve revealed a new logo , that mainly
focuses on vibrant , distinct , and a strong visual appeal that features their strong tagline “power
ahead”.
iv
ABSTRACT
The integration of data analytics into cricket, particularly in high-profile tournaments like the
Cricket World Cup, has revolutionized how teams strategize, perform, and analyze their chances of
success. This paper explores the role of data analytics in shaping team dynamics, player
performance, and tactical decisions during the Cricket World Cup. By leveraging vast amounts of
historical data, player statistics, match simulations, and real-time performance metrics, teams are
able to gain insights into opposition strategies, player form, and optimal game tactics. Advanced
analytics tools such as machine learning models, predictive analytics, and data visualization are now
integral in decision-making processes related to player selection, batting order, field placements, and
match forecasts. The study highlights the impact of these data-driven approaches on the 2019 and
2023 World Cups, examining how data analysis has influenced outcomes, and offers a forward-
looking perspective on how emerging technologies like AI and big data will continue to transform
the future of cricket.Ultimately, data analytics has become a critical enabler in enhancing team
performance, providing deeper insights into game dynamics, and fostering innovation in cricket
strategy at the World Cup level.
Keywords: Data Analytics, Cricket World Cup, Machine Learning, Player Performance, Predictive
Analytics, Team Strategy, Big Data, AI in Sports.
v
LIST OF FIGURES
FIGURE PAGE
FIGURE NAME
NO. NO.
vi
TABLE OF CONTENTS
Certificate(Training) i
Candidate Declaration ii
Acknowledgement iii
Company Profile iv
Abstract v
List of figures vi
Table of Content vii
1. INTRODUCTION 1-2
3. WORKING OF PROJECT 20
5.1 CONCLUSION 30
6. REFERENCES 32
vii
CHAPTER – 1
INTRODUCTION
3. Exploratory and Predictive: It enables both exploration of past data (e.g., performance
trends) and predictions about the future (e.g., expected scores or match outcomes).
1
Application of Data Analytics:
Sports Analytics
Business Intelligence
Finance and Banking
Marketing and Customer Analytics
2
1.1 INTRODUCTION TO PROJECT
The Cricket World Cup is one of the most prestigious tournaments in international
cricket, showcasing the skills and strategies of the best cricketing nations. Over the years,
it has become a data-rich domain, providing an abundance of information about matches,
players, and teams. Analyzing this data can offer valuable insights into team
performances, player contributions, and factors influencing match outcomes.
1.4 OBJECTIVES:
The primary objective of Cricket World Cup data analysis is to derive meaningful insights
from historical and real-time data to improve understanding, decision-making, and
strategic planning in cricket. This involves evaluating team and player performances,
identifying trends, and highlighting factors that contribute to success in the tournament.
Below is a detailed outline of objectives for Cricket World Cup data analysis. Key metrics
such as:
4
Fig.1.3 (FRONT END)
5
Fig.1.5 (FRONT END)
6
Fig.1.7 (FRONT END)
7
Fig.1.9 (FRONT END)
8
1.6 FEASIBILITY STUDY:
A feasibility study assesses the practicality and viability of implementing a Cricket World Cup data
analysis system. It evaluates technical, economic, operational, legal, and schedule-related aspects to
determine whether the project is achievable and beneficial. Below is a detailed feasibility study for
this project.
Analysts and coaches are the primary users. They can use reports and visualizations for
decision-making.
Stakeholders like broadcasters and fans benefit from enhanced content.
9
CHAPTER – 2
SYSTEM ANALYSIS AND DESIGN
The lack of accessible and effective platforms for analyzing the rich dataset generated during
Cricket World Cup matches is a primary issue.
The goal is to make the data actionable by providing trends, patterns, and predictions
through an analytical platform.
Technical Feasibility:
o The project leverages Python and its robust ecosystem (pandas, NumPy, matplotlib,
seaborn) for data analysis.
o The availability of datasets from trusted sources like Kaggle ensures reliable data input.
o Using Jupyter Notebook, a platform suited for both development and presentation,
ensures high compatibility with the tools.
Operational Feasibility:
o The solution is designed for accessibility by analysts and enthusiasts with minimal
technical expertise.
o Step-by-step documentation ensures smooth operation even for users unfamiliar with
Jupyter Notebook.
Economic Feasibility:
o The open-source nature of Python and Jupyter Notebook keeps costs low.
o Future commercialization could target cricket enthusiasts, broadcasters, and analysts.
10
2.1.3Requirement Analysis
Functional:
o Data importation, cleaning, and preprocessing are crucial.
o Analysis capabilities must include both descriptive and predictive insights.
o Visualization features are essential for user engagement and comprehension.
Non-Functional:
o Responsiveness in processing large datasets is vital.
o Aesthetic and informative visualizations enhance user experience.
2.2System Design
Notebook Interface: Markdown cells for documentation and Python cells for execution.
Interactive Features: Widgets to customize dataset selection and analysis type.
Visualization Integration: Interactive charts for detailed exploration.
2.2.3 Implementation:
Data Cleaning: Use pandas to handle missing values, normalize formats, and remove
outliers.
Exploratory Data Analysis (EDA): Identify patterns using descriptive statistics and
visualizations.
Predictive Modeling: Employ machine learning algorithms for predicting match outcomes.
Visualization: Use matplotlib, seaborn, and Plotly for creating static and interactive plots.
11
2.2.4 Testing:
Unit Testing:
o Each module, including data preprocessing, analysis, and visualization components, was
tested individually to ensure proper functionality.
o Example: Validating that missing values are correctly handled during the data cleaning
process.
Integration Testing:
o Ensures smooth interaction between different modules, such as seamless transitions from
data preprocessing to analysis and visualization.
o Example: Testing whether processed data flows correctly into the predictive modeling
and visualization modules.
System Testing:
Performance Testing:
o Evaluates system performance under varying loads, such as handling large datasets or
complex predictive models.
o Tools like Python’s time module were used to measure execution time for different
processes.
o Conducted with a group of cricket analysts and enthusiasts to validate the usability and
functionality of the Jupyter Notebook interface.
o Feedback was collected on user experience, clarity of visualizations, and the accuracy of
predictions.
12
2.2.5 Deployment:
Local Deployment:Users can run the Jupyter Notebook on their local machines using
Python and its libraries.Recommended tools include Anaconda or standalone Python
installations.
Cloud Deployment:Host the Jupyter Notebook on platforms like Google Colab, Binder,
or JupyterHub for wider accessibility.These platforms provide users with pre-configured
environments to execute the analysis without requiring local setup.
2.2.6 Maintenance:
Bug Fixes:
o Regularly monitor and resolve issues reported by users or identified during runtime.
o Ensure compatibility with updates to Python libraries or dependencies.
Data Updates:
o Continuously update datasets with the latest Cricket World Cup statistics.
o Implement mechanisms to automatically fetch and integrate real-time data.
Documentation:
o Maintain up-to-date documentation for installation, usage, and troubleshooting.
o Include changelogs to track system updates.
7 Upgrades
Feature Enhancements:
o Add new analysis features, such as player comparisons and historical win probability
analysis.
o Introduce more advanced visualizations using tools like Tableau or Power BI
integrations.
Scalability:
o Optimize system performance for handling larger datasets and more complex
analyses.
o Transition to a cloud-based infrastructure for improved accessibility and resource
scalability.
13
Machine Learning Upgrades:
o Integrate more sophisticated machine learning models, such as neural networks, for
predictive analytics.
o Provide personalization options for users to tailor analysis based on specific interests
(e.g., favorite teams or players).
The project incorporates the following models and methodologies to achieve its objectives:
Descriptive Statistics: Provides insights into the data, such as averages, medians, and
standard deviations, to summarize historical trends in the Cricket World Cup.
Correlation Analysis: Evaluates relationships between variables like team performance,
player stats, and match outcomes.
Graphical Models:
o Bar Graphs and Line Charts: Display trends like team performance over multiple
years.
o Heatmaps: Highlight player performances and match factors (e.g., batting vs.
bowling impact).
o Interactive Dashboards: Built using Plotly to allow dynamic exploration of data.
14
2.3.4 Exploratory Data Analysis (EDA)
Uses Python libraries like pandas, NumPy, and seaborn to discover insights and generate
hypotheses for predictive analysis.
Jupyter Notebook:
o Jupyter Notebook (via Anaconda or pip) must be installed.
o Python 3.7 or higher.
Python Libraries:
o Pandas (for data manipulation and analysis).
15
Web Browser:
o Chrome, Firefox, or any modern browser for accessing the Jupyter Notebook
interface.
Processor:
RAM:
o Minimum: 8 GB RAM.
o Recommended: 16 GB RAM or more.
Storage:
Minimum: 100 GB of free disk space.For storing datasets, notebooks, and libraries.SSD
(Solid State Drive) is recommended for faster data access and operations.
Graphics:
16
Fig 2.1 (IMPORT JUPYTER NOTEBOOK)
17
Fig. 2.2 (DATA FLOW DIAGRAM)
18
Fig. 2.3 (ER – DIAGRAM)
19
CHAPTER – 3
WORKING OF PROJECT
The project titled "Data Analysis on the Cricket World Cup using Jupyter Notebook" involves
analyzing historical data from the Cricket World Cup tournaments to extract valuable insights
regarding team performance, player statistics, trends, and various other aspects. The primary aim of
this project is to use data analysis techniques to uncover patterns in the performance of teams and
players in different Cricket World Cup editions. We use Jupyter Notebook as the main platform for
conducting the analysis, as it offers an interactive and easy-to-use environment for data
manipulation, visualization, and modeling.
1.Objectives
Data Collection: Gather data on Cricket World Cup tournaments (matches, players,
statistics, etc.) from reliable sources like APIs, CSV files, and websites.
Data Cleaning and Preprocessing: Clean and format the data for analysis by handling
missing values, correcting errors, and transforming data into usable formats.
Data Analysis: Use Python and libraries such as Pandas, NumPy, and Scikit-learn to analyze
and model the data.
Visualization: Create interactive and informative visualizations using libraries like
Matplotlib, Seaborn, and Plotly to help interpret the analysis.
Insight Generation: Provide insights and trends related to the Cricket World Cup, such as
top-performing teams, players, and significant match statistics.
2.Target Audience
Cricket Enthusiasts: Individuals who follow cricket at all levels, from amateur fans to
professional spectators, will find the insights from the data analysis valuable. This
audience is keen on learning more about team and player performances, historical trends,
and predictions related to Cricket World Cup tournaments.Understanding how teams and
20
players have performed over the years, exploring trends, and gaining statistical insights
into the game.
Cricket Teams and Coaches: Coaches, team managers, and analysts working with cricket
teams (either professional or amateur) who wish to analyze past performance, evaluate
players, and gain insights that could inform future strategies.
3.Benefits
In-Depth Understanding of Cricket Performance: The project provides a deep dive into
the historical data of the Cricket World Cup, revealing performance trends, key factors
influencing outcomes, and identifying top-performing teams and players.
Data-Driven Insights for Decision Making: By analyzing cricket performance data, the
project generates data-driven insights that can help make informed decisions.
Predictive Analytics for Future World Cups: The use of statistical models (e.g., logistic
regression, machine learning) allows for predictions regarding match outcomes, player
performance, and team dynamics.
21
Functions Used In Jupyter Notebook:
Data Loading
pd.read_csv(): Load data from a CSV file.
pd.read_excel(): Load data from an Excel file.
pd.read_sql(): Load data from a SQL query or database.
pd.read_json(): Load data from a JSON file.
Data Viewing
df.head(n): View the first n rows of the Data Frame.
df.tail(n): View the last n rows of the Data Frame.
df.info(): Display a summary of the Data Frame.
df.describe(): Show statistical summary of numerical columns.
Data Selection
df['column_name']: Select a single column.
df[['col1', 'col2']]: Select multiple columns.
df.loc[row_labels, column_labels]: Select by label.
df.iloc[row_indices, column_indices]: Select by index.
Data Filtering
df[df['column'] > value]: Filter rows based on a condition.
df.query('column > value'): Query rows using a string expression.
Data Aggregation
df.groupby('column'): Group rows by a column.
df['column'].sum(): Sum values in a column.
df['column'].mean(): Compute the mean of a column.
Data Transformation
df['new_column'] = df['column'] * 2: Create or modify columns.
df.rename(columns={'old_name': 'new_name'}): Rename columns.
df.drop(columns=['col1', 'col2']): Drop specified columns.
df.sort_values('column'): Sort Data Frame by a column.
Array Creation
np.array(): Create an array.
np.zeros(), np.ones(), np.random.rand(): Create arrays with specific values or random
numbers.
Array Manipulation
np.reshape(): Change the shape of an array.
np.concatenate(): Join arrays along an axis.
np.split(): Split an array into sub-arrays.
Mathematical Operations
np.sum(), np.mean(), np.std(): Compute sum, mean, and standard deviation.
np.dot(): Perform matrix multiplication.
np.linalg.inv(): Calculate the inverse of a matrix.
23
Steps to download the Jupyter Notebook :
Description: Python is a versatile, high-level programming language widely used for data
analysis, machine learning, and web development.
Usage: Python is used for scripting, data manipulation, statistical modeling, and creating
visualizations in this project. Python’s rich ecosystem of libraries makes it well-suited for
data science tasks.
3. Pandas
Description: Pandas is a powerful Python library for data manipulation and analysis,
particularly for structured data such as tabular data (CSV, Excel, SQL, etc.).
Usage: Pandas is used for data loading, cleaning, and preprocessing. It allows efficient
handling of large datasets, missing value imputation, data transformation, and filtering.
24
4. NumPy
Description: NumPy is a library for numerical computing in Python, providing support for
large, multi-dimensional arrays and matrices, along with a collection of mathematical
functions to operate on these arrays.
Usage: NumPy is used for performing mathematical operations and handling large numerical
datasets, which is crucial for calculations like averages, variances, and correlations in player
and team performance.
5. Matplotlib
Description: Matplotlib is a plotting library for Python that allows users to create static,
interactive, and animated visualizations.
Usage: Matplotlib is used to create basic visualizations such as bar charts, line graphs, and
histograms to represent player statistics, team performance, and match outcomes over time.
6. Plotly
Description: Plotly is an interactive graphing library for Python that is used to create
interactive, web-based visualizations.
Usage: Plotly is used to create interactive dashboards and visualizations that allow users to
explore and analyze the data dynamically. This is especially useful for visualizing player
statistics or team performance trends over different World Cup editions.
7. Scikit-learn
Description: Scikit-learn is a machine learning library for Python that provides simple and
efficient tools for data mining and data analysis, built on NumPy, SciPy, and Matplotlib.
Usage: Scikit-learn is used for building predictive models, including regression and
classification algorithms. For instance, logistic regression may be used to predict match
outcomes, and clustering techniques like K-means may be applied to group similar teams
based on performance metrics.
25
8. SQL / SQLite
Description: SQL (Structured Query Language) is used for querying relational databases,
while SQLite is a lightweight, self-contained SQL database engine that stores data in a file.
Usage: In case the data is stored in a relational database, SQL or SQLite is used to query and
extract relevant data for analysis. This can be useful when working with larger, structured
datasets.
3.3 WORKING:
Data Collection: Gather data from public datasets, APIs, or official sources about teams,
players, matches, and tournament results.
Data Cleaning and Preprocessing: Prepare the data by handling missing values,
normalizing formats, and creating new variables for analysis (e.g., averages, win margins).
Exploratory Data Analysis (EDA): Investigate the data using statistical summaries and
visualizations to uncover trends and patterns.
Visualization: Use Python libraries like Matplotlib, Seaborn, and Plotly to create
interactive and static charts (e.g., bar graphs, heatmaps, and line plots).
Predictive Analysis: Apply machine learning models (e.g., logistic regression,
clustering) to analyze and predict outcomes based on historical data.
Insights and Reporting: Compile findings into reports, highlighting critical insights and
presenting results through visual dashboards.
The project "Data Analysis on the Cricket World Cup using Jupyter Notebook" aims to explore and
analyze historical data from past tournaments to uncover meaningful insights about team performances,
player contributions, and match trends. Data is collected from reliable sources such as CSV files or web
scraping and is cleaned and processed using libraries like Pandas. This includes handling missing values,
removing duplicates, and engineering features like win percentages and net run rates. Exploratory Data
Analysis (EDA) is conducted to visualize team statistics, player achievements, and match outcomes using
Matplotlib, Seaborn, or Plotly. Insights are drawn on topics such as the most successful teams, top-
performing players, venue-specific patterns, and the importance of toss outcomes. Advanced analysis,
such as clustering teams or studying correlations, may also be performed. The results are summarized
into actionable insights, emphasizing historical trends and key moments in the tournament's history.
Finally, findings are shared through reports or interactive dashboards, providing a comprehensive
understanding of the Cricket World Cup's evolution and its data-driven narratives.
26
CHAPTER – 4
RESULT & DISCUSSION
4.1 Result
The analysis of the Cricket World Cup data yielded several valuable insights, trends, and patterns
across teams, players, and matches. Below are the key results organized by topic:
2. Player Performance
Top Performers:
o Leading batsmen like Sachin Tendulkar and Ricky Ponting scored the most runs
across multiple World Cups.
o Bowlers like Glenn McGrath and Muttiah Muralitharan took the highest wickets,
showcasing consistent performance.
All-Round Impact:
o All-rounders like Jacques Kallis and Shakib Al Hasan were pivotal in both batting
and bowling for their respective teams.
Strike Rates and Averages:
o Modern players showed an increasing trend in strike rates compared to past players,
indicating a shift toward aggressive batting styles.
27
3. Match Insights
High-Scoring Matches:
o Recent tournaments showed an increase in match totals, with scores above 300
becoming more common due to better pitches and powerplay utilization.
Toss Impact:
o Teams winning the toss had a slight advantage, with a higher percentage of wins in
matches where they batted second.
Venue Influence:
o Certain venues favored spinners (e.g., subcontinent pitches), while others benefited
pacers (e.g., Australian and English grounds).
4. Statistical Insights
5. Predictive Analysis
28
4.2 Discussion
1. Evolution of Cricket Strategies
Aggressive Batting:
o The increase in strike rates and higher team totals reflect the shift from defensive to
aggressive batting strategies.
o Innovations like the use of powerplays and shorter boundaries have influenced
scoring patterns.
Bowling Adaptations:
o Bowlers have adapted with variations like slower balls, yorkers, and better use of
spin to counter aggressive batting.
Data Gaps:
o Some historical data may be incomplete or inconsistent, especially from older World
Cups.
Contextual Factors:
o The analysis does not fully account for factors like player injuries, psychological
pressure, or match-fixing allegations, which may influence results.
Predictive models, while reasonably accurate, can only account for historical trends and fail
to predict real-time external factors like weather, injuries, or player form on the day of the
match.
29
CHAPTER – 5
CONCLUSION & FUTURE SCOPE
5.1 CONCLUSION
The project, "Data Analysis on Cricket World Cup Using Jupyter Notebook," provided a
comprehensive exploration of historical Cricket World Cup data, offering meaningful
insights into team performances, player contributions, and match dynamics. By utilizing
Python’s data analysis and visualization libraries, the project uncovered trends such as the
increasing dominance of aggressive batting strategies, evidenced by rising strike rates and
higher match totals, and the adaptation of bowling techniques to counter these changes. It
highlighted key factors like toss decisions, pitch conditions, and venue advantages, which
significantly influenced match outcomes, and revealed the consistent performances of
legendary players like Sachin Tendulkar and Glenn McGrath, alongside the dominance of
teams such as Australia and India.
Predictive modeling added another layer of value, with machine learning algorithms
achieving 75-80% accuracy in forecasting match outcomes based on historical data. This
demonstrated the practical application of data science in cricket analytics, offering potential
tools for teams, analysts, and enthusiasts to better understand the game’s dynamics.
The project also serves as a learning platform for data science practitioners, showcasing
techniques like data cleaning, visualization, and statistical analysis in a real-world context.
Future enhancements could include real-time data integration for live analysis, advanced
predictive models for greater accuracy, and interactive dashboards for dynamic user
engagement. Expanding the analysis to other cricket formats, such as T20 leagues and
bilateral series, would further broaden its applicability.
In conclusion, this project bridges the gap between raw sports data and actionable insights,
illustrating how data science can revolutionize the way cricket is analyzed, understood, and
appreciated. By offering a detailed understanding of the sport's evolution and strategies, it
paves the way for more informed decision-making and greater fan engagement.
30
5.2 FUTURE SCOPE
b. Interactive Dashboards
Build interactive dashboards using tools like Dash, Streamlit, or Tableau for real-time
visualization and analysis.
Provide features for filtering data by team, player, venue, or specific tournaments to make
the analysis user-friendly.
c. Sentiment Analysis
Perform sentiment analysis on social media or news articles related to the Cricket World
Cup.
Correlate public opinion and sentiment trends with team performances and key moments.
31
CHAPTER – 6
REFERENCE
2. chatGPT 4.0
3. Google.
4. BlackboxAI .
32
33
33
4