
DELHI PUBLIC SCHOOL, HYDERABAD

DATA SCIENCE

PROJECT II –
PREDICT BASKETBALL PLAYER
EFFICIENCY RATINGS USING
MACHINE LEARNING AND VISUAL
STUDIO CODE

GRADE XII

CBSE BOARD ROLL NUMBER:

Academic Year: 2024 - 25


Title of the Project: Project II - Prediction Model

Name of the Student: Harshith Dunna

Class & Section: XII D

Batch: 2024 - 2025

Subject Teacher: Ms. Veena Hegde


CERTIFICATE FROM THE SCHOOL

This is to certify that


of class XII, Delhi Public School, Hyderabad, has done this project as a part of the Data Science (844) curriculum issued by CBSE.

___________ has shown sincerity and utmost care in the completion of this project. I certify that this project is up to my expectations and as per the guidelines issued by the CBSE.

Internal Examiner External Examiner


I, __________, do hereby declare that this project was implemented by me, and I would like to thank Ms. Veena Hegde for her wholehearted support and guidance in making it possible to complete this project on time.

I would also like to thank Microsoft for all the study materials.

I also thank the Central Board of Secondary Education (CBSE) for designing the curriculum in such a manner that it provided a wonderful opportunity to gain hands-on experience and guidance.

Name:
Signature:
PROJECT II
INDEX

1. Introduction

2. Set up your local environment for data science coding

3. Data cleansing part 1 – Find missing values

4. Data cleansing part 2 – Drop unnecessary columns and rows

5. Data exploration part 1 – Identify outliers

6. Data exploration part 2 – Check the distribution of data

7. Data exploration part 3 – Find data representing more than one population

8. Data manipulation part 1 – Add relevant player information

9. Data manipulation part 2 – Fill in missing values in specific columns

10. Data manipulation part 3 – Use machine learning to predict and fill in missing data

11. Knowledge Check

12. Conclusion
INTRODUCTION TO THE PROJECT

We can write code using Visual Studio Code to study and predict the performance of different basketball players in the film Space Jam: A New Legacy.

We will use some tools and methods from data science and machine learning, explained below:

● Use Python, pandas, and Visual Studio Code to look at basketball stats.

● Apply machine learning to clean and fill in any missing data in the datasets.

● Learn how to find patterns in data for both human and Tune Squad basketball players.
SETUP LOCAL PYTHON
ENVIRONMENT IN VISUAL STUDIO
CODE

1) Create a folder named ‘space-jam-anl’ anywhere on the computer.

2) Open Visual Studio Code and open the folder you created.

3) Create a file named ‘space-jam-anl.ipynb’ in the ‘space-jam-anl’ folder.

4) Ensure the file opens as a notebook, the Jupyter server is connected, and the kernel points to the correct Python version.

DOWNLOAD DATA FOR BASKETBALL PLAYERS

5) On GitHub, download the CSV file player_data.csv.

6) Save the player_data.csv file in your space-jam-anl folder.

7) Open the CSV file in Visual Studio Code to view it.


DATA CLEANSING PART 1 - FIND
MISSING VALUES:

● Explore the data (use the code below):

import pandas as pd
player_df = pd.read_csv('player_data.csv')

Output:

● Look for missing values, using the isna() function with the DataFrame:

Output:

● Print out the DataFrame information:


Output:
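Putting these steps together, a minimal sketch of the exploration code (standard pandas calls on the DataFrame loaded above):

# Peek at the first five rows of the DataFrame.
player_df.head()

# Count the missing (NaN) values in each column.
player_df.isna().sum()

# Print column names, non-null counts, and data types.
player_df.info()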
DATA CLEANSING PART 2 - DROP
COLUMNS AND ROWS

To drop columns, you'll use the dropna() method:

● By default, dropna() removes rows, so specify that you want to remove columns by using the axis parameter.

● The dropna() method usually returns a new DataFrame.

● Use the inplace parameter to tell it to drop these columns in the original player_df DataFrame.

● Use dropna() to remove only the columns in which all the values are missing, by setting the how parameter to ‘all’.

● The thresh parameter refers to a threshold. It lets us set the minimum number of non-NaN values a row or column needs to avoid being dropped by dropna().

● To remove the sparsely filled rows from the DataFrame, set thresh to 12 (see the sketch after this list).

● The index then counts 0 through 10, skipping 8. The row that had the index of 8 was dropped because it had more than two NaN values.

● With 14 columns, rows that had three or more NaN values didn't meet the threshold of 12.
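A minimal sketch of these dropna() calls, following the parameters described above:

# Drop columns in which every value is missing,
# modifying player_df in place instead of returning a copy.
player_df.dropna(axis='columns', how='all', inplace=True)

# Drop rows with fewer than 12 non-NaN values
# (with 14 columns, that is any row with 3 or more NaNs).
player_df.dropna(thresh=12, inplace=True)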
DATA EXPLORATION PART 1 -
CHECKING FOR OUTLIERS

● Outliers are data values so far outside the distribution of other values that they bring into question whether they even belong in the dataset.

● Outliers often arise from data errors or other undesirable noise.

● We need to check for and deal with possible outliers before we analyze the data.

● A quick way to identify outliers is to use the pandas describe() function.
● Make a list of the column names, excluding ID. The list is used to find specific values within each row.

● Create a matrix of subplots so that one figure shows all 13 columns.

● Add padding around the subplots to make them easier to read.

● Create a box plot based on the data in each column, across all the rows (see the sketch below).
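A sketch of the box-plot matrix, assuming the identifier column is named 'ID' and 13 stat columns remain (a 5x3 grid leaves two subplots empty):

import matplotlib.pyplot as plt

# List of the column names, excluding the ID column.
cols = list(player_df.columns)
cols.remove('ID')

# One figure with a matrix of subplots, one per column.
fig, axes = plt.subplots(5, 3, figsize=(18, 11))

# Padding around the subplots to make them easier to read.
fig.tight_layout(pad=2.0)

# A box plot per column, across all the rows.
for ax, col in zip(axes.flatten(), cols):
    ax.boxplot(player_df[col].dropna())
    ax.set_title(col)

plt.show()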
The tail() function shows the last five rows of a DataFrame.

Reset the index of the DataFrame so that the row labels stay consecutive and consistent with the data (see the sketch below).
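A sketch of those two calls:

# Show the last five rows of the DataFrame.
player_df.tail()

# Reset the index so the row labels are consecutive again,
# discarding the old index instead of keeping it as a column.
player_df.reset_index(drop=True, inplace=True)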
DATA EXPLORATION PART 2 - CHECK
THE DISTRIBUTION OF THE DATA

● A common way to visualize the distribution of data is a histogram.

● A histogram is a bar chart that shows how many times the data in a dataset appears within a range of values. These ranges are called bins.
Create kernel-density estimates (KDEs) of the DataFrame data, using a for loop to generate a matrix of KDE plots for all the columns (a sketch follows below).
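A sketch of the KDE loop, reusing the cols list from the outlier step:

import matplotlib.pyplot as plt

# A matrix of subplots, filled in by a for loop.
fig, axes = plt.subplots(5, 3, figsize=(18, 11))
fig.tight_layout(pad=2.0)

for ax, col in zip(axes.flatten(), cols):
    # Kernel-density estimate of this column's distribution.
    player_df[col].dropna().plot.kde(ax=ax)
    ax.set_title(col)

plt.show()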
● Each peak represents a mode of the data, or a value around which values in the dataset concentrate.

● The fact that so many of the columns are bimodal indicates that the dataset represents samples from two discrete populations.
DATA EXPLORATION PART 3 - DATA
THAT REPRESENTS MORE THAN ONE
POPULATION

Around 1,600 points, the two populations split. Select the rows where players scored more than 1,600 points:
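A sketch of the filter, assuming the points column is named 'points' in the dataset:

# Rows where players scored more than 1,600 points.
player_df[player_df['points'] > 1600]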
The PER for player 34 is shown as NaN because no value was given for that player.

Now that we have found the outliers, it's time to manipulate the data a little.
DATA MANIPULATION PART 1 - ADD
QUALIFYING PLAYERS
INFORMATION

● We identified the groups of players by examining the bimodal histograms.

● Create a column to indicate whether a row represents a human or a Tune Squad player.

● Give each row a unique ‘name’.

● Create the new column for the DataFrame by making a list of values for the column and then assigning the column a name.
Adding the list of strings to the DataFrame.

Changing the column order:

Output:
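Putting these steps together, a sketch with hypothetical labels (how you name humans versus Tune Squad players depends on the split you found; the column name 'name' is also an assumption):

# Hypothetical list of labels, one per row of the DataFrame.
names = ['player' + str(i) for i in range(len(player_df))]

# Assign the list as a new column called 'name'.
player_df['name'] = names

# Move 'name' to the front by reordering the column list.
column_order = ['name'] + [c for c in player_df.columns if c != 'name']
player_df = player_df[column_order]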
DATA MANIPULATION PART 2 -
IMPUTE MISSING VALUES FOR
COLUMNS

Check the columns that still have missing values:

Revisit the histograms for GP and MPG.

Impute the missing values by using average values, as shown in the sketch below.
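A sketch of average-based imputation for the two columns (using the column mean; the median would work similarly if the distribution suggests it):

# Replace missing GP and MPG values with each column's mean.
player_df['GP'] = player_df['GP'].fillna(player_df['GP'].mean())
player_df['MPG'] = player_df['MPG'].fillna(player_df['MPG'].mean())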

The data is almost clean now. We have only one column left to fill in, using a machine learning algorithm.
DATA MANIPULATION PART 3 -
IMPUTE MISSING VALUES BY USING
MACHINE LEARNING

• player_df.isna().sum() confirms that nine missing values remain in PER.

• We cannot use a simple average to impute the values in the PER column.

• PER is computed from the values of the nine columns that precede it in the DataFrame (GP through REBR).

• To get some sense of a model's accuracy, you could use machine learning to split your data into two subsets: training and test.

• The training subset is the portion of the data you use to train the model; the other subset is used to test the model.

• For example, 75 percent of the data might be used to train the model and 25 percent to test it.

• To achieve this, we can use a technique called cross-validation.

• The idea is to iterate through the dataset, splitting the data in different ways between training data and test data.

• The following image provides a visualization of the cross-validation process.
Cross-validate the R2 scores for the model

● I have used 10-fold cross-validation.

● Python will iterate through the data 10 times, each time reserving 10 percent of the data for testing and training on the other 90 percent.

● A histogram of the results is plotted (see the sketch below).
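A sketch of the cross-validation, assuming a scikit-learn linear regression trained on the nine stat columns (GP through REBR) to predict PER:

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Use only the rows where PER is known.
train_df = player_df.dropna(subset=['PER'])
X = train_df.loc[:, 'GP':'REBR']
y = train_df['PER']

# 10-fold cross-validated R2 scores for a linear regression.
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring='r2')

# Histogram of the 10 R2 scores.
plt.hist(scores)
plt.show()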
Output:

This model looks good because the R2 (R-squared) score is 99.95 percent.
Fit the regression model for the player data

● Fit the model using all of the data.

● General rule: use cross-validation for model selection or evaluation, but use all of the data for model building.
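Following that rule, fit the final model on all the rows where PER is present (a sketch, reusing X and y from the cross-validation step):

# Fit the regression model on all of the available data.
lin_reg = LinearRegression()
lin_reg.fit(X, y)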

Create a mask of the rows that have missing values in the DataFrame.

Use the mask and the fitted model to impute the final missing values in the DataFrame.

To check for any other missing values, print the entire dataset.
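A sketch of the mask and the imputation, assuming the same nine feature columns as before:

# Boolean mask: True where PER is missing.
mask = player_df['PER'].isna()

# Predict PER for the masked rows from their other stats,
# and write the predictions back into the DataFrame.
player_df.loc[mask, 'PER'] = lin_reg.predict(player_df.loc[mask, 'GP':'REBR'])

# Verify that no missing values remain anywhere.
player_df.isna().sum()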
Export the DataFrame to a new file as a CSV file.
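For example (the output filename is just a suggestion):

# Write the cleaned dataset to a new CSV file, omitting the index.
player_df.to_csv('player_data_final.csv', index=False)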

The new dataset will look like this:


KNOWLEDGE CHECK

CONCLUSION
In conclusion, this project helped me learn how to use
data science and machine learning to understand
basketball stats. I used Python, Pandas, and Visual
Studio Code to clean and analyze the data and predict
how players might perform. The steps in the project
made it easy to see how these tools can be used to
work with real-world data and find helpful insights.
Overall, it was a good way to learn about coding and
data analysis through basketball.
