Lab 2 - Basic Statistical Analysis

December 12, 2024

0.1 Imports
0.1.1 Step 1: Import Required Libraries
Import essential libraries for data manipulation, visualization, and statistics.
[1]: import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

0.1.2 Step 2: Load the Dataset


Load the student performance dataset from the specified CSV file.
[2]: file_path = "dataset/Student_performance_10k.csv"
data = pd.read_csv(file_path)

[3]: data.head()

[3]:   roll_no gender race_ethnicity parental_level_of_education  lunch \
     0  std-01   male        group D                some college    1.0
     1  std-02   male        group B                 high school    1.0
     2  std-03   male        group C             master's degree    1.0
     3  std-04   male        group D                some college    1.0
     4  std-05   male        group C                some college    0.0

        test_preparation_course  math_score  reading_score  writing_score \
     0                      1.0        89.0           38.0           85.0
     1                      0.0        65.0          100.0           67.0
     2                      0.0        10.0           99.0           97.0
     3                      1.0        22.0           51.0           41.0
     4                      1.0        26.0           58.0           64.0

        science_score  total_score grade
     0           26.0        238.0     C
     1           96.0        328.0     A
     2           58.0        264.0     B
     3           84.0        198.0     D
     4           65.0        213.0     C

0.2 Exploratory Data Analysis (EDA)
0.2.1 Step 3: Basic Dataset Information
Display basic information about the dataset, such as column names and data types.
[4]: print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 roll_no 9999 non-null object
1 gender 9982 non-null object
2 race_ethnicity 9977 non-null object
3 parental_level_of_education 9978 non-null object
4 lunch 9976 non-null float64
5 test_preparation_course 9977 non-null float64
6 math_score 9976 non-null float64
7 reading_score 9975 non-null float64
8 writing_score 9976 non-null float64
9 science_score 9977 non-null float64
10 total_score 9981 non-null float64
11 grade 9997 non-null object
dtypes: float64(7), object(5)
memory usage: 937.6+ KB
None

0.2.2 Step 4: Statistical Summary


Use describe() to compute summary statistics for numerical columns.
[5]: print("\nDescriptive Statistics:")
print(data.describe())

Descriptive Statistics:
             lunch  test_preparation_course   math_score  reading_score \
count  9976.000000              9977.000000  9976.000000    9975.000000
mean      0.644246                 0.388694    57.177125      70.125915
std       0.478765                 0.487478    21.746777      19.026245
min       0.000000                 0.000000     0.000000      17.000000
25%       0.000000                 0.000000    41.000000      57.000000
50%       1.000000                 0.000000    58.000000      71.000000
75%       1.000000                 1.000000    73.000000      85.000000
max       1.000000                 1.000000   100.000000     100.000000

       writing_score  science_score  total_score
count    9976.000000    9977.000000  9981.000000
mean       71.415798      66.063045   264.740908
std        18.245360      19.324331    42.304858
min        10.000000       9.000000    89.000000
25%        59.000000      53.000000   237.000000
50%        72.500000      67.000000   268.000000
75%        85.000000      81.000000   294.000000
max       100.000000     100.000000   383.000000
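
The quartiles in this summary can also be used to flag unusually low or high totals. The following is a minimal sketch, assuming Tukey's 1.5 x IQR rule (a common convention, not something this lab prescribes). The helper name tukey_fences is hypothetical, and the quartiles are the total_score values printed above.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the first and third quartiles."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% rows of total_score, copied from the describe() output.
low, high = tukey_fences(237.0, 294.0)
print(f"Lower fence: {low}, upper fence: {high}")
```

Under this rule, the reported minimum (89) lies below the lower fence and the reported maximum (383) lies above the upper fence, so both tails contain values that may deserve a closer look.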

0.2.3 Step 5: Check for Missing Values


Identify the total number of missing values in each column.
[6]: missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)

Missing Values in Each Column:


roll_no 1
gender 18
race_ethnicity 23
parental_level_of_education 22
lunch 24
test_preparation_course 23
math_score 24
reading_score 25
writing_score 24
science_score 23
total_score 19
grade 3
dtype: int64
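
Raw counts are easier to judge as a share of the dataset. The sketch below converts the counts printed above into percentages of the 10,000 rows reported by data.info(). The dictionary is copied by hand from that output and missing_percent is a hypothetical helper; on the live DataFrame, data.isnull().mean() * 100 should give the same figures.

```python
# Missing-value counts copied from the output above.
missing_counts = {
    "roll_no": 1,
    "gender": 18,
    "race_ethnicity": 23,
    "parental_level_of_education": 22,
    "lunch": 24,
    "test_preparation_course": 23,
    "math_score": 24,
    "reading_score": 25,
    "writing_score": 24,
    "science_score": 23,
    "total_score": 19,
    "grade": 3,
}
n_rows = 10_000  # RangeIndex size reported by data.info()

def missing_percent(counts, total_rows):
    """Map each column to the percentage of rows where it is missing."""
    return {col: 100 * n / total_rows for col, n in counts.items()}

pct = missing_percent(missing_counts, n_rows)
worst = max(pct, key=pct.get)  # column with the largest share of gaps
print(f"Largest gap: {worst} ({pct[worst]:.2f}%)")
```

Every column is missing in at most 0.25% of rows, which is why either strategy in the next step changes the dataset only slightly.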

0.2.4 Step 6: Handle Missing Values


Simple approach: drop every row that contains a missing value (not preferred, since the valid fields in that row are discarded too).
[7]: # Uncomment the following line to drop rows with missing values.
# data = data.dropna()

Better approach: fill missing numerical values with the column mean and missing categorical values with the column mode.
[8]: numerical_cols = data.select_dtypes(include=[np.number]).columns
categorical_cols = data.select_dtypes(include=["object"]).columns

data[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())
data[categorical_cols] = data[categorical_cols].fillna(data[categorical_cols].mode().iloc[0])
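
To see what the mean/mode fill does to individual cells, here is a small sketch on a toy frame. The values are hypothetical and not taken from the lab dataset; the column selection uses exclude=[np.number] for the categorical side, which is an assumption made here so the sketch does not depend on how a given pandas version labels string columns.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one gap in each column.
toy = pd.DataFrame({
    "gender": ["male", "female", None, "female"],
    "math_score": [80.0, np.nan, 60.0, 70.0],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

# Numeric gaps take the column mean; categorical gaps take the most frequent value.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

The missing math_score becomes 70.0 (the mean of 80, 60 and 70) and the missing gender becomes "female" (the most frequent value). One caveat of filling each column independently: a filled total_score need not equal the sum of that row's subject scores.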

Verify that no missing values remain.


[9]: print("\nMissing Values After Handling:")
print(data.isnull().sum())

Missing Values After Handling:
roll_no 0
gender 0
race_ethnicity 0
parental_level_of_education 0
lunch 0
test_preparation_course 0
math_score 0
reading_score 0
writing_score 0
science_score 0
total_score 0
grade 0
dtype: int64

0.3 Visualization: Distributions


0.3.1 Step 7: Distribution of Grades
Plot the distribution of grades to see how many students received each grade.
[10]: plt.figure(figsize=(4, 2))
sns.countplot(x='grade', data=data, palette='magma', hue="grade")
plt.title('Distribution of Grades')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()

0.3.2 Step 8: Individual Subject Score Distributions


Visualize the distribution of scores for each subject.

[11]: subjects = ['math_score', 'reading_score', 'writing_score', 'science_score']
for subject in subjects:
    # Turn 'math_score' into 'Math Score' for titles and axis labels.
    label = subject.replace('_', ' ').title()
    plt.figure(figsize=(3, 2))
    sns.histplot(data[subject], kde=True, bins=20)
    plt.title(f'Distribution of {label}')
    plt.xlabel(label)
    plt.ylabel('Frequency')
    plt.show()

0.4 Probability & Statistics Questions
0.4.1 Step 10: Calculate Z-scores
Example: Calculate the probability that a student's total score is above 300.

Note: The expression 0.5 * (1 + math.erf(z / np.sqrt(2))) is the standard normal cumulative distribution function, i.e. the area under the curve from −∞ to z for a given z-score. You can also look this value up in a Z-score table. The calculation assumes that total scores are approximately normally distributed.
[12]: def z_score(value):
    # Standardize a total score using the dataset mean and standard deviation.
    return (value - data['total_score'].mean()) / data['total_score'].std()

z = z_score(300)
probability_above_300 = 1 - (0.5 * (1 + math.erf(z / np.sqrt(2))))
print(f"Probability of scoring above 300: {probability_above_300 * 100:.2f}%")

Probability of scoring above 300: 20.21%
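
As a cross-check on the erf expression, the sketch below compares it with NormalDist from Python's standard statistics module. The mean and standard deviation are the total_score values from the describe() output, which were computed before missing values were filled, so the result may differ slightly from the 20.21% shown above. The names phi and prob_above are hypothetical.

```python
import math
from statistics import NormalDist

def phi(z):
    """Standard normal CDF written with the error function, as in the lab."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_above(value, mean, std):
    """P(X > value) for X ~ Normal(mean, std)."""
    z = (value - mean) / std
    return 1 - phi(z)

# total_score mean and std from describe(), i.e. before imputation.
mean, std = 264.740908, 42.304858

p = prob_above(300, mean, std)
p_check = 1 - NormalDist(mean, std).cdf(300)  # same quantity via the stdlib

print(f"erf formula: {p:.4f}")
print(f"NormalDist:  {p_check:.4f}")
```

Both routes should agree to within floating-point error. Comparing the result with the observed share, (data['total_score'] > 300).mean(), is a quick way to judge how well the normal assumption fits.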

0.4.2 Step 11: Solve Statistical Problems
Example: What percentage of students score between 250 and 350?
[13]: z1 = z_score(250)
z2 = z_score(350)
probability_between = (0.5 * (1 + math.erf(z2 / np.sqrt(2)))) - (0.5 * (1 + math.erf(z1 / np.sqrt(2))))

print(f"Percentage of students scoring between 250 and 350: {probability_between * 100:.2f}%")

Percentage of students scoring between 250 and 350: 61.45%
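
The same subtraction of two cumulative areas works for any interval. Here is a minimal sketch of a reusable helper, with a sanity check against the familiar rule that about 68% of a normal distribution lies within one standard deviation of the mean. The names normal_cdf and prob_between are hypothetical, and the mean and standard deviation again come from the describe() output (before imputation), so the interval result may differ slightly from 61.45%.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_between(low, high, mean, std):
    """P(low < X < high) for X ~ Normal(mean, std)."""
    z_low = (low - mean) / std
    z_high = (high - mean) / std
    return normal_cdf(z_high) - normal_cdf(z_low)

# total_score mean and std from describe(), i.e. before imputation.
mean, std = 264.740908, 42.304858

p_250_350 = prob_between(250, 350, mean, std)
p_one_sigma = prob_between(mean - std, mean + std, mean, std)

print(f"P(250 < X < 350)      = {p_250_350:.4f}")
print(f"P(within 1 std of mean) = {p_one_sigma:.4f}")
```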
