Lab 2 - Basic Statistical Analysis
0.1 Imports
0.1.1 Step 1: Import Required Libraries
Import essential libraries for data manipulation, visualization, and statistics.
[1]: import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
[3]: data.head()
0.2 Exploratory Data Analysis (EDA)
0.2.1 Step 3: Basic Dataset Information
Display basic information about the dataset, such as column names and data types.
[4]: print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 roll_no 9999 non-null object
1 gender 9982 non-null object
2 race_ethnicity 9977 non-null object
3 parental_level_of_education 9978 non-null object
4 lunch 9976 non-null float64
5 test_preparation_course 9977 non-null float64
6 math_score 9976 non-null float64
7 reading_score 9975 non-null float64
8 writing_score 9976 non-null float64
9 science_score 9977 non-null float64
10 total_score 9981 non-null float64
11 grade 9997 non-null object
dtypes: float64(7), object(5)
memory usage: 937.6+ KB
None
Descriptive Statistics:
lunch test_preparation_course math_score reading_score \
count 9976.000000 9977.000000 9976.000000 9975.000000
mean 0.644246 0.388694 57.177125 70.125915
std 0.478765 0.487478 21.746777 19.026245
min 0.000000 0.000000 0.000000 17.000000
25% 0.000000 0.000000 41.000000 57.000000
50% 1.000000 0.000000 58.000000 71.000000
75% 1.000000 1.000000 73.000000 85.000000
max 1.000000 1.000000 100.000000 100.000000
writing_score science_score total_score
std 18.245360 19.324331 42.304858
min 10.000000 9.000000 89.000000
25% 59.000000 53.000000 237.000000
50% 72.500000 67.000000 268.000000
75% 85.000000 81.000000 294.000000
max 100.000000 100.000000 383.000000
Better Approach: Fill missing numerical values with the mean and categorical values
with the mode.
[8]: numerical_cols = data.select_dtypes(include=[np.number]).columns
categorical_cols = data.select_dtypes(include=["object"]).columns
data[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())
data[categorical_cols] = data[categorical_cols].fillna(data[categorical_cols].mode().iloc[0])
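To see what this imputation does, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are hypothetical and are not taken from the lab dataset. The sketch also selects the categorical columns with `exclude=[np.number]` rather than `include=["object"]`, a small variation assumed here so the selection does not depend on how pandas stores strings.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "math_score": [40.0, 60.0, np.nan, 80.0],
    "gender": ["female", "male", "female", None],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

# Numerical gaps get the column mean; categorical gaps get the column mode
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
print(toy.isnull().sum())
```

The missing `math_score` becomes 60.0 (the mean of 40, 60, and 80) and the missing `gender` becomes "female" (the most frequent value). Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when reading the standard deviations later.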
Missing Values After Handling:
roll_no 0
gender 0
race_ethnicity 0
parental_level_of_education 0
lunch 0
test_preparation_course 0
math_score 0
reading_score 0
writing_score 0
science_score 0
total_score 0
grade 0
dtype: int64
[11]: subjects = ['math_score', 'reading_score', 'writing_score', 'science_score']
for subject in subjects:
    plt.figure(figsize=(3, 2))
    ax = sns.histplot(data[subject], kde=True, bins=20)
    plt.title(f'Distribution of {subject.capitalize()}')
    plt.xlabel(subject.capitalize())
    plt.ylabel('Frequency')
    plt.show()
0.4 Probability & Statistics Questions
0.4.1 Step 10: Calculate Z-scores
Example: Calculate the probability that a student scores above 300 in total scores.
z = z_score(300)
probability_above_300 = 1 - (0.5 * (1 + math.erf(z / np.sqrt(2))))
print(f"Probability of scoring above 300: {probability_above_300 * 100:.2f}%")
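The definition of `z_score` does not appear in this part of the notebook. The sketch below assumes the standard definition, z = (x - mean) / std, and takes the mean and standard deviation as explicit arguments. The helper `prob_above` and the values of `example_mean` and `example_std` are hypothetical, chosen only to make the arithmetic easy to check. The calculation also assumes that total scores are approximately normally distributed.

```python
import math


def z_score(value, mean, std):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std


def prob_above(value, mean, std):
    """Upper-tail probability P(X > value) for a normal distribution."""
    z = z_score(value, mean, std)
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 1 - cdf


# Hypothetical parameters, not the statistics of the lab dataset
example_mean = 260.0
example_std = 40.0

z = z_score(300, example_mean, example_std)
p = prob_above(300, example_mean, example_std)
print(f"z = {z:.2f}")                        # z = 1.00
print(f"P(score > 300) = {p * 100:.2f}%")    # about 15.87%
```

With the real data, the two parameters would presumably come from `data['total_score'].mean()` and `data['total_score'].std()`.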
0.4.2 Step 11: Solve Statistical Problems
Example: What percentage of students score between 250 and 350?
[13]: z1 = z_score(250)
z2 = z_score(350)
probability_between = (0.5 * (1 + math.erf(z2 / np.sqrt(2)))) - (0.5 * (1 + math.erf(z1 / np.sqrt(2))))
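The expression above is the difference of two normal CDF values, P(a < X < b) = Φ(z2) - Φ(z1). As a sketch, the repeated `erf` term can be factored into a small helper. The names `normal_cdf` and `prob_between`, and the mean and standard deviation used below, are hypothetical and serve only to give a result that is easy to verify by hand.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_between(low, high, mean, std):
    """P(low < X < high) for a normal distribution with the given mean and std."""
    z_low = (low - mean) / std
    z_high = (high - mean) / std
    return normal_cdf(z_high) - normal_cdf(z_low)


# Hypothetical parameters: the interval spans one standard deviation
# on either side of the mean, so the result should be about 68.27%.
p_between = prob_between(250, 350, mean=300.0, std=50.0)
print(f"Percentage between 250 and 350: {p_between * 100:.2f}%")
```

As with the previous step, the result is only as good as the assumption that total scores follow a normal distribution; comparing it with the empirical share, for example `data['total_score'].between(250, 350).mean()`, is a quick sanity check.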