
Final Project

This report presents a predictive analysis of student performance in Mathematics and Portuguese using two datasets from Portuguese high schools. It details the data preprocessing steps, model development using Linear Regression, Random Forest, and Gradient Boosting, and evaluates their performance based on RMSE, MAE, and R² scores. Key findings indicate that Random Forest and Gradient Boosting models outperform Linear Regression, with absences, study time, and past failures identified as significant predictors of student performance.

Uploaded by

Tommy smith

Data Analytics Final Project Report

Members’ names & contributions

• Somnang Pichmonireach: Exploratory Data Analysis, Model Development, Results and Evaluation,
Conclusion
• Houy K: Introduction, Data Preprocessing

1. Introduction
This report builds predictive models of student performance from two datasets, Math and
Portuguese. The goal is to predict each student's final grade.

The data were collected from two Portuguese high schools: Mousinho da Silveira (MS) and Gabriel
Pereira (GP). The analysis explores the factors that influence students' academic performance in
Mathematics and Portuguese language and predicts final grades with machine learning and statistical
models based on demographic, social, and academic features.

The data used for this report consist of two cleaned files:

1. student_math_clean.csv: Contains data on student performance in Mathematics.
2. student_portuguese_clean.csv: Contains data on student performance in Portuguese.

Both datasets include attributes describing each student's demographics, family background, school
environment, and academic history, along with grades for three assessment periods (grade_1, grade_2,
and final_grade). A total of 382 students appear in both datasets, although their IDs do not match.

2. Data Preprocessing
Before training the models, the data was preprocessed as follows:

• Dropped Columns: Columns not used for prediction (student_id, school, grade_1, and grade_2)
were dropped.
• One-Hot Encoding: Categorical features were converted to numerical values using one-hot
encoding.
• Feature Scaling: StandardScaler was used to standardize the data so that all features are on
the same scale.

# Apply one-hot encoding to categorical features (unneeded columns already dropped).
X_math = pd.get_dummies(X_math, drop_first=True)
X_portuguese = pd.get_dummies(X_portuguese, drop_first=True)

# Standardize the data using StandardScaler to ensure all features are on the same scale.
scaler = StandardScaler()
X_train_math = scaler.fit_transform(X_train_math)
X_test_math = scaler.transform(X_test_math)
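
To make the effect of these two steps concrete, the following is a minimal sketch on a toy table. The values are invented for illustration and are not taken from the student datasets; the column names only mirror the report's naming style. Passing dtype=float to get_dummies is an assumption made here so the encoded frame is purely numeric.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values (not the real student data).
toy = pd.DataFrame({
    "absences": [0, 4, 10, 2],
    "sex": ["F", "M", "F", "M"],
    "study_time": ["<2 hours", "2 to 5 hours", "<2 hours", ">10 hours"],
})

# drop_first=True drops one level per categorical column, so the remaining
# dummy columns are not redundant with each other.
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Fit the scaler on the training rows only, then reuse it on held-out rows.
train, test = encoded.iloc[:3], encoded.iloc[3:]
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

print(list(encoded.columns))
print(train_scaled.mean(axis=0).round(6))  # each training column is centered at 0
```

Fitting the scaler on the training rows and only transforming the test rows keeps information from the test set out of the preprocessing step.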

3. Exploratory Data Analysis (EDA)


4. Model Development
4.1 Model Selection

Three regression models were selected for this analysis:

• Linear Regression: A basic regression model to establish a baseline.
• Random Forest Regressor: An ensemble model to capture complex interactions in the data.
• Gradient Boosting Regressor: A model that builds an additive model in a forward stage-wise
manner to minimize prediction error.

Each model was trained and evaluated on both the Math and Portuguese datasets.

# Function to fit a model on the training set and evaluate it on the test set,
# returning RMSE, MAE, and R² scores.
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)
    return rmse, mae, r2
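
As a usage illustration, the sketch below runs the same evaluation function over the three model types on synthetic regression data. The data come from scikit-learn's make_regression and are an assumption made for this example, so the printed scores say nothing about the student datasets.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit on the training split and return (RMSE, MAE, R²) on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse, mean_absolute_error(y_test, y_pred), r2_score(y_test, y_pred)


# Synthetic stand-in for the student data (not the real dataset).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
results = {
    name: evaluate_model(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, (rmse, mae, r2) in results.items():
    print(f"{name}: RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

Because make_regression produces a linear target, the linear baseline is expected to do well here; the ranking on real data can differ.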
5. Results and Evaluation
5.1 Model Performance Metrics

The models were evaluated using the following metrics:

• RMSE (Root Mean Squared Error): A lower value indicates better performance.
• MAE (Mean Absolute Error): Measures the average magnitude of the errors in
predictions.
• R² Score: The proportion of variance in the dependent variable that is predictable from
the independent variables.
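
The three metrics can be checked by hand on a small example. The sketch below uses four made-up grade values (an assumption for illustration only), computes each metric from its definition, and confirms the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Four made-up actual and predicted grades.
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_pred = np.array([11.0, 10.0, 8.0, 15.0])
errors = y_true - y_pred  # [-1, 2, 0, -1]

rmse = np.sqrt(np.mean(errors ** 2))             # sqrt(6 / 4)
mae = np.mean(np.abs(errors))                    # 4 / 4
ss_res = np.sum(errors ** 2)                     # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 6 / 20

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
# RMSE=1.2247  MAE=1.0000  R2=0.7000
```

RMSE is larger than MAE here because squaring gives the single error of 2 more weight than the two errors of 1.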

Below is a bar chart comparing RMSE, MAE, and R² scores for all models on both datasets:

# Plotting model performance comparison
labels = list(results.keys())
rmse_vals = [res[0] for res in results.values()]
mae_vals = [res[1] for res in results.values()]
r2_vals = [res[2] for res in results.values()]

x = np.arange(len(labels))  # Label locations
width = 0.25  # Width of the bars

fig, ax = plt.subplots(figsize=(14, 7))
rects1 = ax.bar(x - width, rmse_vals, width, label='RMSE')
rects2 = ax.bar(x, mae_vals, width, label='MAE')
rects3 = ax.bar(x + width, r2_vals, width, label='R²')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.legend()
plt.show()

5.2 Feature Importance

We analyzed the importance of individual features using the Random Forest and Gradient
Boosting models. Below are the feature importance plots for the Math and Portuguese datasets:

# evaluate_model fits each model on Math and then on Portuguese, so the models
# are refit on the matching training set before each plot.

# Plot feature importance for Random Forest (Math)
rf.fit(X_train_math, y_train_math)
plot_feature_importance(rf, pd.DataFrame(X_train_math, columns=X_math.columns),
                        "Random Forest Feature Importance (Math)")

# Plot feature importance for Gradient Boosting (Math)
gb.fit(X_train_math, y_train_math)
plot_feature_importance(gb, pd.DataFrame(X_train_math, columns=X_math.columns),
                        "Gradient Boosting Feature Importance (Math)")

# Plot feature importance for Random Forest (Portuguese)
rf.fit(X_train_port, y_train_port)
plot_feature_importance(rf, pd.DataFrame(X_train_port, columns=X_portuguese.columns),
                        "Random Forest Feature Importance (Portuguese)")
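
The report calls plot_feature_importance but does not show its body. The following is a hypothetical reconstruction, assuming the helper reads the fitted model's feature_importances_ attribute and draws a horizontal bar chart; the top_n parameter and the synthetic demonstration data are additions made for this sketch.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def plot_feature_importance(model, X, title, top_n=10):
    """Bar-plot the top_n impurity-based importances of a fitted tree ensemble.

    Hypothetical reconstruction: the report uses a helper with this name
    but its original body is not shown.
    """
    importances = pd.Series(model.feature_importances_, index=X.columns)
    top = importances.sort_values(ascending=False).head(top_n)
    fig, ax = plt.subplots(figsize=(10, 6))
    top.iloc[::-1].plot.barh(ax=ax)  # reversed so the largest bar is drawn on top
    ax.set_xlabel("Importance")
    ax.set_title(title)
    fig.tight_layout()
    return top


# Synthetic demonstration data (not the student dataset).
X_arr, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y)

top_features = plot_feature_importance(rf_demo, X_demo, "Random Forest Feature Importance (demo)")
print(top_features)  # call plt.show() afterwards to display the figure in a script
```

Impurity-based importances sum to 1 across features and tend to favor features with many distinct values, so they are best read as a rough ranking rather than as effect sizes.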

The key features that influence students' performance in both subjects include:

• Absences (absences): The number of times a student was absent.
• Study time (study_time): Weekly study time.
• Failures (class_failures): Number of past class failures.

5.3 Residual Analysis

Residual plots were generated to visualize the difference between actual and predicted values.
These plots help assess how well the models are fitting the data.

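# evaluate_model leaves gb fitted on the Portuguese data, so it is refit on the
# matching training set before each residual plot.

# Plot residuals for Math (Gradient Boosting)
gb.fit(X_train_math, y_train_math)
plot_residuals(gb, X_test_math, y_test_math, "Residuals for Gradient Boosting (Math)")

# Plot residuals for Portuguese (Gradient Boosting)
gb.fit(X_train_port, y_train_port)
plot_residuals(gb, X_test_port, y_test_port, "Residuals for Gradient Boosting (Portuguese)")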

6. Conclusion
6.1 Key Findings

• Random Forest and Gradient Boosting outperformed Linear Regression in terms of RMSE and R²
across both datasets, making them better choices for predicting student performance.
• Absences, study time, and past failures are important predictors for both Math and Portuguese.
6.2 Limitations

• The moderate R² scores indicate that the models leave a substantial portion of the variance in
the final grades unexplained. This suggests that additional features or alternative modeling
approaches could improve predictive power.
• More complex models such as Random Forest and Gradient Boosting can overfit without careful
tuning.
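
One way to look for overfitting is to compare the training score with the held-out score and with cross-validated scores. The sketch below does this for a Random Forest with the same hyperparameters as in the appendix, but on synthetic data (an assumption for illustration), so the numbers do not describe the student models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, noisy regression data (not the student dataset).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=25.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, max_depth=10,
                           min_samples_split=5, random_state=42)
rf.fit(X_train, y_train)

# A large gap between training and test R² is a sign of overfitting.
train_r2 = rf.score(X_train, y_train)
test_r2 = rf.score(X_test, y_test)

# Cross-validation on the training set gives a more stable estimate than a single split.
cv_r2 = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")

print(f"train R2={train_r2:.3f}  test R2={test_r2:.3f}  gap={train_r2 - test_r2:.3f}")
print(f"5-fold CV R2: mean={cv_r2.mean():.3f}  std={cv_r2.std():.3f}")
```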

6.3 Future Improvements

• Incorporating More Data: Gathering additional data related to student behavior, extracurricular
activities, and home environment may help improve predictions.
• Hyperparameter Tuning: Fine-tuning hyperparameters for Random Forest and Gradient Boosting
could further enhance model performance.
• Feature Engineering: Creating interaction terms and exploring non-linear relationships between
features could lead to better model accuracy.
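
As an illustration of the hyperparameter tuning suggested above, the following sketch runs a small grid search over a Gradient Boosting model with scikit-learn's GridSearchCV. The grid values and the synthetic data are assumptions chosen to keep the example fast; they are not tuned recommendations for the student datasets.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the student dataset).
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Small illustrative grid; a real search would usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negation
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("CV RMSE:", -search.best_score_)
print("test R2:", search.best_estimator_.score(X_test, y_test))
```

The test split is used only once, after the search, so the reported test score is not biased by the parameter selection.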

7. References
• Scikit-learn Documentation: https://scikit-learn.org/stable/
• Python Data Analysis Library (Pandas): https://pandas.pydata.org/
• Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC.
• Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics Press.

8. Appendix (Optional)
• Detailed code snippets.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("dillonmyrick/high-school-student-performance-and-demographics")

print("Path to dataset files:", path)

import pandas as pd

# Importing the first dataset
math_df = pd.read_csv(path + "/student_math_clean.csv")

# Importing the second dataset
portuguese_df = pd.read_csv(path + "/student_portuguese_clean.csv")

# Displaying both datasets
print("First dataset:")
print(math_df.head())

print("\nSecond dataset:")
print(portuguese_df.head())

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Drop columns not needed for prediction
drop_columns = ['student_id', 'school', 'grade_1', 'grade_2', 'final_grade']

# Prepare features and target for both datasets
X_math = math_df.drop(drop_columns, axis=1)
y_math = math_df['final_grade']

X_portuguese = portuguese_df.drop(drop_columns, axis=1)
y_portuguese = portuguese_df['final_grade']

# Apply one-hot encoding to categorical features
X_math = pd.get_dummies(X_math, drop_first=True)
X_portuguese = pd.get_dummies(X_portuguese, drop_first=True)

# Split the data into training and testing sets
X_train_math, X_test_math, y_train_math, y_test_math = train_test_split(
    X_math, y_math, test_size=0.2, random_state=42)
X_train_port, X_test_port, y_train_port, y_test_port = train_test_split(
    X_portuguese, y_portuguese, test_size=0.2, random_state=42)

# Standardize the data using StandardScaler to ensure all features are on the same scale.
# The scaler is fit on each training set and then applied to the matching test set.
scaler = StandardScaler()
X_train_math = scaler.fit_transform(X_train_math)
X_test_math = scaler.transform(X_test_math)

X_train_port = scaler.fit_transform(X_train_port)
X_test_port = scaler.transform(X_test_port)

# Initialize models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=5,
random_state=42)
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5,
random_state=42)

# Train and evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Calculate performance metrics on the test set
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)

    return rmse, mae, r2

# Store results for comparison


results = {}

# Linear Regression
results['Linear Regression (Math)'] = evaluate_model(lr, X_train_math,
X_test_math, y_train_math, y_test_math)
results['Linear Regression (Portuguese)'] = evaluate_model(lr, X_train_port,
X_test_port, y_train_port, y_test_port)

# Random Forest
results['Random Forest (Math)'] = evaluate_model(rf, X_train_math, X_test_math,
y_train_math, y_test_math)
results['Random Forest (Portuguese)'] = evaluate_model(rf, X_train_port,
X_test_port, y_train_port, y_test_port)

# Gradient Boosting
results['Gradient Boosting (Math)'] = evaluate_model(gb, X_train_math,
X_test_math, y_train_math, y_test_math)
results['Gradient Boosting (Portuguese)'] = evaluate_model(gb, X_train_port,
X_test_port, y_train_port, y_test_port)

# Plotting model performance comparison


labels = list(results.keys())
rmse_vals = [res[0] for res in results.values()]
mae_vals = [res[1] for res in results.values()]
r2_vals = [res[2] for res in results.values()]

x = np.arange(len(labels)) # Label locations


width = 0.25 # Width of the bars

fig, ax = plt.subplots(figsize=(14, 7))


rects1 = ax.bar(x - width, rmse_vals, width, label='RMSE')
rects2 = ax.bar(x, mae_vals, width, label='MAE')
rects3 = ax.bar(x + width, r2_vals, width, label='R²')

ax.set_xlabel('Models')
ax.set_title('Model Performance Metrics')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.legend()

plt.tight_layout()
plt.show()

print(results)

# Residual plot: difference between actual and predicted values for Gradient Boosting
# (Math and Portuguese)
def plot_residuals(model, X_test, y_test, title):
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred
    plt.figure(figsize=(10, 6))
    plt.scatter(y_pred, residuals, alpha=0.6, color='blue')
    plt.hlines(y=0, xmin=min(y_pred), xmax=max(y_pred), colors='red', linestyles='dashed')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.title(title)
    plt.show()

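# evaluate_model leaves gb fitted on the Portuguese data, so it is refit on the
# matching training set before each residual plot.

# Plot residuals for Math (Gradient Boosting)
gb.fit(X_train_math, y_train_math)
plot_residuals(gb, X_test_math, y_test_math, "Residuals for Gradient Boosting (Math)")

# Plot residuals for Portuguese (Gradient Boosting)
gb.fit(X_train_port, y_train_port)
plot_residuals(gb, X_test_port, y_test_port, "Residuals for Gradient Boosting (Portuguese)")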

# Generate descriptive statistics


math_summary = math_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()
portuguese_summary = portuguese_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()

print(f"math dataset: \n{math_summary}")


print(f"portuguese dataset: \n{portuguese_summary}")


import pandas as pd
import plotly.graph_objects as go

# Map study time ranges to numerical values
study_time_map = {
    '<2 hours': 1.5,
    '2 to 5 hours': 3.5,
    '5 to 10 hours': 7.5,
    '>10 hours': 12
}

# Apply the mapping to the 'study_time' column in both datasets
math_df['study_time'] = math_df['study_time'].map(study_time_map)
portuguese_df['study_time'] = portuguese_df['study_time'].map(study_time_map)

# Generate descriptive statistics


math_summary = math_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()
portuguese_summary = portuguese_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()

fig = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures', 'final_grade'],
                fill_color='lightblue', align='center', font=dict(color='white', size=12)),
    cells=dict(values=[math_summary.index,
                       math_summary['absences'],
                       math_summary['study_time'],
                       math_summary['class_failures'],
                       math_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])

fig1 = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures', 'final_grade'],
                fill_color='lightblue', align='center', font=dict(color='white', size=12)),
    cells=dict(values=[portuguese_summary.index,
                       portuguese_summary['absences'],
                       portuguese_summary['study_time'],
                       portuguese_summary['class_failures'],
                       portuguese_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])

# Display the tables
fig.update_layout(title='Math Summary Statistics')
fig.show()

fig1.update_layout(title='Portuguese Summary Statistics')
fig1.show()

import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of final grades
sns.histplot(math_df['final_grade'], bins=10, kde=True)
plt.title('Distribution of Math Final Grades')
plt.show()

sns.histplot(portuguese_df['final_grade'], bins=10, kde=True)
plt.title('Distribution of Portuguese Final Grades')
plt.show()

# Boxplot of absences (Math)
sns.boxplot(y='absences', data=math_df)
plt.title('Absences Distribution in Math')
plt.show()

# Boxplot of absences (Portuguese)
sns.boxplot(y='absences', data=portuguese_df)
plt.title('Absences Distribution in Portuguese')
plt.show()

• Additional visualizations or exploratory analysis.
