Final Project
1. Introduction
This report studies student achievement data collected from two Portuguese high schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS), and focuses on building predictive models for student performance in two subjects: Mathematics and Portuguese language. The purpose of the analysis is to explore the factors that influence students' academic performance and to predict their final grades using machine learning and statistical models based on various demographic, social, and academic features.
The data used in this report consists of two cleaned files, one for the Mathematics course and one for the Portuguese language course.
Both datasets include attributes covering a student's demographics, family background, school environment, and academic history, along with grades for three assessment periods (grade_1, grade_2, and final_grade). There are 382 students who appear in both datasets, although their IDs do not match across the two files.
2. Data Preprocessing
Before training the models, the data was preprocessed: the feature columns were separated from the final_grade target, each dataset was split into training and test sets, and the features were standardized with StandardScaler so that all features are on the same scale (see the Appendix). A sketch of this step is shown below.
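A minimal sketch of this step for the Math dataset, assuming the cleaned dataframe math_df; one-hot encoding of the categorical columns is an assumption here, since the Appendix only shows the scaling step:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Encode categorical columns and separate the features from the target
# (one-hot encoding is an assumption; only scaling appears in the Appendix).
X = pd.get_dummies(math_df.drop(columns=['final_grade']), drop_first=True)
y = math_df['final_grade']

# Hold out a test set, then standardize so all features are on the same scale
X_train_math, X_test_math, y_train_math, y_test_math = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_math = scaler.fit_transform(X_train_math)
X_test_math = scaler.transform(X_test_math)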
3. Models and Evaluation
Three models (Linear Regression, Random Forest, and Gradient Boosting) were trained and evaluated on both the Math and Portuguese datasets. Performance was compared using the following metrics:
• RMSE (Root Mean Squared Error): the square root of the average squared prediction error; a lower value indicates better performance.
• MAE (Mean Absolute Error): the average magnitude of the prediction errors.
• R² Score: the proportion of variance in the dependent variable that is predictable from the independent variables.
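A minimal sketch of how these metrics can be computed with scikit-learn is shown below; the evaluate_model helper used in the Appendix is assumed to follow this pattern and to return a dictionary keyed by 'RMSE', 'MAE', and 'R2':

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Fit on the training split and score on the held-out test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred),
        'R2': r2_score(y_test, y_pred),
    }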
Below is a bar chart comparing RMSE, MAE, and R² scores for all models on both datasets:
4. Feature Importance
We analyzed the importance of individual features using the Random Forest and Gradient Boosting models. Below are the feature importance plots for the Math and Portuguese datasets.
The plots highlight the key features that influence students' performance in both subjects; a sketch of how these importances are extracted is shown below.
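As a sketch, assuming rf is the Random Forest fitted in the Appendix and feature_names is a list of the training-matrix column names in their original order:

import pandas as pd

# Rank features by their importance in the fitted Random Forest;
# feature_names is an assumed list of the training-matrix column names.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))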
5. Residual Analysis
Residual plots were generated to visualize the difference between actual and predicted values. These plots help assess how well the models fit the data (a sketch of the plotting code is included in the Appendix).
6. Conclusion
6.1 Key Findings
• The moderate R² scores indicate that the models explain only part of the variance in the final grades. This suggests that additional features or alternative modeling approaches could improve predictive power.
• More complex models like Random Forest and Gradient Boosting can overfit without careful hyperparameter tuning; a sketch of cross-validated tuning is shown below.
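A minimal sketch of such cross-validated tuning for the Random Forest, assuming the standardized Math training split from the Appendix; the parameter grid is illustrative rather than the grid used in this report:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative grid; the ranges are assumptions, not values taken from this report.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_train_math, y_train_math)
print(search.best_params_, search.best_score_)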
7. References
• Scikit-learn Documentation: https://scikit-learn.org/stable/
• Python Data Analysis Library (Pandas): https://pandas.pydata.org/
• Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC.
• Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics
Press.
8. Appendix
• Detailed code snippets.
import kagglehub
import pandas as pd
# Download the dataset with kagglehub; "path" is the local folder it returns
# path = kagglehub.dataset_download(...)  # dataset handle omitted in the original
# Importing the first dataset
math_df = pd.read_csv(path + "/student_math_clean.csv")
print("First dataset:")
print(math_df.head())
# Importing the second dataset (filename assumed to mirror the math file)
portuguese_df = pd.read_csv(path + "/student_portuguese_clean.csv")
print("\nSecond dataset:")
print(portuguese_df.head())
# Standardize the data using StandardScaler to ensure all features are on the same scale
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_math = scaler.fit_transform(X_train_math)
X_test_math = scaler.transform(X_test_math)
X_train_port = scaler.fit_transform(X_train_port)
X_test_port = scaler.transform(X_test_port)
# Initialize models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=5, random_state=42)
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
# Collect metrics for each model/dataset pair; evaluate_model is assumed to fit
# the model and return its RMSE, MAE, and R² (see the sketch earlier in the report)
results = {}
# Linear Regression
results['Linear Regression (Math)'] = evaluate_model(lr, X_train_math, X_test_math, y_train_math, y_test_math)
results['Linear Regression (Portuguese)'] = evaluate_model(lr, X_train_port, X_test_port, y_train_port, y_test_port)
# Random Forest
results['Random Forest (Math)'] = evaluate_model(rf, X_train_math, X_test_math, y_train_math, y_test_math)
results['Random Forest (Portuguese)'] = evaluate_model(rf, X_train_port, X_test_port, y_train_port, y_test_port)
# Gradient Boosting
results['Gradient Boosting (Math)'] = evaluate_model(gb, X_train_math, X_test_math, y_train_math, y_test_math)
results['Gradient Boosting (Portuguese)'] = evaluate_model(gb, X_train_port, X_test_port, y_train_port, y_test_port)
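# Build the grouped bar chart comparing the metrics across models (a sketch,
# assuming evaluate_model returns a dict with keys 'RMSE', 'MAE', and 'R2')
import numpy as np
import matplotlib.pyplot as plt
labels = list(results.keys())
rmse = [results[k]['RMSE'] for k in labels]
mae = [results[k]['MAE'] for k in labels]
r2 = [results[k]['R2'] for k in labels]
x = np.arange(len(labels))
width = 0.25
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(x - width, rmse, width, label='RMSE')
ax.bar(x, mae, width, label='MAE')
ax.bar(x + width, r2, width, label='R²')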
ax.set_xlabel('Models')
ax.set_title('Model Performance Metrics')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.legend()
plt.tight_layout()
plt.show()
results
# Residual plot: difference between actual and predicted values for Gradient Boosting (Math and Portuguese)
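# A sketch of the residual plots; assuming evaluate_model fits the model in
# place (as in the earlier sketch), gb is refit per subject before predicting.
import matplotlib.pyplot as plt
gb.fit(X_train_math, y_train_math)
residuals_math = y_test_math - gb.predict(X_test_math)
gb.fit(X_train_port, y_train_port)
residuals_port = y_test_port - gb.predict(X_test_port)
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
axes[0].scatter(y_test_math, residuals_math, alpha=0.6)
axes[0].set_title('Math')
axes[1].scatter(y_test_port, residuals_port, alpha=0.6)
axes[1].set_title('Portuguese')
for axis in axes:
    axis.axhline(0, color='red', linestyle='--')
    axis.set_xlabel('Actual final grade')
    axis.set_ylabel('Residual (actual - predicted)')
plt.tight_layout()
plt.show()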
# Summary-statistics tables for key numeric columns (Plotly); math_summary and
# portuguese_summary are assumed to be describe() outputs for these columns.
import pandas as pd
import plotly.graph_objects as go
math_summary = math_df[['absences', 'study_time', 'class_failures', 'final_grade']].describe()
portuguese_summary = portuguese_df[['absences', 'study_time', 'class_failures', 'final_grade']].describe()
fig = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures', 'final_grade'],
                fill_color='lightblue', align='center', font=dict(color='white', size=12)),
    cells=dict(values=[math_summary.index,
                       math_summary['absences'],
                       math_summary['study_time'],
                       math_summary['class_failures'],
                       math_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])
fig.show()
fig1 = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures', 'final_grade'],
                fill_color='lightblue', align='center', font=dict(color='white', size=12)),
    cells=dict(values=[portuguese_summary.index,
                       portuguese_summary['absences'],
                       portuguese_summary['study_time'],
                       portuguese_summary['class_failures'],
                       portuguese_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])
fig1.show()