
Final Project

This report presents a predictive analysis of student performance in Mathematics and Portuguese using two datasets from Portuguese high schools. It details the data preprocessing steps, model development using Linear Regression, Random Forest, and Gradient Boosting, and evaluates their performance based on RMSE, MAE, and R² scores. Key findings indicate that Random Forest and Gradient Boosting models outperform Linear Regression, with absences, study time, and past failures identified as significant predictors of student performance.

Uploaded by

Tommy smith


Data Analytics Final Project Report

Members’ names & contributions

• Somnang Pichmonireach: Exploratory Data Analysis, Model Development, Results and Evaluation, Conclusion
• Houy K: Introduction, Data Preprocessing

1. Introduction
This report builds predictive models of student performance using two datasets, one for
Mathematics and one for Portuguese. The goal is to predict students' final grades.

The data come from two Portuguese high schools: Mousinho da Silveira (MS) and Gabriel Pereira (GP).
The purpose of the analysis is to explore the factors that influence students' academic
performance in Mathematics and Portuguese language, and to predict final grades with machine
learning and statistical models based on demographic, social, and academic features.

The data used for this report consist of two cleaned files:

1. student_math_clean.csv: Contains data on student performance in Mathematics.

2. student_portuguese_clean.csv: Contains data on student performance in Portuguese.

Both datasets include attributes describing each student's demographics, family background,
school environment, and academic history, along with grades for three assessment periods
(grade_1, grade_2, and final_grade). 382 students appear in both datasets, although their IDs
do not match across the two files.

2. Data Preprocessing
Before training the models, the data was preprocessed as follows:

• Dropped Unnecessary Columns: Columns not used for prediction (student_id, school, grade_1,
and grade_2) were dropped.
• One-Hot Encoding: Categorical features were converted to numerical values using one-hot
encoding.
• Feature Scaling: StandardScaler was used to standardize the data so that all
features were on the same scale.

# Apply one-hot encoding to categorical features (after dropping unnecessary columns).
X_math = pd.get_dummies(X_math, drop_first=True)
X_portuguese = pd.get_dummies(X_portuguese, drop_first=True)

# Standardize the data using StandardScaler so all features are on the same scale.
scaler = StandardScaler()
X_train_math = scaler.fit_transform(X_train_math)
X_test_math = scaler.transform(X_test_math)
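
The preprocessing steps above can be tried end to end on a tiny table. The following is a minimal sketch, not the project's actual pipeline: the toy values and the `sex` column are made up for illustration, and `dtype=float` is added to `get_dummies` only so the encoded columns are plainly numeric. The point it shows is that the scaler learns its mean and standard deviation from the training rows and reuses them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from the real datasets)
toy = pd.DataFrame({
    "absences": [0, 4, 10, 2, 6, 8],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "final_grade": [15, 12, 8, 14, 10, 9],
})

# One-hot encode the categorical column; drop_first avoids a redundant dummy
X = pd.get_dummies(toy.drop(columns=["final_grade"]), drop_first=True, dtype=float)
y = toy["final_grade"]

# Hold out two rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training statistics

print(X.columns.tolist())
print(np.round(X_train_scaled.mean(axis=0), 6))  # training columns are centered at 0
```

Fitting the scaler on the training split only keeps information about the test rows out of the model inputs.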

3. Exploratory Data Analysis (EDA)

4. Model Development
4.1 Model Selection

Three regression models were selected for this analysis:

• Linear Regression: A basic regression model to establish a baseline.

• Random Forest Regressor: An ensemble model to capture complex interactions in the
data.
• Gradient Boosting Regressor: A model that builds an additive model in a forward
stage-wise manner to minimize prediction error.

Each model was trained and evaluated on both the Math and Portuguese datasets.

# Fit a model on the training set and return RMSE, MAE, and R² on the test set.
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)
    return rmse, mae, r2

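
As an illustration of how this function can be applied to the three models, here is a self-contained sketch on synthetic data. The data come from scikit-learn's `make_regression`, not from the student datasets, and the default model settings are an assumption made for brevity, so the printed numbers say nothing about the report's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit on the training split and return (RMSE, MAE, R²) on the test split."""
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)
    return rmse, mae, r2


# Synthetic regression problem standing in for the real data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {
    name: evaluate_model(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, (rmse, mae, r2) in results.items():
    print(f"{name}: RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

Because each entry in `models` is a separate estimator object, every fitted model stays available after evaluation for later plots.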
5. Results and Evaluation
5.1 Model Performance Metrics

The models were evaluated using the following metrics:

• RMSE (Root Mean Squared Error): A lower value indicates better performance.
• MAE (Mean Absolute Error): Measures the average magnitude of the errors in
predictions.
• R² Score: The proportion of variance in the dependent variable that is predictable from
the independent variables.
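
To make the three metrics concrete, the sketch below computes them by hand on four made-up grades and checks the results against scikit-learn. The numbers are illustrative only and do not come from the datasets.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted final grades
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

errors = y_true - y_pred                        # [-1, 0, 1, -2]

rmse = np.sqrt(np.mean(errors ** 2))            # sqrt(6 / 4)
mae = np.mean(np.abs(errors))                   # 4 / 4
ss_res = np.sum(errors ** 2)                    # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                        # 1 - 6 / 20

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
# RMSE=1.2247  MAE=1.0000  R2=0.7000

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

RMSE is never smaller than MAE because squaring gives large errors more weight. Here the single error of 2 points pulls RMSE above MAE.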

Below is a bar chart comparing RMSE, MAE, and R² scores for all models on both datasets:

# Plotting model performance comparison
labels = list(results.keys())
rmse_vals = [res[0] for res in results.values()]
mae_vals = [res[1] for res in results.values()]
r2_vals = [res[2] for res in results.values()]

x = np.arange(len(labels))  # Label locations
width = 0.25  # Width of the bars

fig, ax = plt.subplots(figsize=(14, 7))
rects1 = ax.bar(x - width, rmse_vals, width, label='RMSE')
rects2 = ax.bar(x, mae_vals, width, label='MAE')
rects3 = ax.bar(x + width, r2_vals, width, label='R²')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.legend()
plt.show()

5.2 Feature Importance

We analyzed the importance of individual features using the Random Forest and Gradient
Boosting models. Below are the feature importance plots for the Math and Portuguese datasets:

# rf and gb were last fitted on the Portuguese data during evaluation,
# so each model is refitted on the relevant training set before plotting.

# Plot feature importance for Random Forest (Math)
rf.fit(X_train_math, y_train_math)
plot_feature_importance(rf, pd.DataFrame(X_train_math, columns=X_math.columns),
                        "Random Forest Feature Importance (Math)")

# Plot feature importance for Gradient Boosting (Math)
gb.fit(X_train_math, y_train_math)
plot_feature_importance(gb, pd.DataFrame(X_train_math, columns=X_math.columns),
                        "Gradient Boosting Feature Importance (Math)")

# Plot feature importance for Random Forest (Portuguese)
rf.fit(X_train_port, y_train_port)
plot_feature_importance(rf, pd.DataFrame(X_train_port, columns=X_portuguese.columns),
                        "Random Forest Feature Importance (Portuguese)")
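
The report does not include the definition of plot_feature_importance. The sketch below is one plausible version, written as an assumption and not as the authors' code: it reads the `feature_importances_` attribute that fitted scikit-learn tree ensembles expose and draws a horizontal bar chart. The `top_n` parameter, the returned Series, and the synthetic demo data are additions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def plot_feature_importance(model, X, title, top_n=10):
    """Hypothetical helper: bar chart of a fitted tree ensemble's importances.

    X must be a DataFrame whose columns match the features the model was fitted on.
    Returns the plotted importances, largest first.
    """
    importances = pd.Series(model.feature_importances_, index=X.columns)
    top = importances.sort_values(ascending=False).head(top_n)

    fig, ax = plt.subplots(figsize=(10, 6))
    top.sort_values().plot.barh(ax=ax)  # ascending order puts the largest bar on top
    ax.set_xlabel("Importance")
    ax.set_title(title)
    fig.tight_layout()
    return top


# Demo on synthetic data with made-up feature names
X_arr, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y)
top = plot_feature_importance(rf_demo, X_demo, "Random Forest Feature Importance (demo)")
print(top)
```

Call plt.show() afterwards to display the figure. Impurity-based importances of this kind sum to 1 across all features and describe the fitted model, so they should be read from a model trained on the dataset being discussed.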

The key features that influence students' performance in both subjects include:

• Absences (absences): The number of times a student was absent.
• Study time (study_time): Weekly study time.
• Failures (class_failures): Number of past class failures.

5.3 Residual Analysis

Residual plots were generated to visualize the difference between actual and predicted values.
These plots help assess how well the models are fitting the data.

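# Plot residuals for Math (Gradient Boosting).
# gb was last fitted on the Portuguese data, so refit it on Math first.
gb.fit(X_train_math, y_train_math)
plot_residuals(gb, X_test_math, y_test_math, "Residuals for Gradient Boosting (Math)")

# Plot residuals for Portuguese (Gradient Boosting)
gb.fit(X_train_port, y_train_port)
plot_residuals(gb, X_test_port, y_test_port, "Residuals for Gradient Boosting (Portuguese)")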

6. Conclusion
6.1 Key Findings

• Random Forest and Gradient Boosting outperformed Linear Regression in terms of
RMSE and R² across both datasets, making them better choices for predicting student
performance.
• Features such as absences, studytime, and failures are critical predictors for both Math
and Portuguese.

6.2 Limitations

• The moderate R² scores indicate that the models leave a substantial portion of the
variance in the final grades unexplained. This suggests that additional features or
alternative modeling approaches could improve predictive power.
• More complex models such as Random Forest and Gradient Boosting can overfit
without careful tuning.
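
One common way to check for the overfitting mentioned above is to compare training and test scores and to add cross-validation. The sketch below is an assumption about how such a check could look, run on synthetic data and not on the student datasets. It reuses the Random Forest settings from the appendix.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, noisy regression data standing in for the real features and grades
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, max_depth=10,
                           min_samples_split=5, random_state=42)
rf.fit(X_train, y_train)

# A training score far above the test score is a typical sign of overfitting
train_r2 = r2_score(y_train, rf.predict(X_train))
test_r2 = r2_score(y_test, rf.predict(X_test))

# 5-fold cross-validation on the training split gives a less optimistic estimate
cv_r2 = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")

print(f"Train R2: {train_r2:.3f}")
print(f"Test R2:  {test_r2:.3f}")
print(f"CV R2:    {cv_r2.mean():.3f} (std {cv_r2.std():.3f})")
```

If the gap between the training and cross-validated scores is large, reducing max_depth or raising min_samples_split are typical first adjustments.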

6.3 Future Improvements

• Incorporating More Data: Gathering additional data related to student behavior,
extracurricular activities, and home environment may help improve predictions.
• Hyperparameter Tuning: Fine-tuning hyperparameters for Random Forest and Gradient
Boosting could further enhance model performance.
• Feature Engineering: Creating interaction terms and exploring non-linear relationships
between features could lead to better model accuracy.
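
The hyperparameter tuning suggested above could be done with scikit-learn's GridSearchCV. The following is a minimal sketch under stated assumptions: the grid values are arbitrary starting points chosen for illustration, and the data are synthetic, not the student data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the scaled training features and grades
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid only; sensible ranges depend on the dataset
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated RMSE: {-search.best_score_:.2f}")
print(f"Test R2 of the refitted best model: {search.score(X_test, y_test) * -1:.2f} (RMSE)")
```

Because `scoring` is a negated RMSE, `search.score` also returns a negated RMSE, which the last line flips back to a positive value. The search touches only the training split, so the test split stays available for a final, unbiased check.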

7. References
• Scikit-learn Documentation: https://scikit-learn.org/stable/
• Python Data Analysis Library (Pandas): https://pandas.pydata.org/
• Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC.
• Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics
Press.

8. Appendix (Optional)
• Detailed code snippets.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("dillonmyrick/high-school-student-performance-and-demographics")

print("Path to dataset files:", path)

import pandas as pd

# Import the first dataset
math_df = pd.read_csv(path + "/student_math_clean.csv")

# Import the second dataset
portuguese_df = pd.read_csv(path + "/student_portuguese_clean.csv")

# Display both datasets
print("First dataset:")
print(math_df.head())

print("\nSecond dataset:")
print(portuguese_df.head())

import matplotlib.pyplot as plt

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Columns excluded from the features: identifiers, the earlier period grades,
# and the target itself (final_grade)
drop_columns = ['student_id', 'school', 'grade_1', 'grade_2', 'final_grade']

# Prepare features and target for both datasets
X_math = math_df.drop(drop_columns, axis=1)
y_math = math_df['final_grade']

X_portuguese = portuguese_df.drop(drop_columns, axis=1)
y_portuguese = portuguese_df['final_grade']

# Apply one-hot encoding to categorical features
X_math = pd.get_dummies(X_math, drop_first=True)
X_portuguese = pd.get_dummies(X_portuguese, drop_first=True)

# Split the data into training and testing sets
X_train_math, X_test_math, y_train_math, y_test_math = train_test_split(
    X_math, y_math, test_size=0.2, random_state=42)
X_train_port, X_test_port, y_train_port, y_test_port = train_test_split(
    X_portuguese, y_portuguese, test_size=0.2, random_state=42)

# Standardize the data using StandardScaler so all features are on the same scale.
# Each dataset gets its own scaler, fitted on its training split only.
scaler_math = StandardScaler()
X_train_math = scaler_math.fit_transform(X_train_math)
X_test_math = scaler_math.transform(X_test_math)

scaler_port = StandardScaler()
X_train_port = scaler_port.fit_transform(X_train_port)
X_test_port = scaler_port.transform(X_test_port)

# Initialize models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=5,
random_state=42)
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5,
random_state=42)

# Train and evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)

    # Calculate performance metrics on the test set
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae = mean_absolute_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)

    return rmse, mae, r2

# Store results for comparison

results = {}

# Linear Regression
results['Linear Regression (Math)'] = evaluate_model(lr, X_train_math,
X_test_math, y_train_math, y_test_math)
results['Linear Regression (Portuguese)'] = evaluate_model(lr, X_train_port,
X_test_port, y_train_port, y_test_port)

# Random Forest
results['Random Forest (Math)'] = evaluate_model(rf, X_train_math, X_test_math,
y_train_math, y_test_math)
results['Random Forest (Portuguese)'] = evaluate_model(rf, X_train_port,
X_test_port, y_train_port, y_test_port)

# Gradient Boosting
results['Gradient Boosting (Math)'] = evaluate_model(gb, X_train_math,
X_test_math, y_train_math, y_test_math)
results['Gradient Boosting (Portuguese)'] = evaluate_model(gb, X_train_port,
X_test_port, y_train_port, y_test_port)

# Plotting model performance comparison

labels = list(results.keys())
rmse_vals = [res[0] for res in results.values()]
mae_vals = [res[1] for res in results.values()]
r2_vals = [res[2] for res in results.values()]

x = np.arange(len(labels)) # Label locations

width = 0.25 # Width of the bars

fig, ax = plt.subplots(figsize=(14, 7))

rects1 = ax.bar(x - width, rmse_vals, width, label='RMSE')
rects2 = ax.bar(x, mae_vals, width, label='MAE')
rects3 = ax.bar(x + width, r2_vals, width, label='R²')

ax.set_xlabel('Models')
ax.set_title('Model Performance Metrics')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.legend()

plt.tight_layout()
plt.show()

results

# Residual plot: difference between actual and predicted values for
# Gradient Boosting (Math and Portuguese)
def plot_residuals(model, X_test, y_test, title):
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred
    plt.figure(figsize=(10, 6))
    plt.scatter(y_pred, residuals, alpha=0.6, color='blue')
    plt.hlines(y=0, xmin=min(y_pred), xmax=max(y_pred), colors='red',
               linestyles='dashed')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.title(title)
    plt.show()

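# Plot residuals for Math (Gradient Boosting).
# gb was last fitted on the Portuguese data, so refit it on Math first.
gb.fit(X_train_math, y_train_math)
plot_residuals(gb, X_test_math, y_test_math, "Residuals for Gradient Boosting (Math)")

# Plot residuals for Portuguese (Gradient Boosting)
gb.fit(X_train_port, y_train_port)
plot_residuals(gb, X_test_port, y_test_port, "Residuals for Gradient Boosting (Portuguese)")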

# Generate descriptive statistics

math_summary = math_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()
portuguese_summary = portuguese_df[['absences', 'study_time', 'class_failures',
'final_grade']].describe()

print(f"math dataset: \n{math_summary}")

print(f"portuguese dataset: \n{portuguese_summary}")

import matplotlib.pyplot as plt

import seaborn as sns

# Histogram for Final Grades

sns.histplot(math_df['final_grade'], bins=10, kde=True)
plt.title('Distribution of Math Final Grades')
plt.show()

sns.histplot(portuguese_df['final_grade'], bins=10, kde=True)

plt.title('Distribution of Portuguese Final Grades')
plt.show()

# Boxplot of absences (Math)
sns.boxplot(y='absences', data=math_df)
plt.title('Absences Distribution in Math')
plt.show()

# Boxplot of absences (Portuguese)
sns.boxplot(y='absences', data=portuguese_df)
plt.title('Absences Distribution in Portuguese')
plt.show()

import pandas as pd
import plotly.graph_objects as go

# Map study time ranges to numerical values

study_time_map = {
'<2 hours': 1.5,
'2 to 5 hours': 3.5,
'5 to 10 hours': 7.5,
'>10 hours': 12
}

# Apply the mapping to the 'study_time' column in both datasets

math_df['study_time'] = math_df['study_time'].map(study_time_map)
portuguese_df['study_time'] = portuguese_df['study_time'].map(study_time_map)

# Regenerate descriptive statistics now that study_time is numeric;
# describe() skips non-numeric columns, so summaries computed before the
# mapping would not contain study_time.
summary_columns = ['absences', 'study_time', 'class_failures', 'final_grade']
math_summary = math_df[summary_columns].describe()
portuguese_summary = portuguese_df[summary_columns].describe()

fig = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures',
                        'final_grade'],
                fill_color='lightblue', align='center',
                font=dict(color='white', size=12)),
    cells=dict(values=[math_summary.index,
                       math_summary['absences'],
                       math_summary['study_time'],
                       math_summary['class_failures'],
                       math_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])

fig1 = go.Figure(data=[go.Table(
    header=dict(values=['Statistic', 'absences', 'study_time', 'class_failures',
                        'final_grade'],
                fill_color='lightblue', align='center',
                font=dict(color='white', size=12)),
    cells=dict(values=[portuguese_summary.index,
                       portuguese_summary['absences'],
                       portuguese_summary['study_time'],
                       portuguese_summary['class_failures'],
                       portuguese_summary['final_grade']],
               fill_color='lightgrey', align='center'))
])

# Display the table

fig.update_layout(title='Math Summary Statistics')
fig.show()

fig1.update_layout(title='Portuguese Summary Statistics')

fig1.show()


• Additional visualizations or exploratory analysis.
