Student Performance Analysis and Prediction
Introduction
Student performance analysis and prediction using datasets has become an essential component of
modern education systems. With the increasing availability of data on student demographics, academic
history, and other relevant factors, schools and universities are using advanced analytics and machine
learning algorithms to gain insights into student performance and predict future outcomes. This approach
helps educators identify areas of improvement, personalize learning experiences, and provide targeted
support to struggling students. Furthermore, student performance analysis and prediction can also aid in
decision-making processes for school administrators and policymakers, helping them allocate resources
more effectively. In this article, we will explore the benefits of using datasets for student performance
analysis and prediction and discuss some of the methods and tools used in this field.
This project examines how student performance (test scores) is affected by variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.
The primary objective of higher education institutions is to impart quality education to their students. To
achieve the highest level of quality in the education system, knowledge must be discovered to predict
student enrollment in specific courses, identify issues with traditional classroom teaching models, detect
unfair means used in online examinations, detect abnormal values in student result sheets, and predict
student performance. This knowledge is hidden within educational datasets and can be extracted through
data mining techniques.
This project focuses on evaluating students’ capabilities in various subjects using a classification task.
Data classification has many approaches, and the decision tree method and probabilistic classification
method are utilized here. By performing this task, knowledge is extracted that describes students’
performance in the end-semester examination. This helps in identifying dropouts and students who require
special attention, enabling teachers to provide appropriate advising and counseling.
Data Collection
Dataset Source – StudentsPerformance.csv. The dataset consists of 8 columns and 1000 rows.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("data/StudentsPerformance.csv")
df.head()
This shows the top 5 records of the dataset and lets us look at the features.
To see the shape of the dataset
df.shape
Dataset Information
Next, we inspect the data. The dataset contains a number of categorical features, so we check for missing values, duplicate rows, data types, and the number of unique values in each column.
To check every column for missing or null values in the dataset:
df.isnull().sum()
There are no missing values in the dataset.
Check Duplicates
To check whether our dataset has any duplicated rows:
df.duplicated().sum()
To check information about the dataset, such as data types and non-null counts, we can use df.info(). To count the unique values in each column:
df.nunique()
Check Statistics of the Data Set
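The statistics discussed below come from pandas' describe method on the numeric columns:

# Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
df.describe()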
Insights
The summary statistics above show that all the means are fairly similar to one another, falling between 66 and 68.05.
The range of all standard deviations, between 14.6 and 15.19, is also narrow.
While there is a minimum score of 0 for math, the minimums for writing and reading are substantially
higher at 10 and 17, respectively.
We don’t have any duplicate or missing values, and the following code provides further checks on the data.
Exploring Data
The code below prints the unique values in the dataset in a readable way.
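A minimal sketch of such a check, printing the unique values of each categorical column:

# Print the unique values of every object (categorical) column
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].unique()}")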
# Define numerical and categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print(f'We have {len(numeric_features)} numerical features: {numeric_features}')
print(f'We have {len(categorical_features)} categorical features: {categorical_features}')
The above code separates the numerical and categorical features and counts each group.
Histogram
Kernel Distribution Function (KDE)
Histogram & KDE
Gender Column
# Create a figure with two subplots
f, ax = plt.subplots(1, 2, figsize=(8, 6))
# Create a countplot of the 'gender' column in the first subplot
sns.countplot(x='gender', data=df, ax=ax[0])
# Show the gender proportions as a pie chart in the second subplot
ax[1].pie(x=df['gender'].value_counts(), labels=df['gender'].value_counts().index, autopct='%1.1f%%')
plt.show()
Gender is fairly balanced: female students number 518 (51.8%) and male students 482 (48.2%).
Race/Ethnicity Column
# Define a color palette for the countplot
# (blue, orange, green, red, purple, respectively)
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
# Create a figure with two subplots
f, ax = plt.subplots(1, 2, figsize=(12, 6))
# Create a countplot of the 'race/ethnicity' column and add labels to the bars
sns.countplot(x='race/ethnicity', data=df, palette=colors, ax=ax[0])
ax[0].tick_params(labelsize=14)
# Create a dictionary that maps category names to colors in the color palette
color_dict = dict(zip(df['race/ethnicity'].unique(), colors))
# Map the colors to the pie chart slices
pie_colors = [color_dict[race] for race in df['race/ethnicity'].value_counts().index]
# Create a pie chart of the 'race/ethnicity' column and add labels to the slices
ax[1].pie(x=df['race/ethnicity'].value_counts(),
          labels=df['race/ethnicity'].value_counts().index,
          autopct='%1.1f%%', colors=pie_colors)
plt.show()
Bivariate Analysis
The scores of students whose parents hold master's or bachelor's level education are higher than those of other students.
# Distribution plots of the three scores (a reconstruction: the original
# snippet was cut off, so histograms with KDE overlays are assumed)
plt.figure(figsize=(18, 6))
for i, col in enumerate(['math score', 'reading score', 'writing score']):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df[col], kde=True)
plt.show()
Insights
From the above three plots, it is clearly visible that most students score between 60 and 80 in maths, whereas in reading and writing most score between 50 and 80.
"Bachelor's Degree", "Master's Degree" color = ['red', 'green', 'blue', 'cyan', 'orange', 'grey']
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%') plt.title('Parental Education', fontsize=20)
plt.axi
# Remove the extra subplot: there are only 5 plots in this figure, arranged in
# a 2x3 grid, so the empty sixth subplot is removed to avoid a blank panel
plt.subplot(2, 3, 6).remove()
# Add a super title
plt.suptitle('Comparison of Student Attributes', fontsize=20, fontweight='bold')
# Adjust the layout and show the plot
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
Insights
From the above plot, it is clear that all the scores increase linearly with each other.
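One way to visualize these pairwise relationships is a seaborn pairplot; a minimal sketch:

# Scatterplot matrix of the three scores, colored by gender
sns.pairplot(df, hue='gender')
plt.show()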
Model Training
Import Data and Required Packages
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings
Separating the dependent variable (y) from the independent variables (X) is one of the most important steps in our project. We use the math score as the dependent variable: roughly 60% to 70% of students in classes 7-10 are afraid of mathematics, which is why I chose the math score as the target.
This can be used to improve the percentage of math scores, raise students' grades, and reduce their fear of math.
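A minimal sketch of this separation; one-hot encoding the categorical predictors is an assumed preprocessing choice:

# Target: math score; predictors: all remaining columns
X = df.drop(columns=['math score'])
y = df['math score']
# One-hot encode the categorical predictors (assumed preprocessing step)
X = pd.get_dummies(X, drop_first=True)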
Next, we split the dataset into training and test sets and check the size of each.
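For example, with an 80/20 split (the exact ratio here is an assumption):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)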
The following function is used to evaluate each model so that we can compare them and build a good one.
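A minimal sketch of such an evaluation helper, assuming scikit-learn's metric functions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(true, predicted):
    # Compute the regression metrics reported below
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted)
    return mae, mse, rmse, r2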
Below is the output before tuning the algorithms’ hyperparameters; it reports the RMSE, MSE, MAE, and R2 score values for the training and test data.
Hyperparameter Tuning
Hyperparameter tuning searches for the parameter values that maximize the model’s predictive accuracy, giving us the most accurate predictions.
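A minimal sketch of a randomized search for one of the models; the search space below is a hypothetical example, not necessarily the one used in the project:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space for a RandomForestRegressor
rf_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(RandomForestRegressor(), rf_params, n_iter=10,
                            cv=3, scoring='r2', n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)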
Outputs
Below is the output after tuning the algorithms’ hyperparameters; it reports the RMSE, MSE, MAE, and R2 score values for the training and test data.
We choose linear regression as the final model because it achieves a training set R2 score of 87.42 and a test set R2 score of 88.03.
Model Selection
This step selects the best model out of all the regression algorithms. Linear regression achieved the highest test R2 score, 88.03, among all the regression models, which is why we choose it.
["R2_Score"],ascending=False)
# Plot actual vs. predicted values with a fitted regression line
sns.regplot(x=y_test, y=y_pred, ci=None, color='red')
plt.show()
Difference Between Actual and Predicted Values
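A minimal sketch of tabulating the differences, assuming y_test and y_pred from the final model:

# Side-by-side comparison of actual and predicted math scores
pred_df = pd.DataFrame({'Actual Value': y_test,
                        'Predicted Value': y_pred,
                        'Difference': y_test - y_pred})
pred_df.head()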
# Load the pickle library
import pickle
# Open a file with write permission and serialize the trained model to model_pkl
with open('model_pkl', 'wb') as files:
    pickle.dump(model, files)
# Load the saved model back from disk
with open('model_pkl', 'rb') as f:
    lr = pickle.load(f)
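The reloaded model can then be used directly for predictions, for example:

# Predict math scores with the deserialized model
lr.predict(X_test)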
Conclusion
This brings us to the end of the student performance prediction project. Let us review our work. First, we
started by defining our problem statement, looking into the algorithms we were going to use and
the regression implementation pipeline. Then we moved on to practically implementing the
identification and regression algorithms like Linear Regression, Lasso, K-Neighbors Regressor,
Decision Tree, Random Forest Regressor, XGBRegressor, CatBoosting Regressor, and AdaBoost
Regressor. Moving forward, we compared the performances of these models. Lastly, we built a
Linear regression model that proved that it works best for student performance prediction
problems.
The key takeaways from this student performance prediction are:
I hope you like my article on “Student performance analysis and prediction.” The entire code can be
found in my GitHub repository. You can connect with me here on LinkedIn.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Sai Battula