Student Grade Prediction
Student Grade Prediction
In higher education institutions (HEI), every institution has its student academic
management system to record all student data containing information about
student academic results in final examination marks and grades in different
courses and programs. All student marks and grades have been recorded and used
to generate a student academic performance report to evaluate the course
achievement every semester. The data keep in the repository can be used to
discover insightful information related to student academic performance.
Solomon et al. indicated that determining student academic performance is a
crucial challenge in HEI. Due to this, many previous researchers have well-
defined the influence factors that can highly affect student academic
performance .However, most common factors are relying on socio-economic
background, demographics and learning activities compared to final student
grades in the final examination. As for this reason, we observe that the trend of
predicting student grades can be one of the solutions that are applicable to
improve student academic performance.
Predictive analytics has shown the successful benefit in the HEI. It can be a
potential approach to benefit the competitive educational domain to find hidden
patterns and make predictions trends in a vast database. It has been used to solve
several educational areas that include student performance, dropout prediction,
academic early warning systems, and course selection. Moreover, the application
of predictive analytics in predicting student academic performance has increased
over the years.
The ability to predict student grade is one of the important area that can help to
improve student academic performance. Many previous research has found
variant machine learning techniques performed in predicting student academic
performance. However, the related works on mechanism to improve imbalanced
multi-classification problem in predicting students’ grade prediction are difficult
to found. Therefore, in this study, a comparative analysis has been done to find
the best prediction model for student grade prediction by addressing the following
questions:
• RQ1:
1
• RQ2:
• Our comparative analysis showed that the ratio between the minority class
in imbalanced dataset does not necessarily to approach same ratio of
majority class to obtain better performance in student grade prediction.
2
I.Data Cleaning
Table 1: Meta-Data
Number of non-
Attribute
Null Records
IDNO 203
Year 200
Attendance
73
%
M/F 200
CGPA 73
Mid
200
Semester
Mid Sem
200
Grade
Mid Sem
200
Collection
Quiz 1 (30) 202
Quiz 2 (30) 199
Part A (40) 202
Part B (40) 202
Grade 203
3
Table 2: Mean and Standard Deviation of each attribute considered
Mean Std
Mid
19.04 8.70
Semester
Quiz 1 12.95 6.00
Quiz 2 11.32 5.56
Part A 16.02 6.71
Part B 17.30 7.75
CGPA 8.30 1.17
Year 2.49 0.53
Attendance 4.50 0.74
Grade 6.72 2.06
Corr(X, Y ) = E
A positive correlation implies that both X and Y increase and decrease
together. We do note that correlation does not necessarily imply
causation. However, it could provide useful insights into the behaviour
of various variables. These are observed in tables 3-5.
4
attendance is at least five times as slow as variance in the grades. We also
see, Mid Sem Grades and final Grade have a much higher correlation.
Mid Sem
Grades 1.00 0.71
Grade 1.00
Mid
Semester 1.00 0.96 * 0.64 0.35 0.52 0.53 0.75 *
Mid Sem
Grades 1.00 0.61 0.37 0.5 0.53 0.72
Quiz 1
1.00 0.47 0.62 0.58 0.80 *
5
Quiz 2 1.00 0.59 0.51 0.63
Part A
1.00 0.66 0.81 *
Part B
1.00 0.76 *
Grade 1.00
Mid Sem
Collection
6
For these tests we ignore the class withdrawn (W ) because most of the data
is missing for such records.
Since the dataset is very small (197 records for scores, after removal of
Naive Bayes, SVM ( with linear, RBF and sigmoid kernels), and K-
7
EXPERIMENT 2.
Predicting final grade using only additional information.
We observe the prediction capability of the classifiers using only Year,
Attendance, CGPA & MidSem Collection to identify how well do these
attributes distinguish.
8
Figure 3: Comparison of prediction accuracy using all the
information. Here, the K-NN classifier performs similar to the SVM
with linear kernel.
From the observations in the three tests above, we find that the input
space is “almost” linearly separable when the test scores are chosen as
the input (linear SVM outperforms all other methods). When additional
information such as year and attendance is included, we find that the
performance of several classifiers reduces. Hence considering only the
test scores can aid in predicting the final grade of the student to a good
extent.
9
Figure 5: PCA with 2 components
10
Figure 8: PCA with 5 components
11
ALGORITHMS
By using machine learning algorithms, we can predict how well the students are
going to perform so that we can help the students whose grades are predicted low.
Student Grades Prediction is based on the problem of regression in machine
learning. In the section below, I will take you through the task of Student Grades
prediction with machine learning using Python.
➢ Logistic Regression
A popular statistical technique to predict binomial outcomes (y = 0 or 1) is
Logistic Regression. Logistic regression predicts categorical outcomes (binomial
/ multinomial values of y). The predictions of Logistic Regression (henceforth,
LogR in this article) are in the form of probabilities of an event occurring, i.e. the
probability of y=1, given certain values of input variables x. Thus, the results of
LogR range between 0-1.
LogR models the data points using the standard logistic function, which is
an S-shaped curve also called as sigmoid curve and is given by the equation:
12
Fig 3.1: logistic regression
➢ Random Forest
Random forest is a supervised learning algorithm which is used for both
classification as well as regression .But however ,it is mainly used for
classification problems .As we know that a forest is made up of trees and more
trees means more robust forest .
Similarly ,random forest creates decision trees on data sample and then
getsthe predictionfrom each of them and finally selects the best solution by mea
ns of voting .It isensemble method which is better than a single decision tree
because it reduces the over-fitting by averaging the result .
13
Working of Random Forest with the help of following steps:
•First ,start with the selection of random samples from a given dataset.
•Next ,this algorithm will construct a decision tree for every sample .Then it will
get the prediction result from every decision tree .
•In this step, voting will be performed for every predicted result.
•At last ,select the most voted prediction results as the final prediction result.
The following diagram will illustrates its working-
➢ FEASIBILITY STUDY
A Feasibility Study is a preliminary study undertaken before the
real work of a project starts to ascertain the likely hood of the projects
success. It is an analysis of possible alternative solutions to a problem
and a recommendation on the best alternative.
• Economic Feasibility:
It is defined as the process of assessing the benefits and costs
associated with the development of project. A proposed system, which
14
is both operationally and technically feasible, must be a good
investment for the organization. With the proposed system the users are
greatly benefited as the users can be able to detect the fake news from
the real news and are aware of most real and most fake news published
in the recent years. This proposed system does not need any additional
software and high system configuration. Hence the proposed system is
economically feasible.
• Technical Feasibility:
The technical feasibility infers whether the proposed system can
be developed considering the technical issues like availability of the
necessary technology, technical capacity, adequate response and
extensibility. The project is decided to build using Python. Jupyter Note
Book is designed for use in distributed environment of the internet and
forthe professional programmer it is easy to learn and use effectively.
As the developingorganization has all the resources available to build
the system therefore the proposed system is technically feasible.
• Operational Feasibility:
Operational feasibility is defined as the process of assessing the
degree to which a proposed system solves business problems or takes
advantage of business opportunities. The system is self-explanatory
and doesn’t need any extra sophisticated training. The system has built-
in methods and classes which are required to produce the result. The
application can be handled very easily with a novice user. The overall
time that a user needs to get trained is 14less than one hour. As the
software that is used for developing this application is very economical
and is readily available in the market. Therefore the proposed system is
operationally feasible.
15
EFFORT, DURATION AND COST ESTIMATION
USING COCOMO MODEL
The Cocomo (Constructive Cost Model) model is the most
complete and thoroughly documented model used in effort estimation.
The model provides detailed formulas for determining the development
time schedule, overall development effort, and effort breakdown by
phase and activity as well as maintenance effort.
16
For basic COCOMO Effort = a*(KLOC) b
Type =c*(effort)d
Type of product A B C D
➢ Product Attributes
•Required reliability (RELY):It is used to express an effect of
software faults ranging from slight inconvenience (VL) to loss of life
(VH). The nominal value(NM) denotes moderate recoverable losses.
17
•Data bytes per DSI (DATA):The lower rating comes with lower
size of a database. Complexity (CPLX): The attribute expresses code
complexity again ranging from straight batch code (VL) to real time
code with multiple resources scheduling (XH)
➢ Computer Attributes
•Execution time (TIME) and memory (STOR) constraints:
This attribute identifies the percentage of computer resources used by
the system. NM states that less than50% is used; 95% is indicated by
XH.
➢ Personal Attributes:
•Analyst capability(ACAP) and programmer capability (PCAP):
•This describes skills of the developing team. The higher the skills, the
higher the rating.
18
➢ Project Attributes:
VL LO NM HI VH XH
19
Our project is an organic system and for intermediate
COCOMO Effort = a * (KLOC) b *EAF
KLOC = 115
For organic system
a = 2.4
b = 1.02
EAF = product of cost
Driver’s effort=2.4*(0.115)^1.02*1.30
= 1.034
Programmer months Time for development = C * (Effort) d
= 2.5 * (1.034)^0.38
= 2.71 months
Cost of programmer = Effort * cost of programmer per month
= 1.034 * 20000
= 20680
Project cost = 20000 +20680
= 40680
SOFTWARE REQUIREMENTS SPECIFICATION
20
a.Purpose
The purpose of software requirements specification specifies the
intentions and intended audience of the SRS.
b. Scope
The scope of the SRS identifies the software product to be
produced, the capabilities, application, relevant objects etc. We are
proposed to implement Passive Aggressive Algorithm which takes the
test and trained data set from the
c. Definitions, Acronyms and Abbreviations Software
Requirements Specification
It’s a description of a particular software product, program or set
of programs that performs a set of function in target environment.
d. References
IEEE Std. 830-1993, IEEE Recommended Practice for Software
Requirements Specifications thy Sierra and Bert Bates.
e. Overview
The SRS contains the details of process, DFD’s, functions of the
product, user characteristics. The non-functional requirements if any
are also specified.
f. Overall description
The main functions associated with the product are described in
this section of SRS. The characteristics of a user of this product are
indicated. The assumptions in this section result from interaction with
the project stakeholders.
REQUIREMENT ANALYSIS
Software Requirement Specification (SRS) is the starting point of
the software developing activity. As system grew more complex it
became evident that the goal of the entire system cannot be easily
comprehended. Hence the need for the requirement phase arose. The
software project is initiated by the client needs. The SRS is the means
of translating the ideas of the minds of clients (the input) into a formal
document (the output of the requirement phase.) Under requirement
specification, the focus is on specifying what
21
hasbeen found giving analysis such as representation, specification la
nguages and tools, andchecking the specifications are addressed during
this activity. The Requirement phase terminates with the production of
the validate SRS document.
Producing the SRS document is the basic goal of this phase. The
purpose of the Software Requirement Specification is to reduce the
communication gap between the clients and the developers. Software
Requirement Specification is the medium though which the client and
user needs are accurately specified. It forms the basis of software
development. A good SRS should satisfy all the parties involved in the
system.
➢ Product Perspective:
The application is developed in such a way that any future
enhancement can be easily implementable. The project is developed in
such a way that it requires minimal maintenance. The software used are
open source and easy to install. The application developed should be
easy to install and use. This is an independent application which can be
easily run on to any system which has Python installed and Jupiter
Notebook.
➢ Product Features:
The application is developed in a way that ‘Heart disease’ accuracy
is predicted using Random Forest. The data set is taken from
https://www.datacamp.com/community/tutorials/scikit-learn-credit-
card.We can compare the accuracy for the implemented algorithms.
User characteristics Application is developed in such a way that its
users are v Easy to use v Error free 20 v Minimal training or no training
v Patient regular monitor Assumption & Dependencies It is considered
that the dataset taken fulfils all the requirements.
➢ Domain Requirements:
This document is the only one that describes the requirements of the
system. It is meant for the use by the developers, and will also be the
bases for validating the final Heart disease system. Any changes made
22
to the requirements in the future will have to go through a formal
change approval process. User Requirements User can decide on the
prediction accuracy to decide on which algorithm can be used in real-
time predictions. Non Functional Requirements ÿ Dataset collected
should be in the CSV format ÿ .The column values should be
numerical values ÿ Training set and test set are stored as CSV files ÿ
Error rates can be calculated for prediction algorithms product.
➢ Requirements Efficiency:
Less time for predicting the Heart Disease Reliability: Maturity,
fault tolerance and recoverability. Portability: can the software easily
be transferred to another environment, including install ability.
➢ Usability:
How easy it is to understand, learn and operate the software
system Organizational Requirements: Do not block the some available
ports through the windows firewall. Internet connection should be
available Implementation Requirements The dataset collection, internet
connection to install related libraries. Engineering Standard
Requirements User Interfaces User interface is developed in python,
which gets input such stock symbol.
➢ Hardware Interfaces:
Ethernet on the AS/400 supports TCP/IP, Advanced Peer-to-Peer
Networking (APPN)and advanced program-to-program
communications (APPC). ISDN To connect AS/400 to an Integrated
Services Digital Network (ISDN) for faster, more accurate data
transmission. An ISDN is a public or private digital communications
network that can support data, fax, image, and other services over the
same physical interface. We can use other protocols on ISDN, such as
IDLC and X.25. Software Interfaces Anaconda Navigator and Jupiter
Notebook are used.
23
➢ Operational Requirements:
a)Economic: The developed product is economic as it is not required
any hardware interface etc.
Environmental Statements of fact and assumptions that define the
expectations of the system in terms of mission objectives, environment,
constraints, and measures of effectiveness and suitability (MOE/MOS).
The customers are those that perform the eight primary functions of
systems engineering, with special emphasis on the operator as the key
customer.
b)Health and Safety: The software may be safety-critical. If so, there
are issues associated with its integrity level. The software may not be
safety-critical although it forms part of a safety-critical system.
• For example, software may simply log transactions. If a system must
be of a high integrity level and if the software is shown to be of that
integrity level, then the hardware must be at least of the same integrity
level.
• There is little point in producing 'perfect' code in some language if
hardware and system software (in widest sense) are not reliable. If a
computer system is to run software of a high integrity level then that
system should not at the same time accommodate software of a lower
integrity level.
•Systems with different requirements for safety levels must be
separated. Otherwise, the highest level of integrity required must be
applied to all systems in the same environment.
SYSTEM REQUIREMENTS
➢ Hardware Requirements
Processor : above 500 MHz
Ram : 4 GB
Hard Disk : 4 GB
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.
24
➢ Software Requirements
Operating System : Windows 7 or higher
Programming : Python 3.6 and related libraries
Software :
Anaconda Navigator and Jupyter Notebook
SOFTWARE DESCRIPTION
➢ Python
Python is an interpreted high-level programming language for
general-purpose programming. Created by Guido van Rossum and first
released in 1991,Python has a
design philosophy that emphasizes code readability, notably using sini
ficant whitespace. It provides constructs that enable clear programmin
g on both small and large scales. Python features a dynamic type
system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative,
functional and procedural, and has a large and comprehensive standard
library. Python interpreters are available for many operating systems.
C Python, the reference implementation of Python, is open source
software and has a community-based development model, as do nearly
all of its variant implementations. C Python is managed by the non-
profit Python Software Foundation.
➢ Pandas
Pandas is an open-source Python Library providing high-
performance data manipulation and analysis tool using its powerful
data structures. The name Pandas is derived from the word Panel
Data – an Econometrics from Multidimensional data. In 2008,
developer Wes McKinney started developing pandas when in need of
high performance, flexible tool for analysis of data. Prior to Pandas,
Python was majorly used for data mining and preparation. It had very
little contribution towards data analysis. Pandas solved this problem.
25
Using Pandas, we can accomplish five typical steps in the processing
and analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyse. Python with
26
➢ Sckit-Learn
• Simple and efficient tools for data mining and data analysis\
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable - BSD license
➢ Matplotlib
•Matplotlib is a python library used to create 2D graphs and plots by
using python scripts.
•It has a module named pyplot which makes things easy for plotting by
providing feature to control line styles, font properties, formatting axes
etc.
•It supports a very wide variety of graphs and plots namely - histogram,
bar charts, power spectra, error charts etc.
➢ Jupyter Notebook
•The Jupyter Notebook is an incredibly powerful tool for interactively
developing and presenting data science projects.
•A notebook integrates code and its output into a single document that
combines visualizations, narrative text, mathematical equations, and
other rich media.
•The Jupyter Notebook is an open-source web application that allows
you to create and share documents that contain live code, equations,
visualizations and narrative text.
SYSTEMDESIGN
➢ SYSTEM ARCHITECTURE
The below figure shows the process flow diagram or proposed work.
First we collected the C level and Heart Disease Database from UCI
website then pre-processed the dataset and select 16 important features.
28
Logistic algorithm individually and compute the accuracy. Finally, we
used proposed Ensemble Voting method and compute best method for
diagnosis of heart disease.
➢ MODULES
The entire work of this project is divided into 4 modules.They are:
a. Data Pre-processing
b. Feature
c. Classification
d. Prediction
a. Data Pre-processing:
This file contains all the pre-processing functions needed to process all
input documents and texts. First we read the train, test and validation
data files then performed some preprocessing like tokenizing,
stemming etc. There are some exploratory data analysis is performed
like response variable distribution and data quality checks like null or
missing values etc.
b. Feature:
Extraction In this file we have performed feature extraction and
selection methods from sci -kit learn python libraries. For feature
selection, we have used methods like simple bag-of-words and n-grams
and then term frequency like tf-tdf weighting. We have also used
word2vec and POS tagging to extract the features, though POS tagging
and word2vec has not been used at this point in the project.
c. Classification:
Here we have built all the classifiers for the breast cancer diseases
detection. The extracted features are fed into different classifiers. We
have used Naive-bayes, Logistic Regression, Linear SVM, Stochastic
gradient decent and Random forest classifiers from sklearn. Each of the
extracted features was used in all of the classifiers. Once fitting the
model, we compared the f1 score and checked the confusion matrix.
29
After fitting all the classifiers, 2 best performing models were
selected as candidate models for heart diseases classification. We have
performed parameter tuning by implementing Grid Search CV methods
on these candidate models and chosen best performing parameters for
these classifier.
30
Fig : Data Flow diagram Level 0
31
IMPLEMENTATION
➢ SOURCE CODE:
I hope you now have understood why we need to predict the grades of a student.
Now let’s see how we can use machine learning for the task of student grades
prediction using Python. I will start this task by importing the necessary Python
libraries and the dataset:
Dataset
In [1]:
Import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
data = pd.read_csv("student-mat.csv")
data.head()
32
Out [1]:
The dataset that I am using for the task of students grade prediction is based on
the achievements of the students of the Portuguese schools. In this dataset the G1
represents the grades of the first period, G2 represents the grades of the second
period, and G3 represents the final grades. Now let’s prepare the data and let’s
see how we can predict the final grades of the students:
In [2]:
In the above code, I first selected the necessary columns that we need to train a
machine learning model for the task of student grades prediction. Then I declared
that the G3 column is our target label and then I split the dataset into 20% testing
and 80% training. Now let’s see how to train a linear regression model for the
task of student grades prediction:
In [3]:
linear_regression = LinearRegression()
linear_regression.fit(xtrain, ytrain)
accuracy = linear_regression.score(xtest, ytest)
print(accuracy)
33
Out [3]:
0.8432876775479776
The linear regression model gave an accuracy of about 84% which is not bad in
this task. Now let’s have a look at the predictions made by the students’ grade
prediction model:
In [4]:
Predictions=linear_regression.predict(xtest)
for i in range(len(predictions)):
print(predictions[x], xtest[x], [ytest[x]])
Out [4]:
[[16.16395534 14.23423176 14.08532841 5.28096434 14.23423176]
[16.16395534 16.16395534 14.08532841 5.28096434 7.97291422]
[14.52779998 11.92149651 14.08532841 9.13993948 4.71694746]
...
[ 4.71694746 11.92149651 3.9451298 9.13993948 9.13993948]
[12.56424351 4.92497623 3.9451298 5.28096434 5.28096434]
[11.92149651 9.05247158 3.9451298 5.28096434 16.16395534]] [[[15 16 2 0 2]
[15 14 2 0 2]
[15 14 3 0 6]
[ 7 6 2 0 10]
[15 14 2 0 2]]....
34
CONCLUSION
Predicting student grades is one of the key performance indicators that can help
educators monitor their academic performance. Therefore, it is important to have
a predictive model that can reduce the level of uncertainty in the outcome for an
imbalanced dataset. This paper proposes a multiclass prediction model with six
predictive models to predict final student’s grades based on the previous student
final examination result of the first-semester course. Specifically, we have done
a comparative analysis of combining oversampling SMOTE with different FS
methods to evaluate the performance accuracy of student grade prediction. We
also have shown that the explored oversampling SMOTE is overall improved
consistently than using FS alone with all predictive models. However, our
proposed multiclass prediction model performed more effectively than using
oversampling SMOTE and FS alone with some parameter settings that can
influence the performance accuracy of all predictive models. Here, our findings
contribute to be a practical approach for addressing the imbalanced multi-
classification based on the data-level solution for student grade prediction.
35