0% found this document useful (0 votes)
16 views35 pages

Student Grade Prediction

The document discusses the use of predictive analytics in higher education institutions to improve student academic performance by predicting final course grades using machine learning algorithms. It highlights the challenges of imbalanced multi-classification datasets and presents a comparative analysis of various algorithms, including Logistic Regression and Random Forest, to identify the most accurate model for grade prediction. Additionally, it addresses data cleaning, correlation analysis, and feasibility studies related to the implementation of the predictive models.

Uploaded by

Sai Vetcha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views35 pages

Student Grade Prediction

The document discusses the use of predictive analytics in higher education institutions to improve student academic performance by predicting final course grades using machine learning algorithms. It highlights the challenges of imbalanced multi-classification datasets and presents a comparative analysis of various algorithms, including Logistic Regression and Random Forest, to identify the most accurate model for grade prediction. Additionally, it addresses data cleaning, correlation analysis, and feasibility studies related to the implementation of the predictive models.

Uploaded by

Sai Vetcha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 35

Introduction

In higher education institutions (HEI), every institution has its student academic
management system to record all student data containing information about
student academic results in final examination marks and grades in different
courses and programs. All student marks and grades have been recorded and used
to generate a student academic performance report to evaluate the course
achievement every semester. The data keep in the repository can be used to
discover insightful information related to student academic performance.
Solomon et al. indicated that determining student academic performance is a
crucial challenge in HEI. Due to this, many previous researchers have well-
defined the influence factors that can highly affect student academic
performance .However, most common factors are relying on socio-economic
background, demographics and learning activities compared to final student
grades in the final examination. As for this reason, we observe that the trend of
predicting student grades can be one of the solutions that are applicable to
improve student academic performance.

Predictive analytics has shown the successful benefit in the HEI. It can be a
potential approach to benefit the competitive educational domain to find hidden
patterns and make predictions trends in a vast database. It has been used to solve
several educational areas that include student performance, dropout prediction,
academic early warning systems, and course selection. Moreover, the application
of predictive analytics in predicting student academic performance has increased
over the years.

The ability to predict student grade is one of the important area that can help to
improve student academic performance. Many previous research has found
variant machine learning techniques performed in predicting student academic
performance. However, the related works on mechanism to improve imbalanced
multi-classification problem in predicting students’ grade prediction are difficult
to found. Therefore, in this study, a comparative analysis has been done to find
the best prediction model for student grade prediction by addressing the following
questions:

• RQ1:

Which predictive model among the selected machine learning


algorithms performs high accuracy performance to predict student’s
final course grades?

1
• RQ2:

How imbalanced multi-classification dataset can be addressed with


selected machine learning algorithms using oversampling Synthetic
Minority Oversampling Technique (SMOTE) and feature selection (FS)
methods?

To address the above-mentioned questions, we collect the student final course


grades from two core courses in the first semester of the final examination result.
We present a descriptive analysis of student datasets to visualize student grade
trends, which can lead to strategic planning in decision making for the lecturers
to help students more effectively. Then, we conduct comparative analysis using
six well-known machine learning algorithms, including LR, NB, J48, SVM, kNN
and RF on the real student data of Diploma in Information Technology (Digital
Technology) at one of Malaysia Polytechnic. As for addressing the imbalanced
multi-classification, we endeavor to enhance the performance of each predictive
model with data-level solutions using oversampling SMOTE and FS. The novel
contribution of this paper are summarized as follows:

• We proposed combination of modification on oversampling SMOTE and


two feature selection algorithms to automatically determine the sampling
ratio with best selected features to improve imbalanced multi-classification
for student grade prediction.

• Our comparative analysis showed that the ratio between the minority class
in imbalanced dataset does not necessarily to approach same ratio of
majority class to obtain better performance in student grade prediction.

• Our proposed model shows different impact in improving the performance


of student grade prediction model based on the versatility of two feature
selection algorithm after implementing SMOTE.

2
I.Data Cleaning

We model the grade prediction problem as a classification problem. We


find that the data contains a significant amount of NULL values for
each attribute. Table 1 summarizes the number of Non-null records for
each attribute. We consider only non-null values for our analysis. We
also remove the records containing a withdrawn (W ) grade since they
do not influence the statistics of the course. See the github link for
details about each attribute. We consider the 9 attributes for our
analysis as given in Table 2.

Table 1: Meta-Data
Number of non-
Attribute
Null Records
IDNO 203
Year 200
Attendance
73
%
M/F 200
CGPA 73
Mid
200
Semester
Mid Sem
200
Grade
Mid Sem
200
Collection
Quiz 1 (30) 202
Quiz 2 (30) 199
Part A (40) 202
Part B (40) 202
Grade 203

3
Table 2: Mean and Standard Deviation of each attribute considered

Mean Std
Mid
19.04 8.70
Semester
Quiz 1 12.95 6.00
Quiz 2 11.32 5.56
Part A 16.02 6.71
Part B 17.30 7.75
CGPA 8.30 1.17
Year 2.49 0.53
Attendance 4.50 0.74
Grade 6.72 2.06

II.Elementary Analysis using correlation

The correlation between two random variables is defined as:

Corr(X, Y ) = E
A positive correlation implies that both X and Y increase and decrease
together. We do note that correlation does not necessarily imply
causation. However, it could provide useful insights into the behaviour
of various variables. These are observed in tables 3-5.

Table 3: Correlation among attendance and grades. As we see,


attendance has very less correlation with Mid Sem Grades and final
Grade. One plausible reason could be due to the bucketing of the
Attendance attribute into 5 levels, due to which the variance in

4
attendance is at least five times as slow as variance in the grades. We also
see, Mid Sem Grades and final Grade have a much higher correlation.

Attendance Mid Grade


Sem
Grades

Attendance 1.00 0.03 0.16

Mid Sem
Grades 1.00 0.71

Grade 1.00

Table 4: Correlation among various scores. The top 5 correlations


have been marked in bold with an asterisk (*).

Mid Quiz Quiz Part Part Grade


Mid Sem 1 2 A B
Semester Grades

Mid
Semester 1.00 0.96 * 0.64 0.35 0.52 0.53 0.75 *

Mid Sem
Grades 1.00 0.61 0.37 0.5 0.53 0.72

Quiz 1
1.00 0.47 0.62 0.58 0.80 *

5
Quiz 2 1.00 0.59 0.51 0.63

Part A
1.00 0.66 0.81 *

Part B
1.00 0.76 *

Grade 1.00

Table 5: Correlation among Mid Sem Collection and performance. A


weak negative correlation is seen between the Mid Sem Collection and
the various scores. This suggests that students who have scored well, had
shown lesser tardiness in collecting their mid sem answer scripts.

Mid Quiz 1 Quiz 2 Part A Part B Grade


Sem
Mid
Sem
Grade
-0.21 -0.2 -0.19 -0.29 -0.29 -0.21 -0.28

Mid Sem
Collection

III .Grade prediction performance of various classifiers

We model the prediction of final Grade as a classification problem. Each


grade is assumed to be a class. Hence, we have 9 classes (A through E , and
NC ), since those denote the students who took the course till completion.

6
For these tests we ignore the class withdrawn (W ) because most of the data
is missing for such records.

Since the dataset is very small (197 records for scores, after removal of

W as mentioned above), we perform a stratified 5-fold cross-

validation for each of the classifiers to observe and compare their

stability. Hence we obtain a 80:20 train-test split. The stratification

aids in handling the class imbalance problem. We use Decision Tree,

Naive Bayes, SVM ( with linear, RBF and sigmoid kernels), and K-

Nearest Neighbor ( k = 1 to 20). In this section, the following

experiments are described. We list the scores in Appendix A .

EXPERIMENT 1. Predicting final grade using only test scores.


We use five attributes - Mid sem , Quiz 1, Quiz 2, Part A, Part B to predict the
final Grade .

Figure 1: Comparison of prediction accuracy using only test scores.


Note that an SVM with a linear kernel obtains the highest accuracy (mean:
0.88, std: 0.08). This indicates that the input data is “almost” linearly
separable.

7
EXPERIMENT 2.
Predicting final grade using only additional information.
We observe the prediction capability of the classifiers using only Year,
Attendance, CGPA & MidSem Collection to identify how well do these
attributes distinguish.

Figure 2: Comparison of prediction accuracy using only additional


information. Here too, a linear SVM achieves a good performance
(mean: 0.56, std:0.21). However, all the classifiers are not as stable in
this data (standard deviation quite high). Also, K-NN performs similar
to other classifiers

EXPERIMENT 3. Predicting final grade using both the kinds of attributes.


We combine the attributes in 1 and 2 and compare the results.

8
Figure 3: Comparison of prediction accuracy using all the
information. Here, the K-NN classifier performs similar to the SVM
with linear kernel.
From the observations in the three tests above, we find that the input
space is “almost” linearly separable when the test scores are chosen as
the input (linear SVM outperforms all other methods). When additional
information such as year and attendance is included, we find that the
performance of several classifiers reduces. Hence considering only the
test scores can aid in predicting the final grade of the student to a good
extent.

IV. Performance with Principal Components

Since we observe good classification with the attributes in section III


experiment 1, we ask what could happen if PCA was applied on those
dimensions. Thus we run Experiment 1, with PCA applied on the five
input dimensions. We consider the top 1 through 5 principal
components for this purpose.

The classification performance of various classifiers is shown in


Figures 4-8. We find that all classifiers perform worse when the data is
projected along its principal components. This is because, PCA does
not take class labels into account (it is class-agnostic). Therefore, PCA
is not a good choice of dimensionality reduction in the current
classification problem.

Figure 4: PCA with 1 component

9
Figure 5: PCA with 2 components

Figure 6: PCA with 3 components

Figure 7: PCA with 4 components

10
Figure 8: PCA with 5 components

11
ALGORITHMS
By using machine learning algorithms, we can predict how well the students are
going to perform so that we can help the students whose grades are predicted low.
Student Grades Prediction is based on the problem of regression in machine
learning. In the section below, I will take you through the task of Student Grades
prediction with machine learning using Python.

➢ Logistic Regression
A popular statistical technique to predict binomial outcomes (y = 0 or 1) is
Logistic Regression. Logistic regression predicts categorical outcomes (binomial
/ multinomial values of y). The predictions of Logistic Regression (henceforth,
LogR in this article) are in the form of probabilities of an event occurring, i.e. the
probability of y=1, given certain values of input variables x. Thus, the results of
LogR range between 0-1.

LogR models the data points using the standard logistic function, which is
an S-shaped curve also called as sigmoid curve and is given by the equation:

Logistic Regression Assumptions:


•Logistic regression requires the dependent variable to be binary.
•For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
•Only the meaningful variables should be included.
•The independent variables should be independent of each other·
•Logistic regression requires quite large sample sizes.
•Even though, logistic (logit) regression is frequently used for binary variables (2
classes), itcan be used for categorical dependent variables with more than
2 classes.
•In this case it’s called Multinomial Logistic Regression.

12
Fig 3.1: logistic regression

➢ Random Forest
Random forest is a supervised learning algorithm which is used for both
classification as well as regression .But however ,it is mainly used for
classification problems .As we know that a forest is made up of trees and more
trees means more robust forest .
Similarly ,random forest creates decision trees on data sample and then
getsthe predictionfrom each of them and finally selects the best solution by mea
ns of voting .It isensemble method which is better than a single decision tree
because it reduces the over-fitting by averaging the result .

13
Working of Random Forest with the help of following steps:
•First ,start with the selection of random samples from a given dataset.
•Next ,this algorithm will construct a decision tree for every sample .Then it will
get the prediction result from every decision tree .
•In this step, voting will be performed for every predicted result.
•At last ,select the most voted prediction results as the final prediction result.
The following diagram will illustrates its working-

Fig 3.2: Random Forest

➢ FEASIBILITY STUDY
A Feasibility Study is a preliminary study undertaken before the
real work of a project starts to ascertain the likely hood of the projects
success. It is an analysis of possible alternative solutions to a problem
and a recommendation on the best alternative.

• Economic Feasibility:
It is defined as the process of assessing the benefits and costs
associated with the development of project. A proposed system, which

14
is both operationally and technically feasible, must be a good
investment for the organization. With the proposed system the users are
greatly benefited as the users can be able to detect the fake news from
the real news and are aware of most real and most fake news published
in the recent years. This proposed system does not need any additional
software and high system configuration. Hence the proposed system is
economically feasible.
• Technical Feasibility:
The technical feasibility infers whether the proposed system can
be developed considering the technical issues like availability of the
necessary technology, technical capacity, adequate response and
extensibility. The project is decided to build using Python. Jupyter Note
Book is designed for use in distributed environment of the internet and
forthe professional programmer it is easy to learn and use effectively.
As the developingorganization has all the resources available to build
the system therefore the proposed system is technically feasible.

• Operational Feasibility:
Operational feasibility is defined as the process of assessing the
degree to which a proposed system solves business problems or takes
advantage of business opportunities. The system is self-explanatory
and doesn’t need any extra sophisticated training. The system has built-
in methods and classes which are required to produce the result. The
application can be handled very easily with a novice user. The overall
time that a user needs to get trained is 14less than one hour. As the
software that is used for developing this application is very economical
and is readily available in the market. Therefore the proposed system is
operationally feasible.

15
EFFORT, DURATION AND COST ESTIMATION
USING COCOMO MODEL
The Cocomo (Constructive Cost Model) model is the most
complete and thoroughly documented model used in effort estimation.
The model provides detailed formulas for determining the development
time schedule, overall development effort, and effort breakdown by
phase and activity as well as maintenance effort.

COCOMO estimates the effort in person months of direct labour.


The primary effort factor is the number of source lines of code (SLOC)
expressed in thousands of delivered source instructions (KDSI).The
model is developed in three versions of different level of detail basic,
intermediate, and detailed. The overall modelling process takes into
account three classes of systems.

1.Embedded: This class of system is characterized by tight constraints,


changing environment, and unfamiliar surroundings. Projects of the
embedded type are model to the company and usually exhibit temporal
constraints.

2.Organic: This category encompasses all systems that are small


relative to project size and team size, and have a stable environment,
familiar surroundings and relaxed interfaces. These are simple business
systems, data processing systems, and small software libraries.

3.Semidetached: The software systems falling under this category are


a mix of those of organic and embedded in nature.

Some examples of software of this class are operating systems,


database management system, and inventory management systems.

16
For basic COCOMO Effort = a*(KLOC) b

Type =c*(effort)d

For Intermediate and Detailed COCOMO Effort = a * (KLOC) b*


EAF (EAF = product of cost drivers)

Type of product A B C D

Organic 2.4 1.02 2.5 0.38

Semi Detached 3.0 1.12 2.5 0.35

Embedded 3.6 1.20 2.5 0.32

Table : Organic, Semidetached and Embedded system values

Intermediate COCOMO model is a refinement of the basic model,


which comes in the function of 15 attributes of the product. For each of
the attributes the user of the model has to provide a rating using the
following six point scale.

VL(Very Low) HI(High)


LO (Low) VH (Very High)
NM(Nominal) XH (Extra High)
The list of attributes is composed of several features of the software
and includes product, computer, personal and project attributes as
follows.

➢ Product Attributes
•Required reliability (RELY):It is used to express an effect of
software faults ranging from slight inconvenience (VL) to loss of life
(VH). The nominal value(NM) denotes moderate recoverable losses.

17
•Data bytes per DSI (DATA):The lower rating comes with lower
size of a database. Complexity (CPLX): The attribute expresses code
complexity again ranging from straight batch code (VL) to real time
code with multiple resources scheduling (XH)

➢ Computer Attributes
•Execution time (TIME) and memory (STOR) constraints:
This attribute identifies the percentage of computer resources used by
the system. NM states that less than50% is used; 95% is indicated by
XH.

•Virtual machine volatility (VIRT):It is used to indicate the


frequency of changes made to the hardware, operating system, and
overall software environment. More frequent and significant changes
are indicated by higher ratings.

•Development turnaround time (TURN):This is a time from when a


job is submitted until output becomes received. LO indicated a highly
interactive environment, VH quantifies a situation when this time is
longer than 12 hours.

➢ Personal Attributes:
•Analyst capability(ACAP) and programmer capability (PCAP):

•This describes skills of the developing team. The higher the skills, the
higher the rating.

•Application experience (AEXP), language experience (LEXP), and


virtual machine experience (VEXP):

•These are used to quantify the number of experience in each area by


the development team; more experience, higher rating.

18
➢ Project Attributes:

•Modern development practices (MODP):deals with the amount of


use of modern software practices such as structural programming and
object oriented approach.

•Use of software tools (TOOL):is used to measure a level of


sophistication of automated tools used in software development and a
degree of integration among the tools being used. Higher rating
describes levels in both aspects.

•Schedule effects (SCED):concerns the amount of schedule


compression (HI or VH),or schedule expansion (LO or VL) of the
development schedule in comparison to a nominal (NM) schedule

VL LO NM HI VH XH

RELY 0.75 0.88 1.00 1.15 1.40


DATA 0.94 1.00 1.08 1.16
CPLX 0.70 0.85 1.00 1.15 1.30 1.65
TIME 1.00 1.11 1.30 1.66
STOR 1.00 1.06 1.21 1.56
VIRT 0.87 1.00 1.15 1.30
TURN 0.87 1.00 1.15 1.30
ACAP 1.46 1.19 1.00 0.86 0.71
AEXP 1.29 1.13 1.00 0.91 0.82
PCAP 1.42 1.17 1.00 0.86 0.70
LEXP 1.14 1.07 1.00 0.95
VEXP 1.21 1.10 1.00 0.90
MODP 1.24 1.10 1.00 0.91 0.82
TOOL 1.24 1.10 1.00 0.91 0.83
SCED 1.23 1.08 1.00 1.04 1.10

Table 3.2:Project Attributes

19
Our project is an organic system and for intermediate
COCOMO Effort = a * (KLOC) b *EAF
KLOC = 115
For organic system
a = 2.4
b = 1.02
EAF = product of cost
Driver’s effort=2.4*(0.115)^1.02*1.30
= 1.034
Programmer months Time for development = C * (Effort) d
= 2.5 * (1.034)^0.38
= 2.71 months
Cost of programmer = Effort * cost of programmer per month
= 1.034 * 20000
= 20680
Project cost = 20000 +20680
= 40680
SOFTWARE REQUIREMENTS SPECIFICATION

INTRODUCTION TO REQUIREMENT SPECIFICATION:


Software Engineering by James F Peters & WitoldPedrycz Head
First Java by Ka. A Software Requirements Specification (SRS) is a
description of particular software product, program or set of
programs that performs a set of functions in a target environment
(IEEE Std. 830-1993).

20
a.Purpose
The purpose of software requirements specification specifies the
intentions and intended audience of the SRS.
b. Scope
The scope of the SRS identifies the software product to be
produced, the capabilities, application, relevant objects etc. We are
proposed to implement Passive Aggressive Algorithm which takes the
test and trained data set from the
c. Definitions, Acronyms and Abbreviations Software
Requirements Specification
It’s a description of a particular software product, program or set
of programs that performs a set of function in target environment.
d. References
IEEE Std. 830-1993, IEEE Recommended Practice for Software
Requirements Specifications thy Sierra and Bert Bates.
e. Overview
The SRS contains the details of process, DFD’s, functions of the
product, user characteristics. The non-functional requirements if any
are also specified.
f. Overall description
The main functions associated with the product are described in
this section of SRS. The characteristics of a user of this product are
indicated. The assumptions in this section result from interaction with
the project stakeholders.

REQUIREMENT ANALYSIS
Software Requirement Specification (SRS) is the starting point of
the software developing activity. As system grew more complex it
became evident that the goal of the entire system cannot be easily
comprehended. Hence the need for the requirement phase arose. The
software project is initiated by the client needs. The SRS is the means
of translating the ideas of the minds of clients (the input) into a formal
document (the output of the requirement phase.) Under requirement
specification, the focus is on specifying what

21
hasbeen found giving analysis such as representation, specification la
nguages and tools, andchecking the specifications are addressed during
this activity. The Requirement phase terminates with the production of
the validate SRS document.
Producing the SRS document is the basic goal of this phase. The
purpose of the Software Requirement Specification is to reduce the
communication gap between the clients and the developers. Software
Requirement Specification is the medium though which the client and
user needs are accurately specified. It forms the basis of software
development. A good SRS should satisfy all the parties involved in the
system.
➢ Product Perspective:
The application is developed in such a way that any future
enhancement can be easily implementable. The project is developed in
such a way that it requires minimal maintenance. The software used are
open source and easy to install. The application developed should be
easy to install and use. This is an independent application which can be
easily run on to any system which has Python installed and Jupiter
Notebook.
➢ Product Features:
The application is developed in a way that ‘Heart disease’ accuracy
is predicted using Random Forest. The data set is taken from
https://www.datacamp.com/community/tutorials/scikit-learn-credit-
card.We can compare the accuracy for the implemented algorithms.
User characteristics Application is developed in such a way that its
users are v Easy to use v Error free 20 v Minimal training or no training
v Patient regular monitor Assumption & Dependencies It is considered
that the dataset taken fulfils all the requirements.
➢ Domain Requirements:
This document is the only one that describes the requirements of the
system. It is meant for the use by the developers, and will also be the
bases for validating the final Heart disease system. Any changes made

22
to the requirements in the future will have to go through a formal
change approval process. User Requirements User can decide on the
prediction accuracy to decide on which algorithm can be used in real-
time predictions. Non Functional Requirements ÿ Dataset collected
should be in the CSV format ÿ .The column values should be
numerical values ÿ Training set and test set are stored as CSV files ÿ
Error rates can be calculated for prediction algorithms product.

➢ Requirements Efficiency:
Less time for predicting the Heart Disease Reliability: Maturity,
fault tolerance and recoverability. Portability: can the software easily
be transferred to another environment, including install ability.

➢ Usability:
How easy it is to understand, learn and operate the software
system Organizational Requirements: Do not block the some available
ports through the windows firewall. Internet connection should be
available Implementation Requirements The dataset collection, internet
connection to install related libraries. Engineering Standard
Requirements User Interfaces User interface is developed in python,
which gets input such stock symbol.

➢ Hardware Interfaces:
Ethernet on the AS/400 supports TCP/IP, Advanced Peer-to-Peer
Networking (APPN)and advanced program-to-program
communications (APPC). ISDN To connect AS/400 to an Integrated
Services Digital Network (ISDN) for faster, more accurate data
transmission. An ISDN is a public or private digital communications
network that can support data, fax, image, and other services over the
same physical interface. We can use other protocols on ISDN, such as
IDLC and X.25. Software Interfaces Anaconda Navigator and Jupiter
Notebook are used.

23
➢ Operational Requirements:
a)Economic: The developed product is economic as it is not required
any hardware interface etc.
Environmental Statements of fact and assumptions that define the
expectations of the system in terms of mission objectives, environment,
constraints, and measures of effectiveness and suitability (MOE/MOS).
The customers are those that perform the eight primary functions of
systems engineering, with special emphasis on the operator as the key
customer.
b)Health and Safety: The software may be safety-critical. If so, there
are issues associated with its integrity level. The software may not be
safety-critical although it forms part of a safety-critical system.
• For example, software may simply log transactions. If a system must
be of a high integrity level and if the software is shown to be of that
integrity level, then the hardware must be at least of the same integrity
level.
• There is little point in producing 'perfect' code in some language if
hardware and system software (in widest sense) are not reliable. If a
computer system is to run software of a high integrity level then that
system should not at the same time accommodate software of a lower
integrity level.
•Systems with different requirements for safety levels must be
separated. Otherwise, the highest level of integrity required must be
applied to all systems in the same environment.

SYSTEM REQUIREMENTS
➢ Hardware Requirements
Processor : above 500 MHz
Ram : 4 GB
Hard Disk : 4 GB
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.

24
➢ Software Requirements
Operating System : Windows 7 or higher
Programming : Python 3.6 and related libraries
Software :
Anaconda Navigator and Jupyter Notebook

SOFTWARE DESCRIPTION
➢ Python
Python is an interpreted high-level programming language for
general-purpose programming. Created by Guido van Rossum and first
released in 1991,Python has a
design philosophy that emphasizes code readability, notably using sini
ficant whitespace. It provides constructs that enable clear programmin
g on both small and large scales. Python features a dynamic type
system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative,
functional and procedural, and has a large and comprehensive standard
library. Python interpreters are available for many operating systems.
C Python, the reference implementation of Python, is open source
software and has a community-based development model, as do nearly
all of its variant implementations. C Python is managed by the non-
profit Python Software Foundation.
➢ Pandas
Pandas is an open-source Python Library providing high-
performance data manipulation and analysis tool using its powerful
data structures. The name Pandas is derived from the word Panel
Data – an Econometrics from Multidimensional data. In 2008,
developer Wes McKinney started developing pandas when in need of
high performance, flexible tool for analysis of data. Prior to Pandas,
Python was majorly used for data mining and preparation. It had very
little contribution towards data analysis. Pandas solved this problem.

25
Using Pandas, we can accomplish five typical steps in the processing
and analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyse. Python with

Pandas is used in a wide range of fields including academic and


commercial domains including finance, economics, Statistics,
analytics, etc.
Key Features of Pandas:
• Fast and efficient Data Frame object with default and customized
indexing.
• Tools for loading data into in-memory data objects from different file
formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of date sets.
• Label-based slicing, indexing and sub setting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
➢ NumPy
NumPy is a general-purpose array
processing package. It provides a high- performance multidimensional
array object, and tools for working with these arrays. It is the
fundamental package for scientific computing with Python.
It contains various features including these important ones:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number
capabilities 24
• Besides its obvious scientific uses, NumPy can also be used as an
efficient multi-dimensional container of generic data. Arbitrary data-
types can be defined using Numpy which allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.

26
➢ Sckit-Learn
• Simple and efficient tools for data mining and data analysis\
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable - BSD license
➢ Matplotlib
•Matplotlib is a python library used to create 2D graphs and plots by
using python scripts.
•It has a module named pyplot which makes things easy for plotting by
providing feature to control line styles, font properties, formatting axes
etc.
•It supports a very wide variety of graphs and plots namely - histogram,
bar charts, power spectra, error charts etc.
➢ Jupyter Notebook
•The Jupyter Notebook is an incredibly powerful tool for interactively
developing and presenting data science projects.
•A notebook integrates code and its output into a single document that
combines visualizations, narrative text, mathematical equations, and
other rich media.
•The Jupyter Notebook is an open-source web application that allows
you to create and share documents that contain live code, equations,
visualizations and narrative text.

Fig 4.1 : Jupyter Notebook


27
•Uses include: data cleaning and transformation, numerical simulation,
statistical modelling, data visualization, machine learning, and much
more.
•The Notebook has support for over 40 programming languages,
including Python, R, Julia, and Scala.
• Notebooks can be shared with others using email, Drop box, Git
Hub and the Jupyter Notebook.
•Your code can produce rich, interactive output: HTML, images,
videos, LATEX, and custom MIME types.
•Leverage big data tools, such as Apache Spark, from Python, R and
Scala. Explore that same data with pandas, scikit-learn, ggplot2, Tensor
Flow.

SYSTEMDESIGN
➢ SYSTEM ARCHITECTURE
The below figure shows the process flow diagram or proposed work.
First we collected the C level and Heart Disease Database from UCI
website then pre-processed the dataset and select 16 important features.

Fig 5.1: System Architecture

For feature selection we used Recursive feature Elimination Algorithm


using Chi2 method and get 16 top features. After that applied ANN and

28
Logistic algorithm individually and compute the accuracy. Finally, we
used proposed Ensemble Voting method and compute best method for
diagnosis of heart disease.

➢ MODULES
The entire work of this project is divided into 4 modules.They are:

a. Data Pre-processing
b. Feature
c. Classification
d. Prediction
a. Data Pre-processing:
This file contains all the pre-processing functions needed to process all
input documents and texts. First we read the train, test and validation
data files then performed some preprocessing like tokenizing,
stemming etc. There are some exploratory data analysis is performed
like response variable distribution and data quality checks like null or
missing values etc.
b. Feature:
Extraction In this file we have performed feature extraction and
selection methods from sci -kit learn python libraries. For feature
selection, we have used methods like simple bag-of-words and n-grams
and then term frequency like tf-tdf weighting. We have also used
word2vec and POS tagging to extract the features, though POS tagging
and word2vec has not been used at this point in the project.
c. Classification:
Here we have built all the classifiers for the breast cancer diseases
detection. The extracted features are fed into different classifiers. We
have used Naive-bayes, Logistic Regression, Linear SVM, Stochastic
gradient decent and Random forest classifiers from sklearn. Each of the
extracted features was used in all of the classifiers. Once fitting the
model, we compared the f1 score and checked the confusion matrix.

29
After fitting all the classifiers, 2 best performing models were
selected as candidate models for heart diseases classification. We have
performed parameter tuning by implementing Grid Search CV methods
on these candidate models and chosen best performing parameters for
these classifier.

Finally selected model was used for heart disease detection


with the probability of truth. In Addition to this, we have also extracted
the top 50 features from our term-frequency tfidfVectorizer to see what
words are most and important in each of the classes.
We have also used Precision-Recall and learning curves to see how
training and test set performs when we increase the amount of data in
our classifiers.
d. Prediction:
Our finally selected and best performing classifier was algorithm
which was then saved on disk with name final_model.sav. Once you
close this repository, this model will be copied to user's machine and
will be used by prediction.py file to classify the Heart diseases. It takes
a news article as input from user then model is used for final
classification output that is shown to user along with probability of
truth.

➢ DATA FLOW DIAGRAM


The data flow diagram (DFD) is one of the most important tools used
by system analysis. Data flow diagrams are made up of number of
symbols, which represents system components. Most data flow
modelling methods use four kinds of symbols: Processes, Data stores,
Data flows and external entities.
These symbols are used to represent four kinds of system
components. Circles in DFD represent processes. Data Flow
represented by a thin line in the DFD and each data store has a unique
name and square or rectangle represents external entities.

30
Fig : Data Flow diagram Level 0

31
IMPLEMENTATION

➢ STEPS FOR IMPLEMENTATION:


1. Install the required packages for building the Passive Aggressive
Classifier
2. Load the libraries into the workspace from the packages.
3. Read the input data set.
4. Normalize the given input dataset.
5. Divide this normalized data into two parts:
a. Train data
b. Test data (Note: 80% of Normalized data is used as Train data,
20% of the Normalized data is used as Test data.)

➢ SOURCE CODE:
I hope you now have understood why we need to predict the grades of a student.
Now let’s see how we can use machine learning for the task of student grades
prediction using Python. I will start this task by importing the necessary Python
libraries and the dataset:

Dataset

In [1]:

Import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle

data = pd.read_csv("student-mat.csv")
data.head()

32
Out [1]:

Note: The dataset contains more columns.

The dataset that I am using for the task of students grade prediction is based on
the achievements of the students of the Portuguese schools. In this dataset the G1
represents the grades of the first period, G2 represents the grades of the second
period, and G3 represents the final grades. Now let’s prepare the data and let’s
see how we can predict the final grades of the students:

In [2]:

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]


predict = "G3"
x = np.array(data.drop([predict], 1))
y = np.array(data[predict])

from sklearn.model_selection import train_test_split


xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

In the above code, I first selected the necessary columns that we need to train a
machine learning model for the task of student grades prediction. Then I declared
that the G3 column is our target label and then I split the dataset into 20% testing
and 80% training. Now let’s see how to train a linear regression model for the
task of student grades prediction:
In [3]:
linear_regression = LinearRegression()
linear_regression.fit(xtrain, ytrain)
accuracy = linear_regression.score(xtest, ytest)
print(accuracy)

33
Out [3]:
0.8432876775479776

The linear regression model gave an accuracy of about 84% which is not bad in
this task. Now let’s have a look at the predictions made by the students’ grade
prediction model:
In [4]:
Predictions=linear_regression.predict(xtest)
for i in range(len(predictions)):
print(predictions[x], xtest[x], [ytest[x]])

Out [4]:
[[16.16395534 14.23423176 14.08532841 5.28096434 14.23423176]
[16.16395534 16.16395534 14.08532841 5.28096434 7.97291422]
[14.52779998 11.92149651 14.08532841 9.13993948 4.71694746]
...
[ 4.71694746 11.92149651 3.9451298 9.13993948 9.13993948]
[12.56424351 4.92497623 3.9451298 5.28096434 5.28096434]
[11.92149651 9.05247158 3.9451298 5.28096434 16.16395534]] [[[15 16 2 0 2]
[15 14 2 0 2]
[15 14 3 0 6]
[ 7 6 2 0 10]
[15 14 2 0 2]]....

34
CONCLUSION

Predicting student grades is one of the key performance indicators that can help
educators monitor their academic performance. Therefore, it is important to have
a predictive model that can reduce the level of uncertainty in the outcome for an
imbalanced dataset. This paper proposes a multiclass prediction model with six
predictive models to predict final student’s grades based on the previous student
final examination result of the first-semester course. Specifically, we have done
a comparative analysis of combining oversampling SMOTE with different FS
methods to evaluate the performance accuracy of student grade prediction. We
also have shown that the explored oversampling SMOTE is overall improved
consistently than using FS alone with all predictive models. However, our
proposed multiclass prediction model performed more effectively than using
oversampling SMOTE and FS alone with some parameter settings that can
influence the performance accuracy of all predictive models. Here, our findings
contribute to be a practical approach for addressing the imbalanced multi-
classification based on the data-level solution for student grade prediction.

35

You might also like