
Student Performance Prediction using Multi-Layers Artificial Neural Networks: A Case Study on Educational Data Mining

Saud Altaf
Pir Mehr Ali Shah Arid Agriculture University, Rawalpindi, Pakistan
[email protected]

Waseem Soomro
Manukau Institute of Technology, New Zealand
[email protected]

Mohd Izani Mohamed Rawi
Universiti Teknologi MARA, Malaysia
[email protected]

ABSTRACT
In recent years, the Neural Network (NN) has seen widespread and successful implementation in a wide range of data mining applications, often surpassing other classifiers. This study investigates whether NNs are a fitting classifier to predict student performance from Learning Management System data in the context of Educational Data Mining. The dataset used for this study is a Moodle log file containing log data on around 900 students over 10 university courses. To assess the applicability of Neural Networks, two case studies compare their predictive performance on this dataset. The features used for training originate from LMS data obtained during the length of each course, and range from usage data, such as time spent on each course page, to grades obtained for course assignments and quizzes. After training, the Neural Network outperforms all six classifiers in terms of accuracy and is on par with the best classifiers in terms of recall. We also assessed the effect course predictors have on predictive performance by leaving out the course identifiers in the data; this does not affect the predictive performance of the classifiers. Furthermore, the Neural Network is trained on individual course data to assess differences in classification performance between courses. The results show that half of these course classifiers outperform the generally trained classifiers. The importance of the individual predictors used for classification was also investigated, with previously obtained grades contributing most to successful predictions. We conclude that the proposed neural network architecture works well with the selected feature data sets. According to the results, satisfactory accuracy in student performance prediction on the feature vector has been achieved through appropriate classification, supporting better decisions for efficient prediction of student performance.

CCS Concepts
• Information systems ➝ Information systems applications • Data mining ➝ Data Stream Mining.

Keywords
Student Performance Prediction, Educational data mining, Neural Network, Classification, training, data sets.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICISDM 2019, April 6–8, 2019, Houston, TX, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6635-9/19/04…$15.00
https://doi.org/10.1145/3325917.3325919

1. INTRODUCTION
In recent years, the use of internet-based educational tools has grown rapidly [1], as well as the research surrounding them (see Figure 1). These tools provide a clear advantage for students and teachers alike, with the ability to access and share course data from anywhere in the world, track student progress and provide rich educational content. These tools generate vast amounts of data obtained in a non-obtrusive manner that can give a better look into the way students learn and interact with course materials. The challenge is to put these data to good use to improve the educational process. One of the purposes these data can be used for is the prediction of whether a student is going to pass or fail a course. Being able to predict student performance enables a teacher or educational institution to provide appropriate assistance to students that are at risk of missing the mark. Assisting them in a timely manner will reduce the number of students failing a course and may indirectly reduce the number of students dropping out of their educational program.
This is a societal interest that can have a positive impact on students, parents, teachers and educational institutions equally. When the data comes from an educational setting, we are dealing with a subdomain of data mining called Educational Data Mining, or EDM. This is a field of research that applies data mining, statistics and machine learning to data derived from educational environments. It seeks to extract meaningful information from vast amounts of raw data that can be used to improve and understand learning processes [2]. In order to extract interesting information, like predicting if a student requires academic assistance, we can make use of machine learning algorithms that can automatically predict this outcome based on the data. In the field of EDM, a wide set of machine learning algorithms have already been used to various degrees of success, like Naive Bayes Classifiers, k-Nearest Neighbours, Random Forests, Decision Tree Classifiers, Support Vector Machine algorithms and Neural Networks [3]-[6].
Most EDM studies investigating student performance prediction have used small samples with little diversity in the courses they analysed [4]. This results in potentially low portability of the results due to the small sample size and differences that might exist between courses; a liberal arts course requires a different approach from a technical course. Furthermore, most studies use a wide variety of grades obtained in previous courses or previous academic curricula [5], where these previous grades have been shown to be strong predictors of future academic success [5]-[6]. But these grades might not always be available to use as predictors, thus limiting the predictive capacity of the algorithms devised in these studies.
The goal of the classifiers will be to predict whether a student will require academic assistance, because he or she is at risk of failing the course, or does not require any assistance. As such, it can be cast as a binary classification problem, where the two prediction labels are "requires assistance" for students that are at risk of failing the course, and "does not require assistance" for students that are not at risk. The data used to perform this classification is extracted from the log file of a Campus Management System (CMS) containing information about 900 students over 10 courses. This allows us to compare the predictive performance between courses and assess if predictors identifying individual courses have an effect on performance. Additionally, the effects of sample size and the importance of individual predictors will be investigated.

2. EDUCATIONAL DATA MINING
Research performed in this study can be classified under EDM. EDM is a sub-group of data mining that focuses on researching, developing and applying various automated methods to explore large-scale data coming from educational settings. This is done to increase the understanding of the way students learn, study educational questions and improve the effectiveness of teaching and learning activities [7]. This goal is achieved by transforming the raw data into information that can have a direct impact on educational practice and research [8].
EDM is becoming increasingly widespread nowadays. The past couple of years have seen a rapid rise in the number of research papers dedicated to EDM in its various forms [9]. This has been linked to the increase of available educational data and the widespread availability of cheap computing power and accessible digital tools [10]. With such a wide availability of high-quality data and the potential to derive valuable educational insights, educational institutions, governments and researchers are increasingly looking for ways to put these techniques to good use.
The data analyzed in EDM come from various sources like Learning Management Systems, administrative data from universities and schools, and other structured or unstructured databases pertaining to education. Due to the habitually large size of these databases, they require a computerized approach to discern the patterns and relationships they contain [10]. In this study, the data contains interaction records for 900 students, making it too voluminous to derive useful insights by hand or through non-automated means. In this case, an EDM approach is recommendable to extract the information it contains.
In this study, the insights that are extracted from the data concern the prediction of student performance, which is a subdomain of EDM. This can be used to prevent students from failing courses by intervening in their educational process, predict a student's potential to plan an optimal curriculum, give students insight into their learning process or develop more effective instruction techniques [6]. The focus of this study is the applicability of specific machine learning methods in forecasting whether a student does or does not require academic assistance for a certain course. Such knowledge can help to prevent students from dropping out of their courses or educational program [8].

3. PROPOSED MULTILAYER FEED FORWARD NEURAL NETWORK ARCHITECTURE
The Multi-Layer Feed Forward Neural Network (MLFFNN) is the simplest type of ANN, in which information travels in one direction only, i.e. from input to output. The MLFFNN permits the creation of a decision boundary formed by different hyperplanes (in n-dimensional space). Multilayer Feed Forward Neural Networks are often called Multi-Layer Perceptrons (MLPs) because of their similarity to human perception [3]. An MLP has a minimum of three layers of neurons: input, hidden and output respectively. This means that it is only interconnected within the network and not connected with the surroundings. Mostly, only one hidden layer is used for the perceptron. In certain cases, an MLP can have more than one hidden layer where the input units are linear. But it has been proven that one hidden layer is sufficient to estimate any continuous non-linear function, provided that a sufficient number of input units has been inserted into the network [4]. The general structure of a fully connected MLP with input nodes, hidden neurons and output neurons is displayed in the following Figure (1).

[Figure omitted: schematic of a fully connected MLP with bias, input layer (X1 … Xi), one hidden layer, and output layer (y1 … yk).]
Figure 1: Proposed Schematic Architecture for the MLFFNN

Here, θji represents the connection between the i th input-layer neuron and the j th hidden-layer neuron. Similarly, θkj symbolizes the connection between the j th hidden-layer neuron and the k th output-layer neuron, and S(v) is the signal vector.
A feed-forward network with Xi inputs and Yk output signals is shown in Figure (1). The computational procedure within the i th layer can be illustrated by the following Equation (1):

S(v) = fi( Θ(i) g(i−1)(A) )                              (1)

g(i−1) = (1 − e^(−S(v−1))) / (1 + e^(−S(v−1)))           (2)

where S(v) = [S1(v) S2(v) S3(v) … SNi(v)]T is the signal vector at the output of the i th layer; fi(·) is the activation function of the neurons in the i th layer; g is the bipolar sigmoid function of the i th layer; and A is the vector containing the input signal for i = 1.

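As an illustrative sketch of the forward pass in Equations (1)-(2), the layer-by-layer computation can be written in a few lines of Python. The layer sizes and weight values below are hypothetical, not taken from the study:

```python
import numpy as np

def bipolar_sigmoid(x):
    # Bipolar sigmoid activation, as in Equation (2); outputs lie in (-1, 1)
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def mlp_forward(A, weights):
    """Forward pass of a fully connected feed-forward network.

    A: input feature vector (the signal for i = 1); weights: one
    matrix per layer, shaped (units in layer, units in previous layer).
    """
    signal = A
    for theta in weights:
        # Equation (1): next signal = activation(Theta @ previous signal)
        signal = bipolar_sigmoid(theta @ signal)
    return signal

rng = np.random.default_rng(0)
# A hypothetical [4x12x3] architecture: 4 inputs, 12 hidden, 3 outputs
weights = [rng.normal(size=(12, 4)), rng.normal(size=(3, 12))]
y = mlp_forward(rng.normal(size=4), weights)
print(y.shape)  # (3,)
```

Training such a network (the back-propagation step described in the next section) adjusts the entries of each weight matrix; the forward pass itself stays exactly this simple.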
The input array vector feature elements are inserted into the network through the weight matrix Θ between the (i−1) th and i th layers as follows:

Θ(i) = [ θ(i)11    θ(i)12    θ(i)13    …  θ(i)1,X(i−1)
         θ(i)21    θ(i)22    θ(i)23    …  θ(i)2,X(i−1)
         θ(i)31    θ(i)32    θ(i)33    …  θ(i)3,X(i−1)
          ⋮          ⋮         ⋮             ⋮
         θ(i)Xi,1  θ(i)Xi,2  θ(i)Xi,3  …  θ(i)Xi,X(i−1) ]        (3)

All the neurons in a certain layer are supposed to be similar in all features, and the number of hidden layers can be dynamically adjusted according to the inputs. So, the network output of the processed information in the neural network is presented by the following output array vector:

y = S(r) = [ y1 y2 y3 … yk ]T                                    (4)

where r is the number of processing hidden layers toward the output layer.
Mostly, MLPs support the Back Propagation Neural Network (BPNN) algorithm with supervised training. It was originally known as the generalized delta rule [6]. The training of an MLP network becomes more complex due to the effect of the training algorithm and the weight adjustment between the input, hidden and output layers. BPNN can be divided into two levels. Firstly, the comparable training of the Perceptron and Adaline for weight calculation between the hidden and output layers is performed. Secondly, if no desired output is available from the hidden layer, the errors from the output layer are back-propagated to try different weight configurations between the input and hidden layers.

4. DATA SETS AND PREDICTOR PREPARATION
The data used in this study was a log file of a Campus Management System (CMS) containing every single user action logged by the system. It spans the academic year 2016-2017 and contains log information about 10 courses with a total of 900 students. The large sample size heightens the probability of the data being diverse and representative of a wide variety of students. We will use this large dataset to assess the importance of sample size on the classification performance of the Neural Network. Furthermore, the availability of the 10 courses allows us to compare the effect courses have on classification performance, which can give an insight into whether the model could be applicable to other unseen courses.
Once the predictors were extracted, the data needed to be normalized. Normalization is necessary for some machine learning algorithms to work properly, like the k-Nearest Neighbour classifier that depends on distance measurements for its objective function. If a feature has a range of values (variance) that exceeds that of other features, it may dominate the objective function of the classifier and make it difficult for features with smaller variance to influence the learning process. In total, 10 possible predictors are used, and their descriptive statistics are as follows:
1. CourseID
2. Total learning sessions
3. Total length of session
4. Average of all session length
5. Total assessments in one semester
6. Mean assessment grade
7. Number of quizzes made
8. Total number of emails sent
9. Number of CMS forum posts
10. Grade
The normalization was performed using the Scikit-Learn StandardScaler function. This scaler works by subtracting the column mean and dividing by the column standard deviation for each column. This results in a mean of 0 and variance of 1 for each feature. All predictors are used to forecast whether a student will pass or fail the course.

5. RESULTS
As a case study, in this experiment we want to measure the effect that knowing which course an instance belongs to has on the classification performance. In order to get a better insight into the importance of each predictor, feature importance statistics were extracted from the data, including CourseID, using the Random Forest classifier. The results can be found in Figure (2). These measures can give an insight into how informative certain features are for classification.

Table 1: Descriptive statistics of used predictors
Predictors | Existing rate in CMS Data Set | Mean | Standard Deviation
CourseID | 10 | - | -
Total learning sessions | 900 | 660.8 | 661.2
Total length of session | 900 | 30.7 | 28.6
Average of all session length | 900 | 20.8 | 22.6
Total assessments in one semester | 450 | 5.0 | 4.7
Mean assessment grade | 430 | 6.9 | 6.8
Number of quizzes made | 400 | 6.0 | 6.3
Total number of emails sent | 280 | 1.9 | 2.3
Number of CMS forum posts | 220 | 2.6 | 2.3
Grade | 900 | 5.6 | 5.3

The feature with the highest importance is regularity, a measure of how regularly a student accessed the course page, with 12.9%. CourseID obtained a lower importance score at 5.3%, ranking ninth out of fourteen features, while Mean Quiz Grade was ranked eighth with 8.7%. Predictors like assignment grade, number of assignments made, number of messages sent and number of CMS forum posts all had feature importance below 1.0% and thus have low predictive value. The low importance of assignment grade, while still being a previously obtained grade, could be attributed to the fact that it is only available as a predictor for 2 out of the 10 courses.
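Feature importance shares like these can be read from Scikit-Learn's Random Forest classifier via its `feature_importances_` attribute. A minimal sketch on a made-up dataset (the feature names and data are hypothetical, not the study's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Hypothetical data: 200 students, 4 predictors; the last column
# (a stand-in for a previously obtained grade) drives the label
X = rng.normal(size=(200, 4))
y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature;
# the values sum to 1, so each reads as a relative share
for name, imp in zip(["sessions", "length", "posts", "grade"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On this synthetic data the "grade" column dominates the importances, mirroring the study's observation that previously obtained grades are the strongest predictors.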

[Figure omitted: bar chart of relative feature importance (%), from 0 to 14%, for the features of Table 1.]
Figure 2: Random Forest relative feature importance.

Subsequently, the Neural Network is trained on the data of each course individually, and its classification accuracy and recall are measured for each course. This is in order to determine whether there is an advantage to training the network on all the data or on the data for each individual course, as well as to examine the differences in predictive performance between courses. The results of this experiment can be found in Table (4).
The next step is to identify the uncertainty in the data sets in different time frames from the predictors using the multi-layer FFNN. It can be seen in Table (3) that the architecture [4x12x3] presented the best Mean Squared Error (MSE) performance in classification among all tested architectures. Reasonable numbers of epochs were used during the training process, with suitable processing time to attain the required precision, which demonstrates good efficiency across all tested architectures along with the smallest error percentage.
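The MSE figures reported in Table (3) are the usual mean squared error between network outputs and targets. As a quick sketch (the output and target values below are illustrative only, not the study's):

```python
import numpy as np

def mse(outputs, targets):
    # Mean squared error over all output neurons and test samples
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    return float(np.mean((outputs - targets) ** 2))

# Hypothetical network outputs vs. their binary targets
outputs = [0.92, 0.05, 0.10, 0.88]
targets = [1.0, 0.0, 0.0, 1.0]
print(mse(outputs, targets))  # on the order of 10^-3, like Table (3)
```

Training drives this quantity down epoch by epoch, which is why Table (3) reports MSE alongside the epoch count for each architecture.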

Table 4: Performance of Neural Network per course
Course Code | Course name | Credit hours | No. of Students | Accuracy | Baseline Accuracy | Recall | Quiz Grades | Assessment Grades
CS-301 | Introduction to Computing | 3 (2-2) | 280 | 74.3 | 69.7 | 68.9 | Yes | Yes
CS-323 | Programming Fundamentals | 4 (3-2) | 270 | 75.8 | 69.6 | 72.7 | No | Yes
SSH-303 | Professional Ethics | 3 (3-0) | 110 | 97.1 | 96.2 | 97.6 | Yes | Yes
ENG-325 | Communication Skills | 3 (3-0) | 130 | 80.0 | 78.5 | 80.9 | Yes | Yes
MTH-310 | Multivariable Calculus | 3 (3-0) | 110 | 90.0 | 88.5 | 89.2 | Yes | No
CS-572 | Numerical Analysis | 3 (2-2) | 90 | 65.2 | 62.7 | 64.7 | Yes | Yes
CS-632 | Artificial Intelligence | 3 (2-2) | 70 | 95.0 | 94.9 | 95.9 | Yes | Yes
CS-666 | Web Engineering | 3 (2-2) | 87 | 92.1 | 90.2 | 92.0 | No | Yes
CS-682 | System Programming | 3 (2-2) | 75 | 83.1 | 79.4 | 80.2 | Yes | Yes
CS-692 | Visual Programming | 3 (2-2) | 90 | 90.0 | 87.9 | 88.0 | Yes | No
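A common choice for a baseline accuracy such as the one tabulated above is the score of a trivial majority-class predictor (whether that is the exact definition used in Table 4 is an assumption here). A small sketch with made-up pass/fail labels:

```python
from collections import Counter

def baseline_accuracy(labels):
    # Accuracy of always predicting the most frequent class
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical pass/fail labels for one course of 100 students
labels = ["pass"] * 70 + ["fail"] * 30
print(baseline_accuracy(labels))  # 0.7
```

Comparing a classifier's accuracy against such a baseline shows how much it actually learns beyond the class imbalance, which is why the narrow accuracy-baseline gaps for some courses in Table 4 are informative.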

Table 3: Performance of different architectures for classification
Architecture | MSE | No. of Epochs | Accuracy | Error
[4x8x3] | 5.69×10^-3 | 69 | 91.5 | 8.5
[4x8x3] | 6.29×10^-3 | 72 | 94.4 | 5.6
[4x8x3] | 7.67×10^-3 | 85 | 91.6 | 8.4
[4x8x3] | 8.02×10^-3 | 99 | 92.0 | 8.0
[4x12x3] | 9.49×10^-3 | 117 | 96.2 | 3.8
[4x12x3] | 8.99×10^-3 | 125 | 96.3 | 3.7
[4x12x3] | 9.02×10^-3 | 132 | 97.4 | 2.6
[4x12x3] | 9.79×10^-3 | 131 | 97.1 | 2.9
[4x15x3] | 8.11×10^-3 | 369 | 92.1 | 7.9
[4x15x3] | 5.23×10^-3 | 325 | 91.5 | 8.5
[4x15x3] | 5.85×10^-3 | 344 | 81.1 | 18.9
[4x15x3] | 6.56×10^-3 | 362 | 85.8 | 14.2

After measuring the neural network's testing performance on the test data, the next stage was to estimate the classification confusion matrices for the different types of error that occurred during the training procedure, and to reduce them. To construct the
confusion matrix, test feature data is fed into the neural network model (see Figure 3). To measure the uncertainty level and accuracy, we tested the three architectures on the available data sets and trained them to achieve a reasonable accuracy rate. The confusion matrix holds the information about the comparison between the predicted and targeted classification classes. Figure (3) shows the confusion matrices for the three data set samples separately. Four predicted classes (horizontal and vertical) are depicted to cover all sample feature sequence collections. In the case of a correctly classified targeted class trial, the target cells are shown in green. Every diagonal cell shows the number of cases that have been classified correctly by the neural network. The cells in red indicate the number of cases that have been wrongly classified by the ANN model, or where the condition of the test features was not clearly recognized. The blue cell shows the overall rate of tested cases that were classified correctly (in green) and incorrectly (in red). In the case of sample 1, Figure (3) shows that each class had a maximum of 1300 testing trials that were predetermined in the model. Reading vertically, 947 trials were correctly labelled as class 1. A total of 13 trials were wrongly labelled as class 2, and 35 trials were wrongly classified as class 3.
In class 4, a total of 5 trials were incorrectly classified, due to the complex nature of the signal and the mixing of various elements in the data sets. When the confusion matrix is read horizontally, 24 trials of class 2, 30 trials of class 3 and 9 trials of class 4 were incorrectly classified by the model. Finally, the last row (in grey) shows the successful classification rate of each target class. A total of 3952 testing trials were classified, and the final success rate of the ANN was 97.4 percent.

[Figure omitted: three 4-class confusion matrices with per-class trial counts and percentages.]
Figure 3: Confusion matrices (a) [4x8x3] (b) [4x12x3] (c) [4x15x3]
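The per-class counts read off a figure like Figure 3 can be computed directly from target and predicted labels. A minimal sketch (the labels below are invented, not the study's trials):

```python
import numpy as np

def confusion_matrix(targets, preds, n_classes):
    # Rows are the predicted class, columns the target class, so
    # reading a column vertically gives the fate of one target class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(targets, preds):
        cm[p, t] += 1
    return cm

# Hypothetical targets and network predictions for a 4-class run
targets = [0, 0, 1, 1, 2, 2, 3, 3, 0, 2]
preds   = [0, 1, 1, 1, 2, 3, 3, 3, 0, 2]

cm = confusion_matrix(targets, preds, 4)
print(cm)
# Diagonal cells count correctly classified trials; the overall
# success rate is their sum over the total number of trials
print(cm.trace() / cm.sum())  # 0.8
```

The same trace-over-sum ratio on the full test set yields the overall success rate quoted in the text.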

Only a 2.6 percent error ratio occurred, which is a productive and sensible rate for this prediction process. In the confusion matrices of the other samples, the success rate is also satisfactory and demonstrates the good performance of the recommended NN model. From Figure (3), it can be observed that the selected neural network architecture [4x12x3] attained an acceptable and exceptional precision in the analysis of the feature vector, ranging from 96 to 98 percent. This indicates the efficiency of the ANN algorithm in decreasing the level of imprecision in assessment and the strength of the trained sample data.

6. DISCUSSION
The goal of this study was to assess to what extent Neural Networks can be used to predict student performance: assessing whether students did or did not need academic assistance, based on the CMS data. Everything considered in the analysis, when looking at accuracy, the Neural Network outperforms the other classifiers in this study, which agrees with results by [6]. In terms of recall, it is on par with the best performing classifiers tested here. Considering these performance indicators, we can say that the Neural Network is an excellent classifier to predict student performance, and can thus be used to predict whether a student requires academic help.

7. CONCLUSION AND FUTURE DIRECTION
The goal of this research was to assess to what extent Neural Networks can be used to predict student performance based on CMS data. We demonstrated that the predictive performance of the Neural Network trained on all targeted courses at once excelled in terms of accuracy. Leaving out the course predictor did not have a major impact on this performance. However, we also trained the Neural Network on each course individually, which resulted in an increase in performance for some course classifiers and a decrease for others compared to the performance of the Neural Network in the discussed case study. The effect of sample size was investigated, but no relation between sample size and unsatisfactory accuracy was found. Additionally, the feature importance analysis showed that previously obtained grades were the most valuable predictors for individually trained classifiers. For classification and training purposes, a supervised ANN architecture is presented to show the efficiency of data prediction. The simulated results showed the precise and general behaviour of the training process across different courses and predictors. To improve the Mean Squared Error rate, three types of ANN architecture were tested, and [4x12x3] demonstrated a reasonable number of hidden neurons with a high accuracy rate in the classification of the feature vector.
Future development of this research would extend toward the utilization of complex data sets based on multiple departments, and increasing the amount of targeted student data through the CMS database. A comparison of different faculties, together with other artificial intelligence techniques, would be an interesting area for more precise prediction when different parameter measurements can be taken in parallel on complex datasets.

8. References and Citations
[1] T. Devasia, Vinushree T. P. and V. Hegde, "Prediction of students performance using Educational Data Mining," 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, pp. 91-95, 2016.
[2] R. Asif, A. Merceron, S. A. Ali and N. G. Haider, "Analyzing undergraduate students' performance using educational data mining," Computers & Education, vol. 113, pp. 177-194, 2017.
[3] O. Edin and S. Mirza, "Data Mining Approach for Predicting Student Performance," Economic Review - Journal of Economics and Business, vol. X, issue 1, May 2012.
[4] P. Krina, V. Dineshkumar and S. Priyanka, "Performance prediction of students using distributed data mining," International Journal of Advanced Research in Computer and Communication Engineering, vol. 6, issue 3, March 2017.
[5] E. Yukselturk, S. Ozekes and Y. K. Türel, "Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program," European Journal of Open, Distance and E-Learning, 17.1, pp. 118-133, 2014.
[6] T.-O. Tran, H.-T. Dang and V.-T. Dinh, "Performance Prediction for Students: A Multi-Strategy Approach," Cybernetics and Information Technologies, 17.2, pp. 164-182, 2017.
[7] M. Goga, S. Kuyoro and N. Goga, "A Recommender for improving the student academic performance," Procedia - Social and Behavioral Sciences, vol. 180, pp. 1481-1488, May 2015.
[8] T. Mishra, D. Kumar and D. S. Gupta, "Mining Students' Data for Performance Prediction," in Proceedings of the International Conference on Advanced Computing & Communication Technologies, pp. 255-263, 2016.
[9] S. Altaf et al., "Fault diagnosis in Distributed Motor Network using Artificial Neural Network," 22nd IEEE International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM 2014), 2014.
[10] R. Campagni, D. Merlini, R. Sprugnoli and M. C. Verri, "Data mining models for student careers," Expert Systems with Applications, vol. 42, no. 13, pp. 5508-5521, Aug. 2015.