Predicting Students Marks in Hellenic Open Univers

This document discusses predicting students' marks in the Hellenic Open University using regression analysis techniques. It compares six regression algorithms using a dataset of 354 students from an introductory informatics course. The M5rules algorithm proved most accurate at predicting marks. The authors constructed a software tool for tutors using M5rules to help identify students needing additional support.

Predicting Students Marks in Hellenic Open University

Conference Paper · January 2005


DOI: 10.1109/ICALT.2005.223 · Source: DBLP


2 authors:

Sotiris Kotsiantis P. E. Pintelas


University of Patras University of Patras

All content following this page was uploaded by P. E. Pintelas on 27 May 2014.



Predicting Students’ Marks in Hellenic Open University

Sotiris B. Kotsiantis & Panayiotis E. Pintelas


Educational Software Development Laboratory
Department of Mathematics University of Patras
{sotos, pintelas}@math.upatras.gr

Abstract

The ability to provide assistance for a student at the appropriate level is invaluable in the learning process. Not only does it aid the student’s learning process, but it also prevents problems such as student frustration and floundering. Students’ key demographic characteristics and their marks in a small number of written assignments can constitute the training set for a regression method in order to predict the student’s performance. This work compares some of the state-of-the-art regression algorithms in the application domain of predicting students’ marks. A number of experiments have been conducted with six algorithms, which were trained using datasets provided by the Hellenic Open University. Finally, a prototype version of a software support tool for tutors has been constructed implementing the M5rules algorithm, which proved to be the most appropriate among the tested algorithms.

1. Introduction

The application of machine learning techniques in predicting students’ performance has proved helpful for identifying poor performers, and it can enable tutors to take remedial measures at an earlier stage, even from the very beginning of an academic year using only students’ demographic data, in order to provide additional help to the groups at risk [4]. The diagnosis of students’ performance improves as new curriculum data is entered during the academic year, offering the tutors more effective results. It was shown in [4] that the most accurate machine learning algorithm for identifying predicted poor performers is the Naïve Bayes classifier. However, that work could only predict whether a student passes a course module or not.

This paper uses existing regression techniques in order to predict the students’ marks in a distance learning system. It compares some of the state-of-the-art regression algorithms to find out which algorithm is more appropriate not only to predict students’ performance accurately but also to be used as an educational supporting tool for tutors. For the purpose of our study the ‘informatics’ course of the Hellenic Open University (HOU) provided the data set.

Generally, the usage of regression analysis to classify data can be an extremely useful tool for researchers and Open University administrators. A plethora of data can be utilized simultaneously to classify cases, and the resultant model can be evaluated for usefulness relatively easily. The ability to develop a predictive model based on the model produced through the regression analysis procedure increases its usefulness substantially. Open Universities can utilize this dynamic and powerful procedure to target services and interventions to the students who need them most, thereby utilizing their resources more effectively.

The following section describes in brief the Hellenic Open University (HOU) distance learning methodology and the data of our study. Some very basic definitions about regression techniques are given in section 3. Section 4 presents the experiment results for all the tested algorithms and compares them. Section 5 presents the produced educational decision support tool. Finally, section 6 discusses the conclusions and some future research directions.

2. Hellenic Open University and Data Description

The mission of the Hellenic Open University (HOU) is to offer university level education using the distance learning methodology. The basic educational unit of the HOU is the course module (referred to simply as module from now on), which covers a specific subject at graduate and postgraduate level. For the purpose of our study the ‘informatics’ course provided the training set. A total of 354 instances (student’s

Proceedings of the Fifth IEEE International Conference on Advanced Learning Technologies (ICALT’05)
0-7695-2338-2/05 $20.00 © 2005 IEEE
records) have been collected from the module ‘Introduction to Informatics’ (INF10) [12].

Regarding the INF10 module of HOU, during an academic year students have to hand in 4 written assignments, optionally participate in 4 face-to-face meetings with their tutor and sit for final examinations after an 11-month period. A student with a mark >=5 ‘passes’ a lesson or a module, while a student with a mark <5 ‘fails’ to complete a lesson or a module.

Generally, a student must submit at least three assignments (out of 4). Subsequently, the tutors evaluate these assignments, and a total mark greater than or equal to 20 should be obtained in order for a student to successfully complete the INF10 module. Students who meet the above criteria may sit the final examination test.

The attributes (features) of our dataset are presented in Table 1 along with the values of every attribute. The set of attributes was divided into 3 groups: the ‘Registry Class’, the ‘Tutor Class’ and the ‘Classroom Class’. The ‘Registry Class’ represents attributes which were collected from the Student’s Registry of the HOU concerning students’ sex, age, marital status, number of children and occupation. In addition to the above attributes, the previous (post high school) education in the field of informatics and the association between students’ jobs and computer knowledge were also taken into account. If a student has attended at least one seminar (of 100 hours or more) on Informatics after high school, then he/she would qualify as ‘yes’ in computer literacy. Moreover, students who use software packages (such as a word processor) at their job without having any deep knowledge of informatics were considered ‘junior-users’, while students who work as programmers or in data processing departments were considered ‘senior-users’. The remaining students’ jobs were listed as ‘no’ concerning association with computers.

The ‘Tutor Class’ represents attributes which were collected from tutors’ records concerning students’ marks on the written assignments and their presence or absence in face-to-face meetings. Finally, the ‘class attribute’ represents the result on the final examination test.

The analysis of the demographic attributes showed that the ratio of men who passed the exams vs. men who failed is 48–52%, while for women this ratio drops to 39–61%. Moreover, it should be noted that the percentage of students below 32 years old who pass the exams is 46%, while the corresponding number for older students is 44%. Another interesting fact is related to student performance and marital status. A married student is about as likely to pass the exams as to fail (51%), while a single student has only a 41% probability of passing the module. A similar situation holds with the existence of children: a student with children has a 52% probability of passing the module, while a student without children has only 43%. This is probably due to the fact that the family obligations are known and have been taken into consideration prior to the commencement of the studies. It must also be mentioned that the workload splits the probabilities right down the middle.

Table 1. The attributes used and their values

Student’s Registry (demographic) attributes
  Sex                              male, female
  Age                              24-46
  Marital status                   single, married, divorced, widowed
  Number of children               none, one, two or more
  Occupation                       no, part-time, fulltime
  Computer literacy                no, yes
  Job associated with computers    no, junior-user, senior-user
Attributes from tutors’ records
  1st face to face meeting         absent, present
  1st written assignment           no, 0-10
  2nd face to face meeting         absent, present
  2nd written assignment           no, 0-10
  3rd face to face meeting         absent, present
  3rd written assignment           no, 0-10
  4th face to face meeting         absent, present
  4th written assignment           no, 0-10
Class
  Final examination test           0-10

On the contrary, as far as the demographic attributes are concerned, a stronger correlation exists between student performance and the existence of previous education in the field of Informatics. The ratio of students who have previous education in the field of Informatics and pass the exams vs. those who fail is 51–49%, while for the remaining students this ratio drops to 28–72%. A similar correlation exists with involvement in professional activities demanding the use of a computer. The students who use a computer in their job have a 52% probability of passing the module, while the remaining students have only 32%.

Until now, we have described how each demographic attribute influences the prediction based on our dataset. In order to show in which direction

(pass or fail) each of the remaining attributes’ values pushes the induction, some practical probabilities are estimated in Table 2. The interpretation of Table 2 is easy enough, and it shows, for example, that a student with a mark of more than 6 in WRI-4 is about 4 times more likely to pass than to fail (0.65/0.17).

Table 2. Influence of each attribute

Attribute  Value         Pass  Fail
WRI-4      Mark<3        0.04  0.68
           3<=Mark<=6    0.31  0.15
           Mark>6        0.65  0.17
WRI-3      Mark<3        0.03  0.61
           3<=Mark<=6    0.21  0.2
           Mark>6        0.66  0.19
WRI-2      Mark<3        0.08  0.52
           3<=Mark<=6    0.15  0.26
           Mark>6        0.77  0.22
FTOF-4     Absent        0.23  0.76
           Present       0.77  0.24
FTOF-3     Absent        0.2   0.65
           Present       0.8   0.35
WRI-1      Mark<3        0.02  0.19
           3<=Mark<=6    0.14  0.35
           Mark>6        0.84  0.46
FTOF-2     Absent        0.22  0.54
           Present       0.78  0.46

Subsequently, in an attempt to show how much each attribute influences the induction, we ranked the influence of each one according to a statistical measure, RRELIEF [9]. The demographic attributes that most influence the induction are ‘sex’ and ‘children’. In addition, it was found that the 1st written assignment does not have a large influence. The reason is that almost all students try harder with the first written assignment, thus making the information offered by this attribute minimal and maybe confusing.

3. Regression Issues

The problem of regression consists in obtaining a functional model that relates the value of a continuous target variable y with the values of variables x1, x2, ..., xn (the predictors). This model is obtained using samples of the unknown regression function. These samples describe different mappings between the predictor and the target variables.

For the purpose of our comparison, the six most common regression techniques are used, namely Model Trees [10], regression rules (M5rules) [11], Neural Networks [5], Linear Regression (LR) [2], Locally Weighted linear Regression (LWR) [1] and Support Vector Machines [7].

The most well-known model tree inducer is M5′ [10]. The M5rules algorithm produces propositional regression rules in IF-THEN format using routines for generating a decision list from M5′ model trees [11]. BP (back propagation) is the most well-known algorithm for training Neural Networks. The sequential minimal optimization algorithm (SMO) differs from most SVM algorithms in that it does not require a quadratic programming solver. In [8] SMO is generalized so that it can handle regression problems (SMOreg).

4. Experiment Results

The learning algorithms are useful as a tool for identifying predicted poor performers [3]. With the help of machine learning, tutors are in a position to know from the beginning of the module, based only on curriculum-based data, which of the students will complete the module, with sufficient accuracy: it reaches 64% in the initial forecasts and exceeds 80% before the middle of the period [4]. After the middle of the period, we can use existing regression techniques in order to predict the students’ marks.

The experiments took place in two distinct phases. During the first phase (training phase) the algorithms were trained using the data collected from the academic year 2000-1. The training phase was divided into 5 consecutive steps. The 1st step included the demographic data, the first two face-to-face meetings and written assignments, as well as the resulting class (final mark). The 2nd step additionally included the third face-to-face meeting. The 3rd step additionally included the third written assignment. The 4th step additionally included the fourth face-to-face meeting, and finally the 5th step included all attributes described in Table 1.

Subsequently, ten groups of data for the new academic year (2001-2) were collected from 10 tutors, together with the corresponding data from the HOU registry. Each one of these 10 groups was used to measure the accuracy within these groups (testing phase). The testing phase also took place in 5 steps. During the 1st step, the demographic data as well as the first two face-to-face meetings and written assignments of the new academic year were used to predict the class (final student mark) of each student. This step was repeated 10 times (once for every tutor’s data). During the 2nd step these data along with the data from the third face-to-face meeting were used in order to predict the class of each student. This step was also repeated 10 times. During the 3rd step the data of the 2nd step along with the data from the third written assignment

were used in order to predict the student class. The remaining steps use data of the new academic year in the same way as described above. These steps are also repeated 10 times.

It must be mentioned that we used the freely available source code by [11] for our experiments. Table 3 presents the most easily understandable measure, the mean absolute error $\frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n}$ (where $p_i$: predicted values, $a_i$: actual values and $\bar{a} = \frac{1}{n}\sum_i a_i$), of each algorithm for all the testing steps of the experiment.

Table 3. Mean absolute error of each algorithm for all the testing steps

          M5′    BP     LR     LWR    SMOreg  M5rules
WRI-2     1.83   2.15   1.89   1.84   1.84    1.83
FTOF-3    1.74   2.08   1.83   1.79   1.78    1.74
WRI-3     1.55   1.79   1.6    1.53   1.56    1.55
FTOF-4    1.54   1.8    1.56   1.5    1.55    1.54
WRI-4     1.23   1.65   1.5    1.4    1.44    1.21

According to the results, M5rules is the most accurate regression algorithm to be used for the construction of a software support tool (even though in most of the testing steps there is no statistically significant difference between the algorithms according to the corrected resampled t-test [6]). Moreover, besides its better performance, another advantage of M5rules is its better comprehensibility.

5. Software Support Tool

A prototype version of the software support tool has already been constructed and is in use by the tutors. The tool expects the training set as a spreadsheet in CSV (Comma-Separated Value) file format. The tool assumes that the first row of the CSV file is used for the names of the attributes. There is no restriction on the order of the attributes. However, the class attribute must be in the last column. It must be mentioned that the attributes used are not a conclusive list. An extension can introduce new attributes that were not in the current database but are collectable by tutors and may potentially contribute to the prediction of academic achievement, for example measures of different intellectual abilities, interests, motivation and personality traits of students.

Once the database is in a single relation, each attribute is automatically examined to determine its data type (for example, whether it contains numeric or symbolic information). A feature must have the value ? to indicate that no measurement was recorded. After opening the data set that characterizes the problem for which the user wants a prediction, the tool automatically uses the corresponding attributes for training.

After the training of the model, the user is able to see the produced regressor (the tool is available at the web page http://www.math.upatras.gr/~esdlab/Regression-tool/). The tool (Figure 1) can also predict the output of either a single instance or an entire set of instances (batch of instances). It must be mentioned that for a batch of instances the user must import an Excel CSV file with all the instances for which he/she wants predictions.

Figure 1. The prototype tool

The ranking of the attributes’ influence brought considerable benefits by helping the tutors to better understand the characteristics of the population that most affect academic achievement. For example, the prototype tool for the dataset used shows that the attributes that most influence the induction are ‘WRI-4’ and ‘WRI-3’ (Figure 2).
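The CSV conventions described in this section (attribute names in the first row, class attribute in the last column, ? for a missing measurement) and the mean absolute error used in Table 3 can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names `load_dataset` and `mean_absolute_error`, the sample rows and the mean-mark baseline predictor are all assumptions made for illustration, and the baseline merely stands in for a trained regressor such as M5rules.

```python
import csv
import io

MISSING = "?"  # marker meaning "no measurement was recorded"


def load_dataset(csv_text):
    """Split CSV text into attribute names, feature rows and class values.

    Assumes the layout described for the tool: the first row holds the
    attribute names and the class attribute is the last column.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    features = [
        {name: (None if value == MISSING else value)
         for name, value in zip(header[:-1], row[:-1])}
        for row in body
    ]
    targets = [float(row[-1]) for row in body]
    return header, features, targets


def mean_absolute_error(predicted, actual):
    """Return (|p1 - a1| + ... + |pn - an|) / n."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical sample in the expected layout (values invented for illustration).
SAMPLE = """sex,WRI-3,WRI-4,final
male,7,8,6.5
female,?,5,4.0
male,9,9,8.0
"""

header, features, targets = load_dataset(SAMPLE)

# Stand-in predictor (an assumption, NOT M5rules): always predict the mean mark.
baseline = sum(targets) / len(targets)
predictions = [baseline] * len(targets)
mae = mean_absolute_error(predictions, targets)
print(f"class attribute: {header[-1]}, baseline MAE: {mae:.2f}")
```

A constant baseline of this kind is only a reference point: a regressor is informative to the extent that its mean absolute error on unseen students falls below it.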

Figure 2. Ranking the attributes’ influence on the final prediction in our use case

6. Conclusion

This paper aims to fill the gap between empirical prediction of student performance and the existing regression techniques. Our data set is from the module INF10, but most of the conclusions are wide-ranging and of interest for the majority of programs of study of the Hellenic Open University and, more generally, for all distance education programs. It would be interesting to compare our results with those from other open and distance learning programs offered by other open universities. So far, however, we have not been able to find such results.

Generally, the education domain offers many interesting and challenging applications for data mining. Firstly, an educational institution often has many diverse and varied sources of information. There are the traditional databases (e.g. students’ information, teachers’ information, class and schedule information, alumni information), online information (online web pages and course content pages) and, more recently, multimedia databases. Secondly, there are many diverse interest groups in the educational domain that give rise to many interesting mining requirements. For example, the administrators may wish to find out information such as admission requirements and to predict the class enrollment size for timetabling. The students may wish to know how best to select courses based on a prediction of how well they will perform in the courses selected.

In a future study we intend to apply data mining methods with the goal of answering the following two research questions:

1) Are there groups of students who use online resources in a similar way? Based on the usage of the resource by other students in the group, can we help a new student use the resources better?
2) Can we classify the learning difficulties of the students? Can we help instructors to develop the homework more effectively and efficiently?

7. References

[1] Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11–73.
[2] Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. ISBN: 080394540X, Sage Pubns.
[3] Kotsiantis, S., Pierrakeas, C., Pintelas, P. (2003). Preventing student dropout in distance learning systems using machine learning techniques. Lecture Notes in AI, Springer-Verlag, Vol. 2774, pp. 267–274.
[4] Kotsiantis, S., Pierrakeas, C., Pintelas, P. (2004). Predicting Students’ Performance in Distance Learning Using Machine Learning Techniques. Applied Artificial Intelligence (AAI), Volume 18, Number 5 / May-June 2004, pp. 411–426.
[5] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[6] Nadeau, C., Bengio, Y. (2003). Inference for the Generalization Error. Machine Learning, 52, 239–281.
[7] Platt, J. (1999). Using sparseness and analytic QP to speed training of support vector machines. In: Kearns, M. S., Solla, S. A. & Cohn, D. A. (Eds.), Advances in neural information processing systems 11. MA: MIT Press.
[8] Shevade, S., Keerthi, S., Bhattacharyya, C., and Murthy, K. (2000). Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5):1188–1193.
[9] Sikonja, M. and Kononenko, I. (1997). An adaptation of Relief for attribute estimation in regression. Proceedings of the Fourteenth International Conference (ICML'97), ed. Doug Fisher, pp. 296–304. Morgan Kaufmann Publishers.
[10] Wang, Y. & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In Proc. of the Poster Papers of the European Conference on ML, Prague (pp. 128–137).
[11] Witten, I. H., Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Mateo, CA.
[12] Xenos, M., Pierrakeas, C. and Pintelas, P. (2002). A survey on student dropout rates and dropout causes concerning the students in the course of informatics of the Hellenic Open University. Computers & Education (39): 361–377.
