Charitopoulos 2017
Abstract— Educational data mining applies data mining methods and tools to education-related data, typically collected through the use of an e-learning platform. Data stored in an e-learning platform database include user-platform interaction events (counts of scrolls, mouse clicks or page loads), platform access times per session or in total, times between events and various assessment scores such as grades per quiz or per session test, final grades, etc. In the present paper we focus on the time between actions (TBA) taken by the learner while he/she interacts with the platform. TBA values relay information on the mode of interaction of an individual learner with the platform. The two major questions addressed are (i) whether TBA values follow any probability density function (PDF) and if so, which is the PDF that optimally fits the data, and (ii) whether the parameters of such optimally fitted PDFs might serve as features for the clustering of the learning content modules or sessions into clusters of similar characteristics or functionalities. Results verify that skewed (asymmetric) PDFs can be fitted on the TBA value histograms with adequate accuracy. Furthermore, the parameters of few optimally fitted PDFs, used as a feature vector, result in a meaningful clustering of learning content parts into clusters of similar “character”. Clustering results may then be used as a recommendation to the course designer / instructor, to improve content structure or to optimally distribute/sequence parts of the course material.

Keywords— Educational data mining, clustering, maximum likelihood parameter estimation, moodle, e-learning

I. INTRODUCTION

Educational data mining applies data mining methods and tools to education-related data, typically collected through the use of e-learning platforms for teaching and learning. The popularity of e-learning platforms and their widespread use across all grades of typical and continuing education and training, has made a wealth of data available for extraction and analysis. Data are collected and stored during the interaction of the learners with the e-learning environment. The stored data include actions taken by the user while interacting with the platform (e.g., counts of scrolls, mouse clicks or page loads), platform access times per session or in total, times between actions, as well as various assessment scores such as grades per quiz or per session test and final course grades. Following their extraction from the e-learning platform databases, data are analyzed by pattern analysis, classification and recognition methods, [1], [2], [3], in order to answer research questions referring to learners’ practices, strategies, learning outcomes and academic performance, [4], [5], [6]. Answers are expected to help optimize the education offered in terms of quality, efficiency, personalization and accessibility to wider social groups. Eventually, what is sought is an efficient knowledge representation scheme that would facilitate the transformation of data into knowledge on the underlying systems and on their interrelations. Education is considered as a complex system where strongly non-linear cause-effect relations prevail, [7].

Educational data mining has been extensively used by researchers who investigate a broad spectrum of Education-related issues. Problems addressed include, among others,

• the type of relations among the various features or quantities represented by the mined data, e.g., causality, (non-)linearity or correlation,

• the type of relations between access and usage data and learning outcomes achieved through an e-learning activity, and

• the feasibility of prediction of the learning outcomes (students’ performance, in terms of grades) on the basis of e-learning platform access, usage and interaction data.

Diverse factors are empirically selected and included in existing research on these problems, e.g., [8], [9], [10], [11], [12], [13], [14]. Selection of factors is mostly based on intuition as to their capacity to convey information on the learner’s behavior, status, strategies and achievements. Examples include student demographics, the learner-platform interaction frequency and regularity, the time spent in the platform either for study or for evaluation, the level of academic performance achieved and the predictability of academic performance on the basis of platform access and usage data. In [14], standard data mining methods, such as Linear Regression Analysis and Clustering, are employed in order to characterize the relations among the various factors examined. Results obtained show that user-platform interaction and time-spent-in-platform factors are fairly linearly correlated (although this does not necessarily imply a causality relation) while student performance factors do relate to student-platform
Moreover, data are extracted uniformly for each and every individual platform user; inactive users who enrolled in the course but did not attend, or dropped the course midway, are cleared from the data set by post-processing. Analysis of the post-processed data aims to relate students’ academic performance in the course to their behavior as e-learning platform users. The latter comprises factors such as participation, focus, personal pace of study, time spent versus the results obtained, etc.

Learners’ actions or events are logged through a distinct software component, the logger, developed and embedded in the moodle server installation package as a web service. It is in fact an API within moodle, used to asynchronously send events for storage.

In terms of software architecture, this moodle add-on offers three functionalities:

• Registration of the time stamp of an event, through the jQuery JavaScript library,

The API controller manages the logger API and the AddLog action manages the endpoint, which receives the events along with their metadata and registers them in the database. Metadata is processed by a separate software component.

The platform grants access to the stored data, at the address elemoodle.teipir.gr/logusers, to anyone who can provide ‘teacher’ role credentials in the moodle server.

C. Data extraction and presentation

The software application developed for data extraction and presentation allows the user to specify, through a graphical user interface (GUI), the group of data to be extracted. Requests for data extraction may refer to the data corresponding to a specific user ID or a specific course module (Section), along with a time period specified through a calendar. The presentation of the extracted data may be requested either in the form of raw data or in the form of statistics obtained by processing the raw data.
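As an illustration of the two request types described above (raw data versus statistics, selected by user ID or Section and by time period), the following minimal sketch filters a handful of logged events and derives TBA values from them. The record layout, the sample events and the function names are hypothetical; the actual logger schema is not given in the text, and TBA is assumed here to be the gap between consecutive time stamps of the same user.

```python
from datetime import datetime

# Hypothetical log records: (user_id, section, timestamp). The field
# names and values are illustrative assumptions, not the paper's data.
events = [
    ("u1", "A", datetime(2017, 3, 1, 10, 0, 0)),
    ("u1", "A", datetime(2017, 3, 1, 10, 0, 12)),
    ("u2", "A", datetime(2017, 3, 1, 10, 0, 5)),
    ("u1", "A", datetime(2017, 3, 1, 10, 0, 47)),
    ("u1", "M", datetime(2017, 3, 2, 9, 30, 0)),
]

def extract(events, start, end, user_id=None, section=None):
    """Raw-data request: events inside [start, end] that match the filters."""
    return [
        e for e in events
        if start <= e[2] <= end
        and (user_id is None or e[0] == user_id)
        and (section is None or e[1] == section)
    ]

def time_between_actions(rows):
    """Statistics request: seconds between consecutive events of each user."""
    by_user = {}
    for user, _section, ts in sorted(rows, key=lambda r: (r[0], r[2])):
        by_user.setdefault(user, []).append(ts)
    tba = []
    for stamps in by_user.values():
        tba += [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return tba

rows = extract(events, datetime(2017, 3, 1), datetime(2017, 3, 2),
               user_id="u1", section="A")
print(time_between_actions(rows))  # [12.0, 35.0]
```

Grouping by user before differencing keeps the gaps of different learners from being mixed, which matches the per-learner reading of TBA assumed here.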
Consequently, the GUI offers four (4) menu tabs, namely,

1. moodledata – by user,

2. moodledata – by module,

3. statistics – by user, and

4. statistics – by module.

Fig. 6 shows a sample screen of the GUI for the moodledata – by user menu tab.

The user of the application (‘teacher’ role in the platform) specifies through the GUI either an individual user ID or a group of user IDs or all user IDs of students subscribed in the specific course, along with the time period of interest.

Logged data that correspond to the request are extracted from the database and presented either in a detailed or in a summarized form (statistics menus). Fig. 7 shows a sample screen that presents the extracted data for a specified group of user IDs (menu moodledata-by user).

III. PDF FITTING AND OPTIMAL PARAMETER ESTIMATION

In order to answer the first research question, frequency histograms of the calculated TBA values are plotted for each one of the nine (9) Sections of the material. Fig. 8 shows sample histograms for Sections A and M, respectively. Their asymmetric form with a heavier right-side tail prompts us towards skewed rather than symmetric PDFs for fitting: the exponential distribution, the log-normal distribution and the gamma distribution are the candidates examined.

[Fig. 8: two histogram plots (images not reproduced). Vertical axis: Frequency, 0 to 70. Horizontal axis: Bin (TBA values, time), from 0.00001 to 0.00097 in steps of 0.00004, followed by a final "More" bin.]
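A minimal sketch of maximum likelihood estimation for the three candidate PDFs is given below. The exponential and log-normal estimates are the standard closed-form MLEs. The gamma MLE has no closed form, so a well-known closed-form approximation is used in its place; this substitution, the function names and the sample values are assumptions made for illustration, not the paper's implementation or data.

```python
import math

def fit_exponential(x):
    """MLE of the exponential mean (the sample mean)."""
    return sum(x) / len(x)

def fit_lognormal(x):
    """MLE of (mu, sigma): mean and population std of ln(x)."""
    logs = [math.log(v) for v in x]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def fit_gamma(x):
    """Closed-form approximation to the gamma MLE; returns (shape, scale)."""
    mean = sum(x) / len(x)
    # s > 0 whenever the sample values are not all identical.
    s = math.log(mean) - sum(math.log(v) for v in x) / len(x)
    shape = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return shape, mean / shape

# Hypothetical TBA sample in seconds (not data from the paper).
tba = [3.0, 5.0, 8.0, 12.0, 20.0, 45.0, 90.0]
print("exponential mean:", fit_exponential(tba))
print("log-normal (mu, sigma):", fit_lognormal(tba))
print("gamma (shape, scale):", fit_gamma(tba))
```

In this sketch the product of the gamma shape and scale equals the sample mean, that is, the exponential estimate. The same identity appears to hold in the tabulated fitting results of the paper, for example 1.8722 × 0.2622 ≈ 0.4909.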
Fig. 9. Data histogram (blue) and MLE fitted PDF (red) for Section A. Exponential (top), Log-normal (middle), Gamma (bottom).

Fig. 10. Data histogram (blue) and MLE fitted PDF (red) for Section M. Exponential (top), Log-normal (middle), Gamma (bottom).

A practical consideration regards the considerably raised value in the final bin of the histogram plots shown in Fig. 8. This is due to representation only: all data (TBA) values beyond the limit of the horizontal axis of these plots are counted in that last bin. However, when fitting PDF curves on the data histograms, these raised last values are excluded from the fit so as not to distort the curve fitted to the previous bins.

[Table rows; the original header row was not recovered. Column labels are inferred from the captions of Fig. 11 and Fig. 12.]

Section   μexp     μlgn       σlgn     Gamma parameters
C         0.4909   -10.2121   0.7025   1.8722   0.2622
E         0.9501   -9.7881    1.0061   1.0858   0.8750
G         0.9637   -9.7784    1.0004   1.0776   0.8943
K         0.8312   -9.8730    0.9426   1.1855   0.7011
M         0.5359   -10.1366   0.7115   1.8018   0.2974
P         0.7906   -9.9166    0.9299   1.2001   0.6588
R         0.8715   -9.8420    0.9577   1.1499   0.7579
T         0.8632   -9.8643    0.9701   1.1239   0.7680

Fig. 11. The numerical results in Table I, in the form of ‘plots’ across Sections. Exponential (top), Log-normal (middle), Gamma (bottom).

Fig. 12. Scatter plot of the Section clustering results, projected down to the 3-D space {μexp, μlgn, σlgn}. Cluster 1 Sections (red dots) and Cluster 2 Sections (blue dots), shown as points in this 3-D space.
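To make the use of fitted PDF parameters as a feature vector concrete, the sketch below clusters the Sections in the 3-D space {μexp, μlgn, σlgn} of Fig. 12. The clustering algorithm is not named in this part of the text, so plain k-means with two clusters is an assumption, chosen only because Fig. 12 shows two clusters. Reading the first three numeric columns of the table rows as (μexp, μlgn, σlgn), the farthest-pair initialization and the absence of feature scaling are likewise assumptions.

```python
import math
from itertools import combinations

# First three numeric columns of the table rows, read as
# (mu_exp, mu_lgn, sigma_lgn); this column reading is an assumption.
features = {
    "C": (0.4909, -10.2121, 0.7025),
    "E": (0.9501, -9.7881, 1.0061),
    "G": (0.9637, -9.7784, 1.0004),
    "K": (0.8312, -9.8730, 0.9426),
    "M": (0.5359, -10.1366, 0.7115),
    "P": (0.7906, -9.9166, 0.9299),
    "R": (0.8715, -9.8420, 0.9577),
    "T": (0.8632, -9.8643, 0.9701),
}

def kmeans(points, centers, iterations=50):
    """Plain k-means; returns one cluster index per point."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Move every center to the mean of its members.
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    return labels

names = sorted(features)
points = [features[n] for n in names]
# Deterministic start: the two Sections that lie farthest apart.
start = max(combinations(points, 2), key=lambda pair: math.dist(*pair))
labels = dict(zip(names, kmeans(points, start)))
print(labels)
```

With these inputs the sketch places Sections C and M in one cluster and the remaining six Sections in the other. Whether this grouping coincides with Cluster 1 and Cluster 2 of Fig. 12 cannot be confirmed from the text.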