In the context of university education, much effort has been directed towards providing analytical tools to educators and institutions. For example, Zimmermann et al. [18] predict graduate performance based on the students' undergraduate performance. Saarela and Kärkkäinen [11] analyse undergraduate student data to identify relevant factors for a successful computer science education.

4.1 Course Dependency Graph
The Course Dependency Graph is a graph whose node set equals the set of all (regularly or irregularly offered) courses. A directed edge from course A to course B means that when a student passes A before taking B, the chance of obtaining a better grade in B is higher compared to the grade in B obtained for the order B before A.
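How such an edge can be derived from historical grade records is sketched below; the record format, the passing threshold of 4.0 on a 1.0–5.0 grade scale, and the min_support cutoff are illustrative assumptions rather than details from the paper.

```python
from collections import defaultdict

# Hypothetical record format: (student, course, semester, grade),
# with grades from 1.0 (best) to 5.0 (fail) and passing <= 4.0.
def dependency_edges(records, min_support=30):
    """Sketch: estimate Course Dependency Graph edges from grade records.

    An edge A -> B is created when students who passed A before taking B
    obtained better (lower) grades in B, on average, than students who
    took B before A.
    """
    taken = defaultdict(dict)  # student -> {course: (semester, grade)}
    for student, course, semester, grade in records:
        taken[student][course] = (semester, grade)

    before = defaultdict(list)  # (A, B) -> grades in B when A was passed first
    after = defaultdict(list)   # (A, B) -> grades in B when B came first
    for courses in taken.values():
        for a, (sem_a, grade_a) in courses.items():
            for b, (sem_b, grade_b) in courses.items():
                if a == b or sem_a == sem_b:
                    continue
                if sem_a < sem_b and grade_a <= 4.0:  # passed A before B
                    before[(a, b)].append(grade_b)
                elif sem_b < sem_a:                   # order B before A
                    after[(a, b)].append(grade_b)

    edges = []
    for pair, grades_ab in before.items():
        grades_ba = after.get(pair, [])
        if len(grades_ab) >= min_support and len(grades_ba) >= min_support:
            # Lower mean grade in B for the order A-before-B => edge A -> B.
            if sum(grades_ab) / len(grades_ab) < sum(grades_ba) / len(grades_ba):
                edges.append(pair)
    return edges
```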
4.2 Grade prediction
We use a collaborative filtering [10] approach to predict student performance. One advantage of this approach is that no imputation of missing entries is necessary; the optimization runs only over existing entries. Throughout, S denotes the set of all students and C the set of all courses.
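The following minimal sketch illustrates this property: the loss is accumulated over observed (student, course, grade, semester) tuples only, with no imputation of missing grades. The inner-product model, the exponential form of the time weighting, and the L2 regularization term are assumptions for illustration; the exact loss of Eq. (1) is not reproduced in this excerpt.

```python
import numpy as np

def loss(s, c, entries, lam=0.1, alpha=0.1, t_now=10):
    """Sketch of a loss evaluated over observed entries only.

    s[i], c[j] are latent vectors; entries is a list of observed
    (student i, course j, grade g, semester t) tuples.
    """
    total = 0.0
    for i, j, g, t in entries:
        w = np.exp(-alpha * (t_now - t))      # assumed importance decay
        total += w * (g - s[i] @ c[j]) ** 2   # error on observed entries only
    total += lam * (np.sum(s ** 2) + np.sum(c ** 2))  # assumed L2 regularization
    return total
```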
4.2.1 Contextual Information
The above loss metric only depends on information about the students' performances, i.e. their grades. However, the context of a performance can contain vital information. Usually, in the context of student records, a wealth of data is readily available. This includes meta-data of a student, such as age, gender, and nationality, as well as data regarding the progression of the student throughout study programs. Moreover, information regarding the course, such as the lecturer, is typically known.

A standard and straightforward approach to include such information is to pre-filter the data [10]. This entails partitioning the data and training only on a relevant subset, e.g. the students of a single study program.

Alternatively, the context can be encoded in the model itself: each course vector is extended by context-dependent components, where ctx is the performance context according to the above feature selection pipeline. Consequently, the parameter vector for course j becomes

    c̃_j = (c_{j,1}, c_{j,2}, …, c_{j,n}, c^ctx_{j,1}, …, c^ctx_{j,m_j})
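As a minimal sketch, the extended parameter vector is simply the concatenation of the base course parameters and the context parameters; the feature selection pipeline that produces the context components is assumed given and is not reproduced here.

```python
import numpy as np

def extended_course_vector(base_j, ctx_j):
    """Sketch: c~_j = (c_{j,1}, ..., c_{j,n}, c^ctx_{j,1}, ..., c^ctx_{j,m_j}).

    base_j holds the n context-free parameters of course j; ctx_j holds
    the m_j context parameters from the feature selection pipeline.
    """
    return np.concatenate([base_j, ctx_j])  # length n + m_j
```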
4.2.2 Minimization
The non-linear minimization problem in Eq. (1) is of high dimensionality because of the parameter vectors s_i and c_j for i ∈ S, j ∈ C. It is most effectively solved using stochastic gradient descent techniques with adaptive learning rates, because with this approach the course vectors stabilize more quickly. Specifically, we used the Adagrad algorithm [4], which avoids strong alteration of frequently considered parameters, as is the case for many course parameters, while seldomly encountered parameters, as is typical for student parameters, may be altered more. We fixed a batch size of 1000 and performed 500,000 iterations of the algorithm. Each minimization is performed for 5 different initial random values, and the result with the smallest training loss is selected. This was performed for all dimensionality and regularization parameters, i.e. for parameter tuples (λ, n). Before minimization, the data was normalized along the lectures to zero mean and unit variance.
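A sketch of this procedure is given below: mini-batch gradient steps with Adagrad's per-parameter adaptive learning rates, repeated for several random initializations, keeping the solution with the smallest training loss. The gradient assumes a plain regularized squared-error factorization loss rather than the exact Eq. (1); the learning rate and initialization scale are illustrative.

```python
import numpy as np

def adagrad_fit(entries, n_students, n_courses, n=1, lam=0.1,
                lr=0.1, batch=1000, iters=500_000, restarts=5,
                eps=1e-8, seed=0):
    """Sketch of the minimization with Adagrad and random restarts.

    entries: list of (student i, course j, grade g) tuples. The batch
    size, iteration count, and restart count follow the paper; reduce
    them for experimentation.
    """
    rng = np.random.default_rng(seed)
    data = np.array(entries, dtype=float)
    best, best_loss = None, np.inf
    for _ in range(restarts):
        s = 0.1 * rng.standard_normal((n_students, n))
        c = 0.1 * rng.standard_normal((n_courses, n))
        gs, gc = np.zeros_like(s), np.zeros_like(c)  # squared-gradient accumulators
        for _ in range(iters):
            rows = data[rng.integers(len(data), size=batch)]
            i = rows[:, 0].astype(int)
            j = rows[:, 1].astype(int)
            g = rows[:, 2]
            err = (s[i] * c[j]).sum(axis=1) - g
            grad_s = 2 * err[:, None] * c[j] + 2 * lam * s[i]
            grad_c = 2 * err[:, None] * s[i] + 2 * lam * c[j]
            # Adagrad: per-parameter step sizes shrink with accumulated
            # squared gradients, so frequently updated course parameters
            # change less while rarely seen student parameters change more.
            np.add.at(gs, i, grad_s ** 2)
            np.add.at(gc, j, grad_c ** 2)
            np.add.at(s, i, -lr * grad_s / (np.sqrt(gs[i]) + eps))
            np.add.at(c, j, -lr * grad_c / (np.sqrt(gc[j]) + eps))
        ii, jj = data[:, 0].astype(int), data[:, 1].astype(int)
        train_loss = ((np.einsum('ij,ij->i', s[ii], c[jj]) - data[:, 2]) ** 2).sum()
        if train_loss < best_loss:  # keep the best of the random restarts
            best, best_loss = (s, c), train_loss
    return best
```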
4.2.3 Evaluation
The most natural approach to evaluate the model is to split the data by semesters. Given a fixed semester t, the data up to and including semester t − 1, i.e. G_{t−1}, is used as a training set. The data of semester t, i.e. G_t \ G_{t−1}, is used as a test set.
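Assuming each grade record carries its semester, the split reads as follows.

```python
def split_by_semester(entries, t):
    # entries: (student, course, grade, semester) tuples as above
    train = [e for e in entries if e[3] < t]    # G_{t-1}: up to semester t-1
    test = [e for e in entries if e[3] == t]    # G_t \ G_{t-1}: semester t only
    return train, test
```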
The measures of quality we use are the mean absolute error (MAE) and the root mean square error (RMSE). As a baseline, we provide the RMSE and MAE for the mean predictor with respect to both the students and the courses in Table 1.
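A minimal sketch of these baselines: the mean predictor averages the grades of a student (axis 0) or of a course (axis 1) in the training set, and MAE and RMSE are then computed on the test set. The fallback to the global mean for unseen students or courses is an assumption.

```python
import numpy as np
from collections import defaultdict

def mean_predictor(train, axis=0):
    """Baseline: predict the mean grade of the student (axis=0) or of
    the course (axis=1), falling back to the global training mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for e in train:
        sums[e[axis]] += e[2]
        counts[e[axis]] += 1
    global_mean = sum(sums.values()) / max(sum(counts.values()), 1)
    return lambda key: sums[key] / counts[key] if counts[key] else global_mean

def mae_rmse(test, predict, axis=0):
    err = np.array([predict(e[axis]) - e[2] for e in test])
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```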
In the evaluation of the context-free model, we see that low-dimensional models (i.e. models with only few features) perform best. The absolute values of these errors are further improved by pre-filtering the data considered. If, for example, only Bachelor computer science students are considered, the test error decreases. The decay factor leads to a further improvement: for example, for n = 1 and λ = 0.1 the MAE decreases from 0.856 to 0.852. In Figure 2 the prediction results for the importance decay α = 0.1 are shown. Given this loss function, the one-dimensional, less regularized model outperforms the others in terms of both the MAE and the RMSE. The inclusion of contextual information leads to a further reduction, such that for n = 1 and λ = 0.1 the MAE is 0.8459, while the RMSE is 1.0904.

[Figure 2: The MAE (a) and RMSE (b) for different dimensionalities n and regularization parameters λ. The models were trained and tested on Bachelor CS students only. The loss is weighted by time with α = 0.1. Readable values for λ = 0.05, 0.1, 0.15, 0.2: MAE (a) 0.849, 0.852, 0.855, 0.855 for n = 1 and 0.906, 0.881, 0.876, 0.873 for n = 2; RMSE (b) 1.105, 1.105, 1.106, 1.098 for n = 1 and 1.173, 1.133, 1.120, 1.110 for n = 2.]
Table 1: The RMSE and MAE for the mean predictors along the student and the course axis, respectively.

              MAE      RMSE
    course    1.1130   1.3311
    student   0.9268   1.1883
5. RECOMMENDATION SYNTHESIS
The recommendation combines the course dependency graph, the grade prediction, and constraints based on the study regulations in order to compute a recommendation score. A larger score corresponds to a stronger recommendation.

For students at the very beginning of their studies, courses with a higher out-degree in the course dependency graph provide a good starting point. Further note that for such students R_i = ∅, and the grade prediction can only give average values, as no information about their previous performance is available.

To incorporate information about the predicted performance, we transform the predicted grades ĝ_{i,j} such that good grades map to large values and poor grades to small values, i.e., we consider the value (5 − ĝ_{i,j})/4 ∈ [0, 1]; on the grade scale from 1.0 (best) to 5.0 (worst), the best grade thus maps to 1 and the worst to 0.

We parameterize these factors into a linear model that gives us a raw, unfiltered recommendation value

    r′_{i,j} = c_p p_{i,j} + c_g (5 − ĝ_{i,j})/4 + c_m deg⁺(j),    (2)

where c_p, c_g, c_m ∈ [0, 1] provide a weighting for the three factors, i.e., c_p + c_g + c_m = 1.
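A direct transcription of Eq. (2) as a sketch; the example weights c_p, c_g, c_m are illustrative, as is the name p_ij for the dependency-graph factor.

```python
def raw_recommendation(p_ij, g_hat, out_degree, c_p=0.4, c_g=0.4, c_m=0.2):
    """Sketch of Eq. (2): raw recommendation value r'_{i,j}.

    p_ij: dependency-graph factor for student i and course j;
    g_hat: predicted grade; out_degree: deg+(j) in the Course
    Dependency Graph. The weights must sum to 1.
    """
    assert abs(c_p + c_g + c_m - 1.0) < 1e-9
    return c_p * p_ij + c_g * (5.0 - g_hat) / 4.0 + c_m * out_degree
```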