Hybrid Machine Learning Algorithms for Predicting Academic Performance
Abstract—The large volume and complexity of data in educational institutions call for support from information technologies. To facilitate this task, many researchers have focused on using machine learning to extract knowledge from educational databases and thereby help students and instructors achieve better performance. In prediction models, the challenging task is to choose effective techniques that produce satisfactory predictive accuracy. Hence, in this work we introduce a hybrid approach that couples principal component analysis (PCA) with four machine learning (ML) algorithms: random forest (RF), the C5.0 decision tree (DT), naïve Bayes (NB), and the support vector machine (SVM), to improve classification performance by addressing the misclassification problem. Three datasets were used to confirm the robustness of the proposed models. On these datasets, we evaluated classification accuracy and root mean square error (RMSE) as evaluation metrics. For this classification problem, 10-fold cross-validation was used to assess predictive performance. The proposed hybrid models produced very accurate predictions, showing themselves to be strong prediction and classification algorithms.

Keywords—Student performance; machine learning algorithms; k-fold cross-validation; principal component analysis

I. INTRODUCTION

The poor performance of students in high school has become a worrying issue for educators, as it affects the secondary national exam and the step to higher education. Mathematics is considered the basic background for many science subjects and strongly affects both the national exam and further study in higher education [1]. For example, students who are poor in mathematics are much more likely to fail the diploma national exams in Cambodia [2]. They later find it harder to choose a major for higher study and to survive the university journey. Early prediction and classification of student performance levels offers an early warning and a recipe for improving the performance of weak students, as well as for other managerial settings. Hence, we aim to uncover the hidden behavior patterns of students that affect their performance. Various factors affect the performance of students in mathematics; they consist of schooling factors, domestic or home factors, and personal or individual factors. These related factors were used as predictive features for predicting students' achievement in mathematics.

In the age of the information revolution, analysis of databases in educational environments through learning analytics, predictive analytics, educational data mining, and machine learning techniques has become a hot area of research [3-5]. Supervised learning has been used to predict and classify students' performance and to analyze their learning behaviors in order to follow up on their progress in class. However, the challenging task is to find the optimal algorithm that produces satisfactory results. Machine learning algorithms such as naïve Bayes, logistic regression, artificial neural networks, decision trees, random forests, support vector machines, k-nearest neighbors, and others have been widely used to analyze and predict academic performance [3-14]. The performance of each model varies from dataset to dataset, depending on the characteristics and quality of the data.

In classification problems, one cause of misclassification that degrades model performance is poor data quality, which disturbs the algorithms. Much of the literature has focused on dimensionality reduction (feature selection and feature extraction methods) to improve prediction and classification performance. In our work, we applied principal component analysis (PCA) as a feature extraction technique to transform the original dataset into a new dataset of higher quality. We also introduced 10-fold cross-validation to evaluate the predictive performance of the models and to judge how they perform on new data, i.e., the testing samples or test data.

This paper proposes a novel hybrid machine learning approach for solving the classification problem. The proposed hybrid approach combines four baseline machine learning algorithms with 10-fold cross-validation and principal component analysis.

II. RELATED WORKS

Supervised learning in machine learning requires an effective prediction model for solving prediction and classification problems. As mentioned in the Introduction, the educational data mining (EDM) field has studied a range of machine learning techniques to determine which of them achieve high accuracy in predicting the future performance of students [3-5].

Table I summarizes the popular and state-of-the-art classification algorithms used to predict student performance on educational datasets. Several works have investigated which algorithms best predict future performance.
TABLE. I. SUMMARY OF COMMON MACHINE LEARNING CLASSIFIERS USED IN PREDICTING STUDENT PERFORMANCE

Ref. | Main Results
[6] | (i) C4.5 and RandomTree were proposed. (ii) C4.5 produced the highest accuracy.
[7] | (i) Six classifiers were compared: decision tree (DT), random forest (RF), artificial neural network (ANN), naïve Bayes (NB), logistic regression (LR), and the generalized linear model (GLM). (ii) RF was found to be the best classifier.
[8] | (i) C4.5, NB, 3-nearest neighbor (3-NN), backpropagation (BP), sequential minimal optimization (SMO), and LR were proposed. (ii) NB produced the highest classification result.
[9] | (i) Three tree-based classifiers were used: J48, Random Tree, and REPTree. (ii) J48 was found to be the best prediction model.
[10] | (i) NB, support vector machine (SVM), C4.5, and CART were used to build the learning model. (ii) SVM was the best model compared to NB, C4.5, and CART.
[11] | (i) RF, multilayer perceptron (MLP), and ANN were used to classify student performance. (ii) RF generated the highest accuracy.
[12] | (i) J48, CART, and RF classifiers were proposed with principal component analysis (PCA). (ii) PCA-RF generated the highest accuracy.
[13] | (i) MLP, radial basis function (RBF) networks, SMO, J48, and NB were combined with PCA. (ii) PCA-NB generated the highest accuracy.
[14] | (i) Three boosting algorithms (C5.0, AdaBoost M1, and AdaBoost SAMME) were proposed. (ii) C5.0 outperformed the other two boosting models.

III. MACHINE LEARNING ALGORITHMS

We propose hybrid models that couple machine learning algorithms with principal component analysis. We first build the baseline models. We then improve the performance of the baseline models with k-fold cross-validation. Lastly, we construct the hybrid machine learning models by adding principal component analysis, as illustrated in Fig. 1.

Fig. 1. Illustration of Task Procedure.

A. The Baseline Models

Numerous effective machine learning approaches have been extensively applied in educational environments. For different purposes in educational settings, we can draw on different machine learning techniques such as association rule mining, regression analysis, classification, and clustering [3]. Classification is a common machine learning technique used to classify and predict the categories or predefined classes of target variables. In this work, we surveyed several machine learning classifiers and selected the four state-of-the-art methods that are widely used for predicting academic performance [3-14]. The four proposed algorithms are the support vector machine, naïve Bayes, the C5.0 decision tree, and random forest.

1) Support vector machine: A support vector machine (SVM) is a classification algorithm built around a separating hyperplane [15]. The concept of SVM is to create a line or a hyperplane that separates the samples into classes. SVM seeks the optimal hypersurface that separates each pair of data classes. When the data is more complex, it is mapped into a higher-dimensional space in which a linear separation becomes possible.

Given training samples $(x_i, y_i)$, $i = 1, 2, \ldots, m$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ are the target classes, the classical SVM classifier solves the optimization problem:

$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$
subject to: $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $\forall i$, (1)

where $\phi(x)$ maps $x$ into a higher-dimensional space in the nonlinear case. The parameters $w$, $b$, and $\xi_i$ represent the weight, bias, and slack variable, respectively. The optimal hyperplane can be found by forming the Lagrangian and transforming the problem into the quadratic problem over $W(\alpha)$ in (2):

$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to: $\sum_{i=1}^{m} \alpha_i y_i = 0$; $\alpha_i \in [0, C]$, $i = 1, 2, \ldots, m$, (2)

where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ is the set of Lagrange multipliers.

The decision function can be written as:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right)$. (3)

Different kernel functions help the SVM maximize the margin between hyperplanes and reach the optimal solution. The most popular kernels are the polynomial function, the sigmoid function, and the radial basis function. The SVM with a radial basis function (RBF) kernel is one of the most commonly used settings for multi-class classification, since it requires fewer parameters than the polynomial kernel. Consequently, the RBF kernel is an appropriate choice, and this work applies it to obtain the optimal solution.
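As an illustration, the following is a minimal sketch of such an RBF-kernel SVM in Python with scikit-learn; the library choice, the synthetic data, and the hyperparameter values (C, gamma) are our assumptions for illustration, not settings reported here.

    # Minimal sketch of an RBF-kernel SVM classifier (problem (1) with kernel trick (2)).
    # Assumes scikit-learn; the data, C, and gamma values are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Stand-in for a student-performance dataset: 43 features, 4 performance levels.
    X, y = make_classification(n_samples=1000, n_features=43, n_informative=12,
                               n_classes=4, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    scaler = StandardScaler().fit(X_tr)            # SVMs are sensitive to feature scale
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # K(x, x') = exp(-gamma * ||x - x'||^2)
    clf.fit(scaler.transform(X_tr), y_tr)
    print("Test accuracy:", clf.score(scaler.transform(X_te), y_te))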
B. K-Fold Cross-Validation

In k-fold cross-validation, the dataset is split into k folds, and each fold in turn serves as the testing set:

1) Split the dataset into k folds.
2) Take one fold as the testing set.
3) Take the remaining k-1 folds as the training set; fit the model on the training set, evaluate it on the testing set, then retain the evaluation score and discard the model.
4) Repeat the iteration until every single fold has been treated as the testing set. Finally, compute the average of the recorded scores.

In our study, we chose 10-fold cross-validation (hereafter 10-CV) to assess our proposed algorithms. This process is illustrated in Fig. 3.
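As a sketch, 10-CV for one of the baseline models might look as follows in scikit-learn; the estimator, scoring choice, and synthetic data are illustrative assumptions on our part.

    # Minimal sketch of 10-fold cross-validation (10-CV) for a baseline classifier.
    # Assumes scikit-learn; the random forest settings are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=43, n_informative=12,
                               n_classes=4, random_state=0)

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0)

    # Each of the 10 folds serves once as the testing set; scores are then averaged.
    scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
    print("Mean 10-CV accuracy:", scores.mean())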
C. The Proposed Hybrid Models

The major task in supervised machine learning is classification, and the classification problem is a hot issue in data mining and machine learning. We proposed four popular classifiers that hold many merits. However, the major problems for these classifiers are overfitting and noisy data, which lead to misclassification and reduce classification accuracy. To overcome this, we try to remove the irrelevant and non-correlated features that disturb the classification process. Data analysis also requires more computational resources and time when the data volume is large. Hence, a feature extraction approach that removes noise from the data can reduce time and resource usage while recovering high-quality data. Dimensionality reduction can improve accuracy and boost performance when combined with classification techniques. Using higher-quality data together with feature reduction is one of the effective ways to improve the performance of machine learning models. The four proposed models, namely the support vector machine with a radial basis function kernel (SVMRBF), naïve Bayes (NB), the C5.0 decision tree, and random forest (RF), are effective algorithms for the classification problem, yet there is no perfect algorithm in machine learning.

SVM separates data into classes using hyperplanes defined by support vectors. For a high-dimensional dataset, the input space is large and can be unclean, which usually degrades the performance of the SVM algorithm. It therefore needs an effective feature extraction method that discards noisy, irrelevant, and redundant data while still retaining the useful information in the data. Removing such features can increase both search speed and accuracy.

NB is a classifier with many advantages, yet its greatest weakness is its reliance on the often-faulty assumption of equally important and independent features. If any feature is irrelevant to some class $C_k$, the whole probability for that class goes to zero because of the product in equation (5), which leads to misclassification, as the sketch below illustrates. To solve this problem, feature extraction is the best tool: it reduces irrelevant features and thereby improves classification performance.
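A toy illustration of this failure mode (our own example, not from the source; the probability values are made up):

    # In naive Bayes the class score is a product of per-feature likelihoods
    # (cf. equation (5)), so a single zero-probability feature wipes out the class.
    likelihoods_ck = [0.30, 0.25, 0.0, 0.40]   # P(feature_j | C_k); one never observed
    prior_ck = 0.25                            # P(C_k)

    score = prior_ck
    for p in likelihoods_ck:
        score *= p
    print(score)   # 0.0 -> class C_k can never win, causing misclassification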
In the tree-based algorithms C5.0 and RF, the major problem in the splitting process of the decision tree is overfitting. Overfitting is caused by noisy data and irrelevant features that produce misclassification; in turn, it lowers the accuracy of tree-based classifiers. To reduce high-dimensional data containing noisy and irrelevant features, a commonly used technique is feature extraction, which yields a lower-dimensional input space containing more relevant and informative features.

To improve the performance of the proposed machine learning algorithms, we adopted a commonly used feature extraction approach: principal component analysis (PCA). PCA is a statistical method that transforms an original dataset into a new dataset of lower dimension. The original dataset, which may consist of correlated variables, is converted into a set of linearly uncorrelated variables.

PCA is one of the most popular dimensionality reduction algorithms [17]. In the PCA procedure, the data is first standardized to zero mean. The covariance matrix is then computed to obtain the eigenvectors and eigenvalues. The eigenvector with the highest eigenvalue is treated as the first principal component of the new data, capturing the most significant relationships among the input features. PCA is less sensitive to different datasets than other holistic methods, which makes it one of the most widely used and effective feature reduction methods.
The procedure for transforming an original dataset $X$ of dimension $l$, consisting of possibly correlated features, into a new dataset $Z$ of lower dimension $m$ $(m \le l)$, consisting of linearly uncorrelated features, is as follows:

1) Compute the mean: From the already processed data, first find the mean of each attribute:

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$ (10)

2) Compute the variance: To measure the deviation of each feature in the dataset, compute the variance using equation (11):

$\mathrm{Var}(X) = \sigma_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$ (11)

3) Compute the covariance: Given two variables, denoted $X$ and $Y$, the covariance is calculated using equation (12):

$\mathrm{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)$ (12)

$\mathrm{Cov}(X, Y) = 0$ means that the two attributes $X$ and $Y$ are uncorrelated. Using equations (11) and (12), we obtain the covariance matrix $S$, in which the entry $s_{ij}$, $i \ne j$, is the covariance between the $i$th and $j$th variables, and the diagonal entry $s_{ii}$ is the variance of the $i$th variable.

4) Compute the eigenvalues and eigenvectors: The features of the new dataset are characterized by the eigenvectors and eigenvalues of $S$. The eigenvectors give the directions of the new feature space, while the eigenvalues give their magnitudes. The eigenvalues are obtained by solving:

$\det(S - \lambda I) = 0$, (13)

where the covariance matrix $S$ is symmetric, $\lambda$ is an eigenvalue of $S$, and $I$ is the identity matrix. The eigenvector $v$ corresponding to each eigenvalue is computed via:

$(S - \lambda I)v = 0$ (14)
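A compact sketch of this procedure in Python with NumPy follows; the toy data matrix and the number of retained components are illustrative assumptions.

    # Minimal sketch of PCA via the covariance matrix, following steps 1)-4).
    # Assumes NumPy; the data and the number of components m are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))        # toy dataset: n=200 samples, l=10 features

    Xc = X - X.mean(axis=0)               # step 1: center each attribute (zero mean)
    S = np.cov(Xc, rowvar=False)          # steps 2-3: covariance matrix S (l x l)

    eigvals, eigvecs = np.linalg.eigh(S)  # step 4: eigen-decomposition (S is symmetric)
    order = np.argsort(eigvals)[::-1]     # sort components by decreasing eigenvalue
    m = 3                                 # keep the m leading principal components
    W = eigvecs[:, order[:m]]

    Z = Xc @ W                            # new dataset Z of lower dimension m
    print(Z.shape)                        # (200, 3)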
The proposed hybrid models, formed by coupling the machine learning models with PCA, are introduced for predicting and classifying academic performance. The main benefits of PCA are summarized as follows:

a) It removes heavy noise and uncorrelated features from the collected dataset in the preprocessing step.

b) It reduces high-dimensional data to a lower-dimensional representation that retains the important characteristics of the data, which reduces overfitting.

c) It enhances the quality of the features by removing correlated features, which effectively improves classification performance.

In this research, we propose hybrid models built by coupling the four baseline models (SVMRBF, NB, C5.0, and RF) with 10-fold cross-validation (10-CV) and principal component analysis (PCA), as sketched below.
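The following is a minimal sketch, under our own assumptions, of one such hybrid model (PCA combined with SVMRBF, evaluated with 10-CV) in scikit-learn; the number of components and all hyperparameters are illustrative.

    # Minimal sketch of a hybrid model: PCA + RBF-kernel SVM, evaluated with 10-CV.
    # Assumes scikit-learn; n_components and the SVC settings are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=43, n_informative=12,
                               n_classes=4, random_state=0)

    hybrid = Pipeline([
        ("scale", StandardScaler()),        # standardize to zero mean before PCA
        ("pca", PCA(n_components=0.95)),    # keep components explaining 95% of variance
        ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
    ])

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(hybrid, X, y, cv=cv, scoring="accuracy")
    print("Hybrid PCA-SVMRBF 10-CV accuracy:", scores.mean())

Placing PCA inside the pipeline means it is refit on each training split, so no information from the testing folds leaks into the transformation.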
IV. DATASETS AND PREPROCESSING

A. Datasets

In our study, we tried to collect all the hidden features affecting student performance in mathematics. Each dataset contains 43 features describing the learning behavior of each student and one target variable describing the student's performance level based on their score. The predictive features are drawn from three main influencing factors. These factors comprise the forty-three variables whose descriptions are shown in Table II; Table III describes the predefined classes of the target variable.

To confirm the robustness and effectiveness of our proposed algorithms, we used three datasets. The first two are generated datasets, namely GDS1 (2000 samples) and GDS2 (4000 samples), constructed from proposed relationships between the predictive features and the output variable as stated in [18-20]. The third dataset is an actual dataset collected from 22 high schools in Cambodia. The data collection was done using questionnaires, in which students were asked to provide demographic information related to external effects such as domestic factors, individual or student factors, and school factors. Students' mathematics scores for semester I were obtained from the administrative office of each school. This dataset, named ADS3, consists of 1204 samples.
TABLE. II. THE FACTORS AFFECTING STUDENT PERFORMANCE IN MATHEMATICS

N | Variable | Description | Type
Domestic Factors
1 | PEDU1 | Father's educational level | Nominal
2 | PEDU2 | Mother's educational level | Nominal
3 | POCC1 | Father's occupational status | Nominal
4 | POCC2 | Mother's occupational status | Nominal
5 | PSES | Family's socioeconomic status | Ordinal
6 | PI1 | Parents' attention to students' attitude | Ordinal
7 | PI2 | Parents' time and money spending | Ordinal
8 | PI3 | Parents' involvement in education | Ordinal
9 | PS1 | Parents' feeling responsive and needed | Ordinal
10 | PS2 | Parents' response to children's attitude | Ordinal
11 | PS3 | Parents' encouragement | Ordinal
12 | PS4 | Parents' compliments | Ordinal
13 | DE1 | Domestic environment for study | Ordinal
14 | DE2 | Distance from home to school | Nominal
Student or Individual Factors
15 | SELD1 | Number of hours for self-study | Nominal
16 | SELD2 | Number of hours for private math study | Ordinal
17 | SELD3 | Frequency of doing math homework | Ordinal
18 | SELD4 | Frequency of absence in math class | Ordinal
19 | SELD5 | Frequency of preparing for the math exam | Ordinal
20 | SIM1 | Student's interest in math | Ordinal
21 | SIM2 | Student's enjoyment in math class | Ordinal
22 | SIM3 | Student's attention in math class | Ordinal
23 | SIM4 | Student's motivation to succeed in math | Ordinal
24 | ANXI1 | Student's anxiety in math class | Ordinal
25 | ANXI2 | Student's nervousness in the math exam | Ordinal
26 | ANXI3 | Student's feeling of helplessness in math | Ordinal
27 | POSS1 | Internet use at home | Binary
28 | POSS2 | Possession of a computer | Binary
29 | POSS3 | Student's study desk at home | Binary
School Factors
30 | CENV1 | Classroom environment | Ordinal
31 | CU1 | Content's language in math class | Nominal
32 | CU2 | Class session | Nominal
33 | TMP1 | Teacher's mastery in math class | Ordinal
34 | TMP2 | Teacher's absence in math class | Ordinal
35 | TMP3 | Teaching methods in math class | Ordinal
36 | TMP4 | Teacher's involvement in education content | Ordinal
37 | TAC1 | Math teacher's ability | Ordinal
38 | TAC2 | Teacher's encouragement of students | Ordinal
39 | TAC3 | Math teacher's connection with students | Ordinal
40 | TAC4 | Math teacher's help | Ordinal

B. Preprocessing Tasks

Data preprocessing is an integral step in data mining, used to transform the raw dataset into a clean, executable format ready for implementation. The preprocessing step not only ensures that the data is suitable and ready for modeling but also improves the performance of the models. The preprocessing tasks in this study include data cleaning (cleansing), data transformation, and data discretization. During data collection, some questionnaires were returned with missing answers or invalid values (outliers). The number of missing values in our datasets is low, so we cleaned the data by imputation: missing values in the categorical variables were replaced by the mode, i.e., the highest-frequency category. The output variable has a few missing values and outliers, which we replaced by the mean value. For simplicity, we transformed some numerical features into ordinal types. We also discretized the output variable into four performance levels, as shown in Table III. A brief sketch of these operations follows.
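As a sketch of these preprocessing operations in Python with pandas (our assumption; the column names and cut points are hypothetical):

    # Minimal sketch of the preprocessing steps: mode/mean imputation and
    # discretizing scores into four performance levels. Assumes pandas; the
    # column names ("PEDU1", "score") and the cut points are illustrative.
    import pandas as pd

    df = pd.DataFrame({"PEDU1": ["primary", None, "secondary", "primary"],
                       "score": [72.5, None, 45.0, 88.0]})

    # Categorical features: impute with the mode (highest-frequency category).
    df["PEDU1"] = df["PEDU1"].fillna(df["PEDU1"].mode()[0])

    # Output variable: impute missing values with the mean.
    df["score"] = df["score"].fillna(df["score"].mean())

    # Discretize scores into four ordinal performance levels (cut points assumed).
    df["level"] = pd.cut(df["score"], bins=[0, 50, 65, 80, 100],
                         labels=["slow", "average", "good", "excellent"])
    print(df)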
V. EVALUATION METRICS

The performance of each proposed model in analyzing and predicting student performance can be evaluated from the graphical confusion matrix. Without loss of generality, our output variable is categorized into four ordinal categories, as described in Table III. Table IV shows the graphical confusion matrix, which represents the four classes of student performance level in mathematics: Class 1 is the highest class, Class 2 the second upper class, Class 3 the third class, and Class 4 the lowest (poor) group of students. The following parameters are calculated from it.

A. Classification Accuracy

Accuracy quantifies the percentage of correct predictions. Here, we evaluate the potential of our prediction model by measuring the percentage of correctly predicted student performance levels, as in (15):

$\mathrm{Accuracy} = \frac{\sum_{i} a_{ii}}{\sum_{i,j} a_{ij}} \times 100\%$ (15)

B. Root Mean Square Error (RMSE)

We aim not only to predict students' performance levels but also to estimate how close our predictions are to the actual levels. We encoded the ordinal performance levels {slow, average, good, excellent} as {1, 2, 3, 4}, respectively. The RMSE is computed as:

$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (Pl_i^a - Pl_i^p)^2}$ (16)

where $Pl^a \in \{1, 2, 3, 4\}$ is the actual performance level and $Pl^p \in \{1, 2, 3, 4\}$ is the predicted performance level.
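A small sketch of both metrics computed from predictions (our assumption: scikit-learn and NumPy; the label vectors are toy values):

    # Minimal sketch of the two evaluation metrics: accuracy (15) and RMSE (16)
    # over the encoded levels {1, 2, 3, 4}. The label vectors are toy values.
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    y_true = np.array([1, 2, 3, 4, 4, 3, 2, 1])   # actual performance levels Pl^a
    y_pred = np.array([1, 2, 3, 3, 4, 3, 1, 1])   # predicted performance levels Pl^p

    cm = confusion_matrix(y_true, y_pred)            # confusion matrix (cf. Table IV)
    accuracy = cm.trace() / cm.sum() * 100           # equation (15): diagonal over total
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # equation (16)

    print("Accuracy (%):", accuracy)
    print("Same via sklearn:", accuracy_score(y_true, y_pred) * 100)
    print("RMSE:", rmse)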
TABLE. IV. GRAPHICAL CONFUSION MATRIX

TABLE. VI. PERFORMANCE OF BASELINE MODELS TO GDS2
From Table X, the NB accuracy improved rapidly from 65.44% to 90.66%. SVMRBF yielded around 4% better accuracy than the previous baseline SVMRBF. C5.0 and RF are tree-based classifiers that carry a high risk of overfitting; with 10-CV, we not only obtain better performance but also mitigate overfitting. By means of 10-CV, the accuracies of C5.0 and RF improved to 94.82% and 98.22%, gains of 18% and 9%, respectively.

C. Results of Proposed Hybrid Models

Our proposed hybrid models were constructed by combining the baseline models with a feature reduction approach, PCA. Feature extraction is a powerful companion to classification models, used to remove irrelevant or unrelated features. Dimensionality reduction via PCA [13] can serve as regularization that prevents overfitting and improves model accuracy. A common misconception is that PCA selects some features from the dataset and discards others; in fact, the algorithm constructs a new set of features from combinations of the old ones.

In this section, we evaluate the hybrid models, which combine the 10-CV setting of the previous section with PCA, to avoid overfitting and further improve predictive performance. Tables XI, XII, and XIII describe the results of the proposed models on the three datasets GDS1, GDS2, and ADS3, respectively.

We visualize the performance of the proposed models on the three datasets GDS1, GDS2, and ADS3 in Fig. 4, 5, and 6, respectively. In Fig. 4, for dataset GDS1, the proposed hybrid models boost the accuracy of SVMRBF from 75.01% to 83.88%, NB from 35.79% to 86.27%, C5.0 from 78.42% to 98.32%, and RF from 80.06% to 98.92%.

In Fig. 5, the hybrid models improved SVMRBF, NB, C5.0, and RF by about 20%, 23%, 12%, and 9% in accuracy, respectively. In Fig. 6, the proposed hybrid SVMRBF improved the classification accuracy from 86.44% to 97.01%. Classification with NB was about 30% better than the baseline NB. The accuracies of C5.0 and RF improved to 99.25% and 99.72% correctly classified.
TABLE. XI. PERFORMANCE OF BASELINE MODELS, BASELINE MODELS +10-CV, AND HYBRID MODELS TO GDS1
TABLE. XII. PERFORMANCE OF BASELINE MODELS, BASELINE MODELS+10-CV, AND HYBRID MODELS TO GDS2
TABLE. XIII. PERFORMANCE OF BASELINE MODELS, BASELINE MODELS+10-CV, AND HYBRID MODELS TO ADS3