Neema Mduma - Machine Learning Approach For Reducing
2019-05-06
Mduma, Neema
International Journal of Advanced Computer Research
http://dx.doi.org/10.19101/IJACR.2018.839045
The Nelson Mandela African Institution of Science and Technology
International Journal of Advanced Computer Research, Vol 9(42)
ISSN (Print): 2249-7277 ISSN (Online): 2277-7970
Research Article
Abstract
School dropout is widely recognized as a serious issue in developing countries, and machine learning techniques have gained much attention for addressing this problem. This paper presents a thorough analysis of four supervised learning classifiers, representing linear, ensemble, instance-based and neural network models, on the Uwezo Annual Learning Assessment datasets for Tanzania as a case study. The goal of the study is to provide data-driven algorithm recommendations to current researchers on the topic. Using three metrics (geometric mean, F-measure and adjusted geometric mean), we assessed and quantified the effect of different sampling techniques on the imbalanced dataset for model selection. We further indicate the significance of hyperparameter tuning in improving predictive performance. The results indicate that two classifiers, logistic regression and multilayer perceptron, achieve the highest performance when an over-sampling technique is employed. Furthermore, hyperparameter tuning improves each algorithm's performance compared to its baseline settings, and stacking these classifiers improves the overall predictive performance.
Keywords
Machine learning (ML), Imbalanced learning classification, Secondary education, Evaluation metrics.
Other approaches, such as time series clustering [16, 17], were presented to perform clustering, and are extensively used in recommender systems [3].

Machine learning techniques have been applied on various platforms, such as massive open on-line courses (MOOCs). MOOC platforms such as Coursera and edX are among the most popular platforms for student dropout prediction [9], as are learning management systems (LMS) such as Moodle [16]. In addressing the problem of student dropout, several existing works have focused on supervised learning algorithms such as naive Bayes, association rule mining, artificial neural network-based algorithms, logistic regression, CART, C4.5, J48, BayesNet, SimpleLogistic, JRip, RandomForest, logistic regression analysis and ICRM2 [6]. Among the classification techniques, the decision tree is widely used by researchers due to its simplicity and comprehensibility in uncovering small or large data structures and predicting values [2].

Other techniques, such as survival analysis, which provides various mechanisms to handle the censored data problems that arise in modeling longitudinal data and occur ubiquitously in real-world application domains, were also presented [13]. Ameri et al. developed a survival analysis framework for early prediction using the Cox proportional hazards model (Cox) and the time-dependent Cox (TD-Cox), which captures time-varying factors and can leverage this information to provide more accurate predictions of student dropout, using a dataset of students enrolled at Wayne State University (WSU) from 2002 until 2009 [7]. Besides, other researchers proposed a new data transformation model built upon the summarized data matrix of link-based cluster ensembles (LCE), using an educational dataset obtained from the operational database system at Mae Fah Luang University, Chiang Rai, Thailand. Like several existing dimension reduction techniques, such as principal component analysis (PCA) and kernel principal component analysis (KPCA), this method aims to achieve high classification accuracy by transforming the original data to a new form. However, the common limitation of these new techniques is their demanding time complexity, such that they may not scale up well to very large datasets. While the worst-case traversal time (WCT-T) is not suited to highly time-critical applications, it can be an attractive candidate for quality-led work, such as the identification of students at risk of underachievement [5]. Furthermore, matrix factorization, a clustering machine learning method that can accommodate frameworks with some variations, was presented [10]. In that study, two classes of methods for building the prediction models were described: the first class builds these models using linear regression approaches, and the second using matrix factorization approaches. Regression-based methods include course-specific regression (CSpR) and personalized linear multi-regression (PLMR), while matrix factorization-based methods build on the standard matrix factorization (MF) approach. These approaches were applied to datasets generated from George Mason University (GMU) transcript data, University of Minnesota (UMN) transcript data, UMN LMS data, and Stanford University MOOC data [11]. However, one limitation of the standard MF method is that it ignores the sequence in which students have taken their courses, so the latent representation of a course can be influenced by the performance of students in courses taken afterward.

In this paper, we present a thorough analysis of four commonly used machine learning algorithms on Uwezo data on learning (http://www.twaweza.org/go/uwezo-datasets) in Tanzania, with the aim of providing data-driven algorithm recommendations to current researchers on the topic. This is a publicly available nationwide dataset generated in a developing country, and it therefore reflects the local context. Using this new source of student-level data from Tanzania as a case study, we carry out a comprehensive validation and enhancement of existing algorithms and apply additional machine learning approaches to improve their predictive power. Specifically, we perform a detailed analysis of selected popular algorithms and analyse their performance on the dataset, first applying data pre-processing and feature engineering techniques, which are critical steps in building a high-performance dropout prediction algorithm. This was followed by a rigorous comparison of the selected machine learning algorithms using the evaluation methods proposed by [18], from which the best-performing algorithms were selected. Further, we empirically quantified the effect of hyperparameter tuning and of ensemble techniques for the selected algorithms, with the aim of further improving their performance.

In summary, the main objective of this study was to apply machine learning techniques for predicting student dropout. To attain this objective, the following three tasks were performed:
Building the models and analyzing their performance.
Tuning the models that performed well and employing an ensemble approach to improve the predictive performance.
Evaluating model performance using three metrics: geometric mean (Gm), F-measure (Fm) and adjusted geometric mean (AGm).

2. Materials and methods
2.1 Dataset descriptions and pre-processing
Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, and its product is the final training set. In selection, relevant target data are selected from retained data (typically very noisy) and subsequently pre-processed. This goes hand in hand with integration from multiple sources, filtering of irrelevant content and structuring of the data according to a target tool [19]. In developing a generalized algorithm, data pre-processing can often have a significant impact. Given the nature of datasets in many domains, it is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems.

In this paper, Uwezo data on learning at the country level in Tanzania, collected in 2015, was used. The dataset was collected by the Twaweza organization with the aim of assessing children's learning levels across hundreds of thousands of households in East Africa. It was cleaned by removing information that could allow end-users to locate individuals or specific villages; the village id column was removed, since it was not required at the experimental stage. Various approaches have been identified for handling missing values, outliers and numeric values [20]. In this study, we converted data samples to numerical values and performed PCA to handle outliers. Missing values were replaced using medians and zeros.

In this dataset, we identified the following columns with missing values, as described in Table 1: Pupil Teacher Ratio (PTR), Pupil Classroom Ratio (PCR), Girls' Pupil Latrines Ratio (GPLR), Boys' Pupil Latrines Ratio (BPLR), Parent Teacher Meeting Ratio (PTMR), Main source of household income (Income), Enumeration Area type (EAarea), Parent who checks his/her child's exercise book once a week (PCCB), Parent who discussed his/her child's progress with the teacher last term (PTD), Student who read a book with his/her parent in the last week (SPB), School has a girls' privacy room (SGR), and Household meals per day (MLPD). In handling missing values, PTR, PCR, GPLR and BPLR were imputed with medians, while PTMR, Income, EAarea, PCCB, PTD, SPB, SGR and MLPD were imputed with zeros. We encoded the nominal features to conform with Scikit-learn and coded the dropout variable as 1 for non-dropout and 0 for dropout.

The dataset consists of 18 features, as described in Table 2, and approximately 61,340 samples. Since our target variable is dropout, we checked its distribution in the dataset and observed an imbalance, with only 1.6% dropout, as shown in Figure 1. Data imbalance is a serious problem that should be addressed during the pre-processing stage [21]; it occurs when one class is under-represented relative to another [22, 23]. Classification of imbalanced datasets is common in the field of student retention, mainly because the number of registered students is always larger than the number of dropout students. Several re-sampling techniques, such as under-sampling, over-sampling and hybrid methods, can be applied to handle this problem [24].
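The median/zero imputation scheme described above can be sketched in pandas. The toy data frame and its values are our illustration; only the column names follow the text (see Table 1):

```python
# Median/zero imputation: ratio columns get their column median, the
# remaining (categorical) columns get zero. Toy values, real column names.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PTR": [40.0, np.nan, 55.0],     # Pupil Teacher Ratio
    "PCR": [np.nan, 60.0, 45.0],     # Pupil Classroom Ratio
    "Income": [2.0, np.nan, 1.0],    # Main source of household income (coded)
})

median_cols = ["PTR", "PCR"]         # ratio columns: fill with the median
zero_cols = ["Income"]               # categorical columns: fill with zero

df[median_cols] = df[median_cols].fillna(df[median_cols].median())
df[zero_cols] = df[zero_cols].fillna(0)
```

`fillna` with a Series (here, the per-column medians) aligns on column labels, so each ratio column is filled with its own median.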
Figure 1 Dropout distribution (bar charts; shown at full scale and zoomed in on the minority dropout class)
Under-sampling is a non-heuristic method that aims at creating a subset of the original dataset by eliminating instances until the remaining number of examples is roughly the same as that of the minority class [25, 26]. Over-sampling methods create a superset of the original dataset by replicating some instances, or creating new instances from existing ones, until the number of selected examples plus the original examples of the minority class is roughly equal to that of the majority class [27−29]. Hybrid methods such as the synthetic minority over-sampling technique (SMOTE) combine both under-sampling and over-sampling approaches [30, 31]. In this study, no sampling, over-sampling combined with under-sampling (SMOTE-ENN) and random under-sampling, as implemented in Imbalanced-Learn, were applied, as demonstrated in Figures 2 and 3. SMOTE-ENN combines over- and under-sampling, using SMOTE and the edited nearest neighbour (ENN) rule to generate more minority-class samples and reinforce their signal [32], while the random under-sampler is a fast and easy way to balance classes by randomly selecting a subset of data from the targeted classes [33].
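Random under-sampling, as described above, can be sketched without the library as a short NumPy routine. The helper `random_under_sample` and the toy data are our illustration, not code from the paper or from Imbalanced-Learn:

```python
# A dependency-free sketch of random under-sampling: drop majority-class
# rows at random until every class is as small as the minority class.
import numpy as np

def random_under_sample(X, y, random_state=None):
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                      # minority-class size
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Toy data mimicking the ~1.6% dropout rate reported above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.016).astype(int)
X_bal, y_bal = random_under_sample(X, y, random_state=0)
```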
Figure 2 and 3 Dropout distribution after applying the re-sampling techniques
The importance score is defined as the reduction in performance after shuffling the feature values: when an evaluation metric is used to measure the accuracy of the prediction, a higher value implies the feature is more important.

The results presented in Figure 4 show that student gender (Gender), PCCB, MLPD, SPB, PTD and student age (Age) have a strong contribution to the dropout prediction performance. Thereafter, the same experiment was repeated using the six best-performing features obtained in the previous experiment. The results clearly show that a student's gender contributes strongly to the dropout prediction performance, as presented in Figure 5. These experimental results support other researchers' findings on the association between dropout rate and gender [35].
Figure 4 Feature selection score (%) for all features
Figure 5 Feature selection score (%) for the best features (Gender, PCCB, MLPD)
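The permutation procedure described above (shuffle one feature's values and measure the drop in score) can be sketched as follows; the model, data and scores are illustrative stand-ins, not the paper's:

```python
# Permutation feature importance: shuffle one feature at a time and record
# the drop in accuracy; a larger drop means the feature matters more.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
base = model.score(X_te, y_te)        # accuracy before shuffling

rng = np.random.default_rng(0)
importances = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])         # destroy feature j's information
    importances.append(base - model.score(X_perm, y_te))
```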
of the test set, in order to observe how the model will behave in a real environment, which is imbalanced. The overall experimental procedure is summarized in Figure 6, wherein stratified k-fold cross-validation was used in each experiment. Here, k=5-fold cross-validation was used, and the entire process involved executing all selected classification algorithms, with all executions repeated 5 times using different train/test/validation partitions of the dataset. This cross-validation procedure divides the dataset into 5 roughly equal parts; for each part, it trains the model on the four remaining parts and computes the test error by classifying the held-out part. Finally, the results for the five test partitions were averaged.
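The stratified 5-fold procedure described above can be sketched with scikit-learn; the classifier and data are placeholders, not the paper's experimental setup:

```python
# Stratified 5-fold cross-validation: train on four parts, test on the
# held-out part, then average the five test scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
mean_score = np.mean(scores)          # average over the five test partitions
```

Stratification keeps the class ratio of the imbalanced target roughly equal across the five parts.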
2.4 Evaluation metrics
In measuring the performance of student dropout algorithms, several researchers use various evaluation metrics [1, 7, 8]. With respect to evaluation measures, we used Gm, Fm and AGm as the evaluation criteria. The Gm is a measure of the ability of a classifier to balance TPrate (sensitivity) and TNrate (specificity) [36], as defined in Equation 2; this measure is maximal when the true positive rate (TPrate) and the true negative rate (TNrate) are equal. Furthermore, in order to be more sensitive to changes in the positive predictive value (precision) than in TPrate, Fm was used, as defined in Equation 3; this is the weighted harmonic mean of the TPrate and precision [7, 37, 38]. Besides, AGm, as defined in Equation 4, was used to obtain the highest TPrate without decreasing the TNrate too much [18].

Gm = √(TPrate × TNrate) (2)

Fm = (2 × PPV × TPrate) / (PPV + TPrate) (3)

AGm = (Gm + TNrate × Nn) / (1 + Nn) if TPrate > 0, and AGm = 0 if TPrate = 0 (4)

Where:
TN is true negative, TP is true positive, FN is false negative and FP is false positive.
TPrate = TP / (TP + FN), the percentage of positive instances correctly classified.
TNrate = TN / (TN + FP), the percentage of negative instances correctly classified.
Positive predictive value (PPV) = TP / (TP + FP).
Nn is the proportion of negative (majority-class) examples in the dataset.

3. Results
3.1 Experiment 1: model selection
The aim of this experiment was to identify the classifier with the best performance for this problem. In this phase, classifiers were selected from all families, including linear, ensemble, instance-based and neural network classifiers, with consideration of the classification task and the nature of the dataset. Linear models were represented by a logistic regression classifier (LR), ensemble models by a random forest (RF), instance-based models by K-Nearest-Neighbors (KNN) and neural network models by a multilayer perceptron (MLP). The experiment was repeated for three different cases: no sampling, random under-sampling and over-sampling (SMOTE-ENN), and results for all three cases are presented. Results were plotted on separate graphs based on the scale of the evaluation metrics: for the Gm and Fm metrics, whose scale ranges between 0 and 1, the results were combined in the same graphs (Figures 7-9), while the AGm metric, whose scale extends above 1, was presented in separate graphs (Figures 10-12). To select the best classifiers, validation results were considered, because they give an estimate of how the classifier will perform on the actual dataset, which is imbalanced. From the results presented in Figures 7 and 10, two classifiers, LR and MLP, show better generalization; they show better validation results for all three metrics used. When under-sampling is used, as observed in Figures 8 and 11, all classifiers have roughly the same generalization results for these metrics. The experiment conducted without sampling, as presented in Figures 9 and 12, reveals that only the LR classifier performs better than the others; however, its AGm scores are lower than when LR is used with over-sampling. Therefore, for the next experiment the following two classifiers were considered with the over-sampling case: LR and MLP.
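The three metrics defined in Equations 2-4 can be computed directly from confusion-matrix counts. The sketch below follows our reading of those definitions (in particular, Nn as the proportion of negative examples in the adjusted G-mean cited as [18]); the counts are invented for illustration:

```python
# Gm, Fm and AGm computed from confusion-matrix counts.
import math

def dropout_metrics(tp, fn, tn, fp):
    tpr = tp / (tp + fn)                  # sensitivity (TPrate)
    tnr = tn / (tn + fp)                  # specificity (TNrate)
    ppv = tp / (tp + fp)                  # precision (PPV)
    gm = math.sqrt(tpr * tnr)             # Equation 2
    fm = 2 * ppv * tpr / (ppv + tpr)      # Equation 3
    nn = (tn + fp) / (tp + fn + tn + fp)  # proportion of negatives
    agm = (gm + tnr * nn) / (1 + nn) if tpr > 0 else 0.0  # Equation 4
    return gm, fm, agm

gm, fm, agm = dropout_metrics(tp=80, fn=20, tn=900, fp=100)
```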
Figure 7 Gm and Fm results for the KNN, LR, RF and MLP classifiers (over-sampling case)
Figure 8 Gm and Fm results for the KNN, LR, RF and MLP classifiers (under-sampling case)
Figure 9 Gm and Fm results for the KNN, LR, RF and MLP classifiers (no-sampling case)
Figure 10 AGm results for the KNN, LR, RF and MLP classifiers (over-sampling case)
Figure 11 AGm results for the KNN, LR, RF and MLP classifiers (under-sampling case)
Figure 12 AGm results for the KNN, LR, RF and MLP classifiers (no-sampling case)
Results presented in Table 4 reveal that the performance of the tuned algorithms (LR2 and MLP2) improved compared with the untuned algorithms (LR and MLP). Furthermore, the stacking classifier (ENB) shows considerably better validation and test results, followed by the tuned logistic regression model (LR2).

4. Discussion
Although a number of studies have shown the feasibility of explaining student dropout, few works have actually attempted to predict it. In this study, we use machine learning techniques that are able to automatically identify the relevant features. With the right model, it was possible to predict student dropout as well as explain the variables that are likely to be useful in the prediction. We achieved this by employing an ensemble classifier, which tends to do better than any single individual classifier. This classifier, produced by soft combining the tuned LR2 and MLP2, achieved the best results, followed by the tuned LR2. The machine learning approach of combining multiple classifiers has been proposed for improving predictive performance [43] and generates better results [44]. Furthermore, we observed that student gender is the leading feature, with a high contribution to the student dropout problem, and that hyperparameter tuning improves algorithm performance. Compared with the results presented by [2], as described in Table 5, J48 showed better results in a proposed student advising model for enhancing students' academic performance and decreasing dropout. There, three decision tree classification algorithms, namely J48, random tree and reduced error pruning (REP) tree, were applied to a real dataset representing students' records in a managerial higher institute in Giza, Egypt. The approach used in our study, by contrast, focused on analyzing four supervised learning classifiers representing linear, ensemble, instance-based and neural network models, rather than focusing only on decision tree classification algorithms.

Furthermore, in an investigation of prediction algorithms for academic performance aimed at tackling the problem of student dropout [1], LR achieved the highest classification performance among the six classifiers compared (LR, MLP, sequential minimal optimization (SMO), naive Bayes (NB), J48 and RF) using six metrics on a dataset collected from rural and peri-urban primary schools in Kenya, as shown in Table 6. LR also achieved better results in our study.
Table 6 A comparison of the classifiers' performance using the six selected metrics [Comparison from [1]]
Model Recall Specificity ROC F-Measure Kappa RMSE
LR 0.924 0.686 0.887 0.897 0.6345 0.3375
MLP 0.873 0.660 0.851 0.865 0.5407 0.4124
SMO 0.911 0.703 0.807 0.894 0.6309 0.3893
NB 0.701 0.801 0.846 0.784 0.4403 0.4264
J48 0.905 0.670 0.822 0.884 0.5941 0.3720
RF 0.907 0.684 0.870 0.888 0.6082 0.3471
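The tune-then-combine idea described in this experiment can be sketched with scikit-learn; the data, parameter grid and hyperparameter values below are illustrative placeholders, not the paper's tuned settings:

```python
# A small grid search stands in for hyperparameter tuning, and
# voting="soft" averages the predicted class probabilities of the tuned
# LR ("LR2") and the MLP ("MLP2") to form the ensemble ("ENB").
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "LR2": logistic regression with its regularization strength tuned.
lr2 = GridSearchCV(LogisticRegression(max_iter=1000),
                   {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
lr2.fit(X_tr, y_tr)

# "MLP2": a small multilayer perceptron (placeholder settings).
mlp2 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

# "ENB": soft-voting ensemble of the two classifiers.
enb = VotingClassifier([("lr2", lr2.best_estimator_), ("mlp2", mlp2)],
                       voting="soft")
enb.fit(X_tr, y_tr)
score = enb.score(X_te, y_te)
```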
performance compared to its baseline settings, and stacking these classifiers improves the overall predictive performance. The contribution of each feature to the prediction performance was also shown, with student gender being the leading feature. For future work, we plan to explore different datasets, so as to show and compare results for different train, test and validation splits, and to evaluate several imbalance techniques for student dropout prediction using more measures for the comparison of results. This will include extending the experiment by applying an under-sampling approach with penalized models to resolve the imbalance issue. Besides, we will generalize the study and add more features, so as to evaluate feature subsets for a better understanding of the underlying process.

Acknowledgment
The authors would like to thank the African Development Bank (AfDB), Data for Local Impact (DLi), Eagle Analytics Company, the late Dr. Yaw-Nkansah Gyekye and Anthony Faustine for supporting this study.

Conflicts of interest
The authors have no conflicts of interest to declare.

References
[1] Mgala M. Investigating prediction modelling of academic performance for students in rural schools in Kenya (Doctoral dissertation, University of Cape Town). 2016.
[2] Mohamed MH, Waguih HM. A proposed academic advisor model based on data mining classification techniques. International Journal of Advanced Computer Research. 2018; 8(36):129-36.
[3] KH, Van Der Schaar M. A machine learning approach for tracking and predicting student performance in degree programs. IEEE Journal of Selected Topics in Signal Processing. 2017; 11(5):742-53.
[4] Feng W, Tang J, Liu TX. Understanding dropouts in MOOCs. Association for the Advancement of Artificial Intelligence. 2019.
[5] Iam-On N, Boongoen T. Generating descriptive model for student dropout: a review of clustering approach. Human-Centric Computing and Information Sciences. 2017; 7(1).
[6] Kumar M, Singh AJ, Handa D. Literature survey on educational dropout prediction. International Journal of Education and Management Engineering. 2017; 7(2):8-19.
[7] Ameri S, Fard MJ, Chinnam RB, Reddy CK. Survival analysis based framework for early prediction of student dropouts. In proceedings of the ACM international conference on information and knowledge management 2016 (pp. 903-12). ACM.
[8] Aulck L, Velagapudi N, Blumenstock J, West J. Predicting student dropout in higher education. arXiv preprint arXiv:1606.06364. 2016.
[9] Chen Y, Chen Q, Zhao M, Boyer S, Veeramachaneni K, Qu H. DropoutSeer: visualizing learning patterns in massive open online courses for dropout reasoning and prediction. In conference on visual analytics science and technology 2016 (pp. 111-20). IEEE.
[10] Hu Q, Polyzou A, Karypis G, Rangwala H. Enriching course-specific regression models with content features for grade prediction. In international conference on data science and advanced analytics 2017 (pp. 504-13). IEEE.
[11] Elbadrawy A, Polyzou A, Ren Z, Sweeney M, Karypis G, Rangwala H. Predicting student performance using personalized analytics. Computer. 2016; 49(4):61-9.
[12] Iqbal Z, Qadir J, Mian AN, Kamiran F. Machine learning based student grade prediction: a case study. arXiv preprint arXiv:1708.08744. 2017.
[13] Wang W, Yu H, Miao C. Deep model for dropout prediction in MOOCs. In proceedings of the international conference on crowd science and engineering 2017 (pp. 26-32). ACM.
[14] Hamedi A, Dirin A. A Bayesian approach in students' performance analysis. International conference on education and new learning technologies. 2018.
[15] https://icsh.es/2017/11/12/i-congreso-internacional-multidisciplinario-de-educacion-superior/. Accessed 26 October 2018.
[16] Hung JL, Wang MC, Wang S, Abdelrasoul M, Li Y, He W. Identifying at-risk students for early interventions-a time-series clustering approach. IEEE Transactions on Emerging Topics in Computing. 2017; 5(1):45-55.
[17] Młynarska E, Greene D, Cunningham P. Time series clustering of MOODLE activity data. In Irish conference on artificial intelligence and cognitive science, University College Dublin, Dublin, Ireland, 2016.
[18] Yan J, Han S. Classifying imbalanced data sets by a novel re-sample and cost-sensitive stacked generalization method. Mathematical Problems in Engineering. 2018.
[19] Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences. 2017; 12(16):4102-7.
[20] Shahul S, Suneel S, Rahaman MA, Swathi JN. A study of data pre-processing techniques for machine learning algorithm to predict software effort estimation. Imperial Journal of Interdisciplinary Research. 2016; 2(6):546-50.
[21] Krawczyk B. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence. 2016; 5(4):221-32.
[22] Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F. New ordering-based pruning metrics for ensembles of classifiers in imbalanced datasets. In proceedings of the international conference on
computer recognition systems 2016 (pp. 3-15). Springer, Cham.
[23] Borowska K, Topczewska M. New data level approach for imbalanced data classification improvement. In proceedings of the international conference on computer recognition systems 2015 (pp. 283-94). Springer, Cham.
[24] Rout N, Mishra D, Mallick MK. Handling imbalanced data: a survey. In international proceedings on advances in soft computing, intelligent systems and applications 2018 (pp. 431-43). Springer, Singapore.
[25] Saini AK, Nayak AK, Vyas RK. ICT based innovations. Proceedings of CSI. 2015.
[26] Dattagupta SJ. A performance comparison of oversampling methods for data generation in imbalanced learning tasks (Doctoral dissertation). 2017.
[27] Stefanowski J. On properties of undersampling bagging and its extensions for imbalanced data. In proceedings of the international conference on computer recognition systems 2016 (pp. 407-17). Springer, Cham.
[28] Moreno MF. Comparing the performance of oversampling techniques for imbalanced learning in insurance fraud detection (Doctoral dissertation). 2017.
[29] Santoso B, Wijayanto H, Notodiputro KA, Sartono B. Synthetic over sampling methods for handling class imbalanced problems: a review. In IOP conference series: earth and environmental science 2017 (p. 012031). IOP Publishing.
[30] Skryjomski P, Krawczyk B. Influence of minority class instance types on SMOTE imbalanced data oversampling. In first international workshop on learning with imbalanced domains: theory and applications 2017 (pp. 7-21).
[31] Ahmed S, Mahbub A, Rayhan F, Jani R, Shatabda S, Farid DM. Hybrid methods for class imbalance learning employing bagging with sampling techniques. In international conference on computational systems and information technology for sustainable solution 2017 (pp. 1-5). IEEE.
[32] Douzas G, Bacao F. Geometric SMOTE: effective oversampling for imbalanced learning through a geometric extension of SMOTE. arXiv preprint arXiv:1709.07377. 2017.
[33] Elhassan T, Aljurf M. Classification of imbalance data using tomek link (T-Link) combined with random under-sampling (RUS) as a data reduction method. Global Journal of Technology and Optimization. 2016; S1:111.
[34] Khaldy MA, Kambhampati C. Resampling imbalanced class and the effectiveness of feature selection methods for heart failure dataset. International Robotics & Automation Journal. 2018; 4(1):1-10.
[35] Kim D, Kim S. Sustainable education: analyzing the determinants of university student dropout by nonlinear panel data models. Sustainability. 2018; 10(4):1-18.
[36] Márquez-Vera C, Cano A, Romero C, Noaman AY, Mousa Fardoun H, Ventura S. Early dropout prediction using data mining: a case study with high school students. Expert Systems. 2016; 33(1):107-24.
[37] Rovira S, Puertas E, Igual L. Data-driven system to predict academic grades and dropout. PLoS ONE. 2017; 12(2):e0171207.
[38] Aulck L, Aras R, Li L, L'Heureux C, Lu P, West J. STEM-ming the tide: predicting STEM attrition using student transcript data. arXiv preprint arXiv:1708.09344. 2017.
[39] Rojas-Domínguez A, Padierna LC, Valadez JM, Puga-Soberanes HJ, Fraire HJ. Optimal hyper-parameter tuning of SVM classifiers with application to medical diagnosis. IEEE Access. 2018; 6:7164-76.
[40] Probst P, Wright MN, Boulesteix AL. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1804:e1301.
[41] Dalvi PT, Vernekar N. Anemia detection using ensemble learning techniques and statistical models. In international conference on recent trends in electronics, information & communication technology 2016 (pp. 1747-51). IEEE.
[42] Feng W, Huang W, Ren J. Class imbalance ensemble learning based on the margin theory. Applied Sciences. 2018; 8(5):815.
[43] Abuassba AO, Zhang D, Luo X, Shaheryar A, Ali H. Improving classification performance through an advanced ensemble based heterogeneous extreme learning machines. Computational Intelligence and Neuroscience. 2017.
[44] Afolabi LT, Saeed F, Hashim H, Petinrin OO. Ensemble learning method for the prediction of new bioactive molecules. PLoS ONE. 2018; 13(1):e0189538.

Neema Mduma is a PhD fellow in the department of Information and Communication Science and Engineering (ICSE) at the Nelson Mandela African Institution of Science and Technology (NM-AIST). Her focus is on supporting education, and she is currently conducting a study on developing a machine learning approach for predicting student dropout.
Email: [email protected]

Khamisi Kalegele is a Lecturer and Researcher at the Tanzania Commission of Science and Technology (COSTECH). He graduated with a PhD in Information Sciences from Tohoku University, Japan in 2013, an MEng in Computer Science from Ehime University in Japan and a BSc in Computer Engineering and IT from the University of Dar Es Salaam. His research areas are Data Science, E-health and Machine Learning in Education.
Email: [email protected]