Prediction Clustering
Keywords: Educational Data Mining, k-Means Algorithm, k-Medoids Algorithm, Fuzzy C Means
Algorithm, Expectation Maximization Algorithm.
1. Introduction
Data mining is the process of extracting previously unknown, valid, potentially useful and
hidden patterns from large data sets. The amount of data stored in educational databases is
increasing rapidly. To benefit from such large data sets and to find hidden relationships
between variables, different data mining techniques have been developed and used.
Clustering is one of the most widely used techniques in data mining. In education, the aim of
clustering is to partition students into homogeneous groups according to their characteristics
and abilities [1].
Educational organizations usually collect huge amounts of data that are relevant to faculty
members, students, and others, but the importance of the collected data is often unknown.
The data used to generate simple queries or traditional reports may be insignificant and may
not contribute to inference or decision making in the organization. The volume and complexity
of the collected data may also be so high that the data are not easy to handle. In that case
the collected data may go unused and occupy storage unnecessarily. The available data become
usable only if they are converted into useful information by exploiting their potential. A
wide range of data mining algorithms is used to extract useful information from the data
gathered in various educational organizations.
International Journal of Pure and Applied Mathematics Special Issue
Research interest in applying data mining to education is increasing. Educational data mining
is concerned with developing methods that discover knowledge from data and uncover hidden
information. The discovered knowledge can be used to better understand students' behavior, to
assist instructors, to improve teaching, to evaluate and improve e-learning systems, to improve
students' academic performance, to improve curricula, and for many other purposes [2].
This study investigates the educational domain of data mining. It performs a comparative
analysis of four clustering algorithms: the k-Means algorithm, the k-Medoids algorithm, the
Fuzzy C Means (FCM) algorithm and the Expectation Maximization (EM) algorithm. The performance
of these algorithms is compared in terms of purity, normalized mutual information (NMI) and the
time taken to form the clusters. The student data were collected from different private Arts
and Science colleges. The collected academic records were grouped according to their similar
characteristics, forming clusters.
The rest of the article is organized as follows. Section 2 discusses research articles related
to data mining techniques for predicting and clustering students' performance. Section 3
explains the basic concepts of the k-Means, k-Medoids, FCM and EM algorithms. Section 4
examines the clustering results of each algorithm and compares them to evaluate the performance
of the algorithms. Finally, the conclusion summarizes the research work.
2. Related Work
In educational data mining, much research has been done on predicting students' performance
using different data mining techniques such as clustering, classification and neural networks.
Some of the methodologies from different research articles are discussed in this section.
Educational Data Mining (EDM) is the field of study concerned with mining educational data to
find interesting patterns and knowledge in educational organizations. The study in [3] explores
multiple factors theoretically assumed to affect students' performance in higher education and
finds a qualitative model that best classifies and predicts students' performance based on
related personal and social factors. In [3], four decision tree algorithms were applied to the
collected student data, namely C4.5, ID3, CART and CHAID.
Durairaj et al. [4] propose educational data mining for the prediction of student performance
using clustering algorithms. They predict students' performance through clustering with the
WEKA data mining tool, which paves the way for a strategic management tool. Prashant et al. [5]
examine cluster analysis in data mining, analyze the use of the k-means algorithm in improving
students' academic performance in higher education, and present k-means clustering as a simple
and efficient tool for monitoring the progression of students' performance.
Shiwani and Roopali [6] proposed a work to evaluate the performance of students of Digital
Electronics at the University Institute of Engineering and Technology. The researchers applied
unsupervised learning algorithms such as K-means and hierarchical clustering using the
open-source WEKA tool. The paper [7] focuses on data mining techniques applied to small data
sets concerning higher education institutions and concludes that the use of these techniques in
real-life situations is useful and promising, and can provide administrators with valuable
tools for decision making. Clustering is used in [8] to analyze data concerning the evaluation
of courses taken by students, linked to their results in the corresponding exams. The work
presented in [9] reviews different clustering algorithms applied in the educational data mining
context, while [10] is a review of recent educational data mining developments whose contents
are in turn analyzed by a data mining approach.
Sarala et al. [11] discussed the applications of data mining in educational institutions to
extract useful information from huge data sets and to provide analytical tools to view and use
this information for decision-making processes, using real-life examples. The paper in [12]
focuses on identifying the clustering algorithm that is most suitable for predicting students'
performance in educational data mining. The objective of this research work is to gain insight
into how cluster analysis can be done in the educational domain and to highlight the potential
characteristics of clustering algorithms on educational data sets. In [13], a new model was
used to predict student performance using a neural network. The model helps to accurately
predict students at risk of dropping out and to reduce dropout rates. The comparison between
planned and actual performance indicates that the model works in estimating student
performance. Veeramuthu et al. [14] designed a model to serve as a guideline for higher
education systems to improve their decision-making processes. The authors aim to analyze how
different factors affect a student's learning behavior and performance using the K-means
clustering algorithm. Sivaram et al. [15] surveyed the applicability of clustering and
classification algorithms to recruitment data mining, identifying techniques that fit the
problems that are determined. Their study applied K-means, fuzzy C-means clustering and
decision tree classification algorithms to the recruitment data of an industry.
3. Clustering Algorithms
Clustering is the process of grouping a set of physical or abstract objects into classes of
similar objects. A cluster is a collection of data objects that are similar to one another
within the same cluster and dissimilar to the objects in other clusters. Data clustering is
also referred to as unsupervised learning and is used in statistical data analysis. Cluster
analysis is an important human activity. It has been widely used in numerous applications,
including pattern recognition, data analysis, image processing and market research. Clustering
is a descriptive task that seeks to identify homogeneous groups of objects based on the values
of their attributes. Clustering has many requirements, such as scalability, dealing with
different types of attributes, discovery of clusters with arbitrary shape, minimal requirements
for domain knowledge to determine input parameters, ability to deal with noisy data, high
dimensionality, interpretability and usability. Clustering techniques can be broadly classified
into partitioning, hierarchical, density-based, grid-based and model-based algorithms.
The k-Means algorithm partitions the n objects into k partitions (k ≤ n), where each partition
represents a cluster. The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance. The algorithm is composed of the following
steps:
Step 1: Place k points into the space represented by the objects that are being clustered. These
points represent the initial group centroids.
Step 2: Assign each object to the group that has the closest centroid.
Step 3: When all objects have been assigned, recalculate the positions of the k centroids.
Step 4: Repeat steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized
can be calculated. k-Means is a simple clustering algorithm that has been adapted to several
problem domains.
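The four k-Means steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name `kmeans`, the `max_iter` safeguard, the use of Euclidean distance, the choice to leave an empty group's centroid in place, and the sample data are all assumptions made for the example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Cluster `points` (tuples of floats) around the given starting centroids."""
    centroids = [tuple(c) for c in centroids]      # Step 1: initial group centroids
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its assigned objects.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)                # empty group: leave centroid in place
        # Step 4: stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical one-attribute data (for example marks); not taken from the study.
marks = [(35.0,), (40.0,), (45.0,), (80.0,), (85.0,), (90.0,)]
labels, centroids = kmeans(marks, centroids=[(35.0,), (90.0,)])
print(labels, centroids)
```

The starting centroids are passed in explicitly so that the run is reproducible; in practice they are often chosen at random, and different choices can lead to different final groups.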
i=1,…,n; j=1,..,n
Step 2: Calculate Pij to make an initial guess at the centers of the clusters.
$$P_{ij} = \frac{d_{ij}}{\sum_{i=1}^{n} d_{ij}}, \quad i = 1, \ldots, n;\ j = 1, \ldots, n \tag{2}$$
Step 3: Calculate $\sum_{i=1}^{n} P_{ij}$ (j = 1, …, n) at each object and sort the values in
ascending order. Select the k objects having the minimum values as the initial group medoids.
Step 4: Assign each object to the nearest medoid.
Step 5: Calculate the current optimal value, the sum of the distances from all objects to their
medoids.
Step 6: Replace the current medoid in each cluster by the object that minimizes the total
distance to the other objects in its cluster.
Step 7: Assign each object to the nearest new medoid.
Step 8: Calculate the new optimal value, the sum of the distances from all objects to their new
medoids. If the optimal value is equal to the previous one, stop the algorithm. Otherwise,
go back to Step 6.
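Steps 4 to 8 of the k-Medoids procedure can be sketched as follows. The sketch takes the initial medoids as an argument instead of deriving them, so the initialisation steps are deliberately left out. The names `assign`, `total_cost` and `kmedoids`, the `max_iter` safeguard, the Euclidean distance and the sample data are assumptions made for illustration only.

```python
import math


def assign(points, medoids):
    """Return, for each point, the position (in `medoids`) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]


def total_cost(points, medoids, labels):
    """Sum of the distances from all objects to their medoids."""
    return sum(math.dist(p, points[medoids[lab]]) for p, lab in zip(points, labels))


def kmedoids(points, medoids, max_iter=100):
    """`medoids` holds indices into `points`; returns (labels, medoids, cost)."""
    medoids = list(medoids)
    labels = assign(points, medoids)                       # Step 4
    cost = total_cost(points, medoids, labels)             # Step 5
    for _ in range(max_iter):
        # Step 6: in each cluster, pick the member with the smallest total
        # distance to the other members as the new medoid.
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                medoids[c] = min(
                    members,
                    key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
                )
        labels = assign(points, medoids)                   # Step 7
        new_cost = total_cost(points, medoids, labels)     # Step 8
        if new_cost == cost:                               # unchanged: stop
            break
        cost = new_cost
    return labels, medoids, cost


# Hypothetical data; the initial medoids (objects 0 and 3) are chosen by hand.
data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
print(kmedoids(data, medoids=[0, 3]))
```

Because a medoid is always one of the observed objects, the result is less sensitive to outliers than a mean-based centroid, at the price of the pairwise distance sums in Step 6.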
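$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}} \tag{3}$$
$$v_j = \frac{\sum_{i=1}^{n} (\mu_{ij})^{m} x_i}{\sum_{i=1}^{n} (\mu_{ij})^{m}}, \quad \forall j = 1, 2, \ldots, c \tag{4}$$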
where,
'n' is the number of data points.
'vj' represents the jth cluster center.
'm' is the fuzziness index, m ∈ [1, ∞).
'c' represents the number of cluster centers.
'µij' represents the membership of the ith data point in the jth cluster.
'dij' represents the Euclidean distance between the ith data point and the jth cluster center.
The main objective of the fuzzy c-means algorithm is to minimize:
$$J(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} (\mu_{ij})^{m} \, \lVert x_i - v_j \rVert^{2} \tag{5}$$
where,
'||xi – vj||' is the Euclidean distance between the ith data point and the jth cluster center.
Let X = {x1, x2, x3 ..., xn} be the set of data points and V = {v1, v2, v3 ..., vc} be the set of centers.
Step 1: Randomly select ‘c’ cluster centers.
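Step 2: Calculate the fuzzy membership 'µij' using:
$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}} \tag{6}$$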
Step 3: Compute the fuzzy centers 'vj' using:
$$v_j = \frac{\sum_{i=1}^{n} (\mu_{ij})^{m} x_i}{\sum_{i=1}^{n} (\mu_{ij})^{m}}, \quad \forall j = 1, 2, \ldots, c \tag{7}$$
Step 4: Repeat steps 2 and 3 until the minimum 'J' value is achieved or ||U(k+1) - U(k)|| < β.
where,
'k' is the iteration step.
'β' is the termination criterion, a value in [0, 1].
'U = (µij)n*c' is the fuzzy membership matrix.
'J' is the objective function.
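The fuzzy c-means iteration can be sketched as below: memberships are updated from the distance ratios raised to 2/(m-1), centers are recomputed as membership-weighted means, and the loop stops when the membership matrix changes by less than β. This is an illustrative sketch under stated assumptions, not the study's code: the function name `fcm`, the default `m = 2.0`, the largest absolute entry change as the norm ||U(k+1) - U(k)||, the handling of a point that coincides with a center, and the sample data are all choices made for the example. The sketch requires m > 1, since the exponent 2/(m-1) is undefined at m = 1.

```python
import math


def fcm(points, centers, m=2.0, beta=1e-5, max_iter=100):
    """Fuzzy c-means on tuples of floats; returns (U, centers). Requires m > 1."""
    centers = [tuple(v) for v in centers]          # Step 1: initial cluster centers
    n, c, dim = len(points), len(centers), len(points[0])
    U = [[0.0] * c for _ in range(n)]
    for _ in range(max_iter):
        # Step 2: membership of each point in each cluster.
        new_U = []
        for x in points:
            d = [math.dist(x, v) for v in centers]
            if 0.0 in d:
                # Assumption: a point lying exactly on a center belongs fully to it.
                hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
                row = [h / sum(hits) for h in hits]
            else:
                row = [
                    1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                    for j in range(c)
                ]
            new_U.append(row)
        # Step 3: fuzzy centers as means weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [new_U[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                tuple(sum(wi * x[a] for wi, x in zip(w, points)) / total for a in range(dim))
            )
        # Step 4: stop when the largest membership change falls below beta.
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = new_U
        if change < beta:
            break
    return U, centers


# Hypothetical data and starting centers, for illustration only.
data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
U, centers = fcm(data, centers=[(0.0,), (13.0,)])
print(centers)
```

Unlike k-Means, every point keeps a degree of membership in every cluster; a hard assignment can be read off by taking the cluster with the largest membership in each row of U.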
Step 1: Initialization
Step 2: E-Step: Estimate the probability that each element belongs to each cluster.
Step 3: M-Step: Estimate the parameters of the probability distribution of each class for the
next step.
Step 4: Convergence Test: After each iteration, a convergence test verifies whether the
difference between the attribute vector of the current iteration and that of the previous
iteration is smaller than an acceptable tolerance given by a parameter.
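The four EM steps can be illustrated with a small sketch. The text does not state which probability distribution is fitted, so the sketch assumes a mixture of one-dimensional Gaussians; the function names, the starting variance (the overall data variance), the variance floor `min_var`, the use of the largest shift in the cluster means as the convergence test, and the sample data are likewise assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, tol=1e-6, max_iter=200, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (labels, means, variances, weights)."""
    k, n = len(means), len(data)
    # Step 1: initialization (given means, equal weights, overall data variance).
    means = list(means)
    overall = sum(data) / n
    start_var = max(sum((x - overall) ** 2 for x in data) / n, min_var)
    variances = [start_var] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(max_iter):
        # Step 2 (E-step): probability that each element belongs to each cluster.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # Step 3 (M-step): re-estimate the parameters of each cluster.
        new_means = []
        for j in range(k):
            nk = sum(r[j] for r in resp)
            if nk == 0:                            # cluster received no weight: keep it
                new_means.append(means[j])
                continue
            mu = sum(r[j] * x for r, x in zip(resp, data)) / nk
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
            weights[j] = nk / n
            variances[j] = max(var, min_var)
            new_means.append(mu)
        # Step 4 (convergence test): compare parameters with the previous iteration.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, means, variances, weights


# Hypothetical one-attribute data; not taken from the study.
marks = [35.0, 40.0, 45.0, 80.0, 85.0, 90.0]
print(em_gmm_1d(marks, means=[35.0, 90.0]))
```

The E-step produces soft assignments, much like the FCM memberships, while the M-step turns them into updated means, variances and mixing weights.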
4. Experimental Results
This section explains the performance evaluation of the proposed approach. The clustering
algorithms are implemented in Java (version 1.7), and the experiments are performed on an
Intel(R) Pentium machine with a clock speed of 2.13 GHz and 2.0 GB of RAM, running the
Windows 7 32-bit operating system.
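$$I(\Omega, C) = \sum_{k} \sum_{j} P(w_k \cap c_j) \log \frac{P(w_k \cap c_j)}{P(w_k)\, P(c_j)} \tag{10}$$
$$= \sum_{k} \sum_{j} \frac{|w_k \cap c_j|}{N} \log \frac{N\, |w_k \cap c_j|}{|w_k|\, |c_j|} \tag{11}$$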
where P(wk), P(cj) and P(wk ∩ cj) are the probabilities of a data point being in cluster wk,
in class cj, and in the intersection of wk and cj, respectively, and N is the total number of
data points.
H is the entropy,
$$H(\Omega) = -\sum_{k} P(w_k) \log P(w_k) = -\sum_{k} \frac{|w_k|}{N} \log \frac{|w_k|}{N} \tag{12}$$
NMI is always a number between 0 and 1.
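The evaluation measures can be computed directly from cluster and class labels, as in the sketch below. It follows the count-based forms of the mutual information and entropy given above. Two points are assumptions rather than statements from the text: NMI is normalised by the average of the two entropies, and purity is taken in its usual form, the fraction of data points that fall in the majority class of their cluster. The function names and sample labels are hypothetical.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    best = {}
    for (w, _), count in Counter(zip(clusters, classes)).items():
        best[w] = max(best.get(w, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(labels):
    """H = -sum_k (|w_k| / N) * log(|w_k| / N)."""
    n = len(labels)
    return -sum((cnt / n) * math.log(cnt / n) for cnt in Counter(labels).values())


def mutual_information(clusters, classes):
    """I = sum_k sum_j (|w_k ∩ c_j| / N) * log(N |w_k ∩ c_j| / (|w_k| |c_j|))."""
    n = len(clusters)
    size_w, size_c = Counter(clusters), Counter(classes)
    return sum(
        (cnt / n) * math.log(n * cnt / (size_w[w] * size_c[c]))
        for (w, c), cnt in Counter(zip(clusters, classes)).items()
    )


def nmi(clusters, classes):
    """Mutual information divided by the mean entropy (assumed normalisation)."""
    mean_h = (entropy(clusters) + entropy(classes)) / 2.0
    if mean_h == 0:
        return 1.0  # convention chosen here: one cluster and one class agree trivially
    return mutual_information(clusters, classes) / mean_h


# Hypothetical labels: cluster ids from an algorithm, class names from ground truth.
found = [0, 0, 0, 1, 1, 1]
truth = ["Good", "Good", "Average", "Average", "Average", "Good"]
print(purity(found, truth), nmi(found, truth))
```

Purity rewards clusters dominated by a single class but grows with the number of clusters, whereas NMI penalises a clustering that splits the data more finely than the classes justify; reporting both gives a more balanced picture.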
[Figure 1: Distribution of clusters. x-axis: clustering algorithm (k-Means, k-Medoids, FCM, EM); y-axis: scale from 0 to 800; series: Average, Good, Excellent.]
Figure 1 compares the cluster distributions produced by the four algorithms. The data points
in cluster-1 are uniformly distributed for every algorithm except k-Means.
Table 3 shows the execution time of the clustering algorithms. FCM takes less time than EM,
and k-Medoids has the lowest execution time. In Figure 2, the x-axis represents the clustering
algorithm and the y-axis represents the time in milliseconds.
[Figure 2: Execution Time in ms. x-axis: Clustering Algorithm (K-Means, K-Medoids, FCM, EM); y-axis: Time in ms, from 0 to 600.]
Table 4 and Figure 3 show the comparison of purity and NMI values.
[Figure 3: Clustering Comparison. x-axis: Clustering Algorithm (K-Means, K-Medoids, FCM, EM); y-axis: Value, from 0 to 0.7; series: Purity, NMI.]
The comparison shows that the purity values of EM and FCM are higher than those of the k-Means
and k-Medoids algorithms, while the NMI values of EM and FCM are lower than those of k-Means
and k-Medoids. Overall, FCM and EM perform better than k-Means and k-Medoids in terms of
distribution, purity and NMI, but these algorithms take more execution time.
6. Conclusion
This research work shows that clustering techniques serve as a powerful tool in educational
data mining. Various clustering algorithms are discussed, and students' performance is
evaluated using these algorithms. The clustering algorithms k-Means, k-Medoids, FCM and EM were
examined and compared on a student data set. The selected parameters of the student data set
were evaluated and the results analysed. The algorithms were evaluated using execution time,
purity and NMI. The results show that the FCM and EM algorithms perform well compared with the
other two clustering algorithms.
References
[1] Sreenivasarao, Vuda, and Capt Genetu Yohannes. "Improving academic performance of
students of defence university based on data warehousing and data mining." Global Journal of
Computer Science and Technology, 2012, Vol. 12(2), pp. 29-36.
[2] Romero, Cristobal, and Sebastian Ventura. "Educational data mining: A survey from 1995 to
2005." Expert Systems with Applications, 2007, Vol. 33, pp. 135-146.
[3] Saa, Amjad Abu. "Educational Data Mining & Students' Performance Prediction."
International Journal of Advanced Computer Science and Applications, 2016, Vol. 7(5), pp.
212-220.
[4] Durairaj, M., and C. Vijitha. "Educational Data mining for Prediction of Student Performance
Using Clustering Algorithms." International Journal of Computer Science and Information
Technologies, 2014, Vol. 5(4), pp. 5987-5991.
[5] Saxena, Prashant Sahai, and M. C. Govil. "Prediction of Student's Academic Performance
using Clustering." Special Conference Issue: National Conference on Cloud Computing &
Big Data, 2014.
[6] Rana, Shiwani, and Roopali Garg. "Evaluation of student's performance of an institute using
clustering algorithms." International Journal of Applied Engineering Research, 2016,
Vol. 11(5), pp. 3605-3609.
[7] Natek, Srečko, and Moti Zwilling. "Student data mining solution–knowledge management
system related to higher education institutions." Expert Systems with Applications,
2014, Vol. 41(14), pp. 6400-6407.
[8] Campagni, Renza, Donatella Merlini, and M. Cecilia Verri. "Finding Regularities in Courses
Evaluation with K-means Clustering." CSEDU - 6th International Conference on Computer
Supported Education, 2014, Vol. 2, pp. 26-33.
[9] A. Dutt, S. Aghabozrgi, M.A.B. Ismail, H. Mahroeian. "Clustering algorithms applied in
educational data mining." International Journal of Information and Electronics Engineering,
2015, Vol. 5, pp. 280-291.
[10] Peña-Ayala, Alejandro. "Educational data mining: A survey and a data mining-based analysis
of recent works." Expert Systems with Applications, 2014, Vol. 41(4), pp. 1432-1462.
[11] Sarala, V., and J. Krishnaiah. "Empirical Study Of Data Mining Techniques In Education
System." International Journal of Advances in Computer Science and Technology (IJACST),
2015, pp. 15-21.
[12] C. Anuradha, T. Velmurugan, R. Anandavally. "Clustering algorithms in educational data
mining: a review." International Journal of Power Control and Computation (IJPCSC), 2015,
Vol. 7(1), pp. 47-52.
[13] Shirodkar, Jateen Shet, and Viren Pereira. "Determining Students Performance Using the
Tool of Artificial Neural Network." International Journal of Innovative Research and
Development, 2016, Vol. 5(2), pp. 314-318.
[14] Veeramuthu, P., R. Periyasamy, and V. Sugasini. "Analysis of Student Result Using
Clustering Techniques." International Journal of Computer Science and Information
Technologies (IJCSIT), 2014, Vol. 5(4), pp. 5092-5094.
[15] Sivaram, N., and K. Ramar. "Applicability of clustering and classification algorithms for
recruitment data mining." International Journal of Computer Applications, 2010, Vol. 4(5),
pp. 23-28.