Classification Model of Prediction For Placement of Students
Saurabh Pal
Head, Department of MCA, VBS Purvanchal University, Jaunpur, India
Email: [email protected]
Abstract— Data mining methodology can analyze relevant information and produce different perspectives to understand more about students' activities. When designing an educational environment, applying data mining techniques discovers useful information that can be used in formative evaluation to help educators establish a pedagogical basis for taking important decisions. Mining in an educational environment is called Educational Data Mining. Educational Data Mining is concerned with developing new methods to discover knowledge from educational databases and can be used for decision making in educational systems.
In this study, we collected students' data containing information about their previous and current academic records and then applied different classification algorithms using the data mining tool WEKA to analyze the students' academic performance for training and placement.
This study presents a proposed model, based on a classification approach, to find an enhanced evaluation method for predicting the placement of students. This model can determine the relations between the academic achievement of students and their placement in campus selection.

Index Terms— Knowledge Discovery in Databases, Data Mining, Classification Model, Classification, WEKA.

I. INTRODUCTION

The Master of Computer Applications (MCA) course provides professional computer technology education to students. The course offers state-of-the-art theoretical as well as practical knowledge related to information technology and makes students eligible to compete in the growing information industry.
Predicting where MCA students can be placed after completing the MCA course will help students direct their efforts toward proper progress. It will also help teachers pay proper attention to the progress of students during the course, and it will help build the reputation of the institute among similar institutes in the field of IT education.
The present study concentrates on the prediction of placements of MCA students. We apply data mining techniques, using decision tree and Naïve Bayes classifiers, to extract potential and useful knowledge [7].
The rest of this paper is organized as follows: Section II presents different types of data mining techniques for machine learning. Section III describes the background and history of educational data mining. Section IV describes the methodology used in our experiments applying data mining techniques to educational data for the placement of students, together with the results obtained. Finally, we conclude this paper with a summary and an outlook on future work in Section V.
II. DATA MINING
Copyright © 2013 MECS I.J. Modern Education and Computer Science, 2013, 11, 49-56
Data mining ultimately leads to strategic decisions and business intelligence. In its simplest terms, it is the extraction and exploration of knowledge from very large volumes of data, and a more appropriate name for it is "exploring the knowledge of a database", i.e. knowledge discovery in databases. Knowledge discovery is a process that includes the preparation of data and the interpretation of results.
Classification is the most commonly applied data mining technique. It employs a set of pre-classified examples to develop a model that can classify the population of records at large. This approach frequently employs decision tree or neural network-based classification algorithms. The data classification process involves learning and classification. In learning, the training data are analyzed by a classification algorithm; in classification, test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be applied to new data sets. The classifier-training algorithm uses the pre-classified examples to determine the set of parameters required for proper discrimination, and then encodes these parameters into a model called a classifier. The widely used classification algorithms are:

A. Naïve Bayesian Classification

The Naïve Bayes classifier technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods. A Naïve Bayes model can, for example, identify the characteristics of dropout students by showing the probability of each input attribute for the predictable state.
A Naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. By the use of Bayes' theorem we can write

P(C | X) = P(X | C) P(C) / P(X)

where C is a class value and X is the vector of attribute values.
We preferred the Naive Bayes implementation because it:
• is simple and is trained on the whole (weighted) training data;
• offers protection against over-fitting (small subsets of training data);
• does not rely on the claim that boosting "never over-fits", which could not be maintained;
• can determine a complex resulting classifier reliably from a limited amount of data.

B. Multilayer Perceptron

The Multilayer Perceptron (MLP) algorithm is one of the most widely used and popular neural networks. The network consists of a set of sensory elements that make up the input layer, one or more hidden layers of processing elements, and an output layer of processing elements (Witten and Frank, [1]). MLP is especially suitable for approximating a classification function (when we are not very familiar with the relationship between input and output attributes) that maps an example, described by its vector of attribute values, into one or more classes.

C. C4.5 Tree

Probably the most widely used decision tree algorithm today is C4.5. Professor Ross Quinlan [2] developed the C4.5 decision tree algorithm in 1993; it represents the result of research that traces back to the ID3 algorithm (also proposed by Ross Quinlan, in 1986). C4.5 has additional features such as handling of missing values, categorization of continuous attributes, pruning of decision trees, rule derivation, and others. The basic construction of C4.5 uses a method known as divide and conquer to construct a suitable tree from a training set S of cases (Wu and Kumar, [3]):
• If all the cases in S belong to the same class, or S is small, the tree is a leaf labelled with the most frequent class in S.
• Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2, … according to the outcome for each case, and apply the same procedure recursively to each subset.

There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets, and the default gain ratio, which divides information gain by the information provided by the test outcomes.
The J48 algorithm is an implementation of the C4.5 decision tree algorithm in the Weka software tool. A decision tree is presented as a tree structure: at every internal node a condition on some attribute is examined, and every branch of the tree represents an outcome of that test. The branching of the tree ends with leaves that define the class to which examples belong. The decision tree algorithm is a popular procedure today because of its ease of implementation and, in particular, because of the possibility for the results to be displayed graphically.
To evaluate the robustness of a classifier, the usual methodology is to perform cross validation. In this study, 3-fold cross validation was used: we split the data set randomly into 3 subsets of equal size; two subsets were used for training and the remaining subset for measuring the predictive accuracy of the constructed model. This procedure was performed 3 times so that each subset was tested once, and test results were averaged over the 3 cross validation runs. Data splitting was done without sampling stratification. The Weka software toolkit can calculate all these performance metrics after running a specified k-fold cross-validation. The prediction accuracy of the models was compared.
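The conditional-independence computation behind the Naïve Bayes classifier can be sketched in a few lines of Python. The toy rows and attribute names (MR, CS, LW) mirror the study's variables but are hypothetical, not the actual data set; Weka's NaiveBayes implementation additionally handles instance weighting and numeric attributes.

```python
from collections import Counter, defaultdict

# Minimal categorical Naive Bayes: score(C) = P(C) * product_i P(x_i | C).
# The toy rows below are hypothetical, not the study's 65-record data set.
TRAIN = [
    # (MR, CS, LW) -> Placement
    (("First", "Good", "Yes"), "Yes"),
    (("First", "Average", "Yes"), "Yes"),
    (("Second", "Good", "Yes"), "Yes"),
    (("Second", "Poor", "No"), "No"),
    (("Third", "Poor", "No"), "No"),
    (("Third", "Average", "No"), "No"),
]

def train_nb(rows):
    priors = Counter(label for _, label in rows)
    cond = defaultdict(Counter)   # (label, attribute index) -> value counts
    vocab = defaultdict(set)      # attribute index -> values seen in training
    for features, label in rows:
        for i, value in enumerate(features):
            cond[(label, i)][value] += 1
            vocab[i].add(value)
    return priors, cond, vocab

def predict_nb(model, features):
    priors, cond, vocab = model
    total = sum(priors.values())
    scores = {}
    for label, count in priors.items():
        p = count / total  # prior P(C)
        for i, value in enumerate(features):
            # Laplace smoothing keeps unseen attribute values from zeroing p
            p *= (cond[(label, i)][value] + 1) / (count + len(vocab[i]))
        scores[label] = p
    return max(scores, key=scores.get)

model = train_nb(TRAIN)
print(predict_nb(model, ("First", "Good", "Yes")))  # prints Yes
```

Weka normalizes these class scores by P(X) to report posterior probabilities; for picking the most likely class, the unnormalized scores suffice.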
than in Malaysia, Singapore, Japan, China and Sri Lanka. It was also observed that academic performance improved with the intensity of private tutoring, and that this variation in the intensity of private tutoring depends on a collective factor, namely socioeconomic conditions.
Yadav, Bhardwaj and Pal [19] obtained university students' data such as attendance, class test, seminar and assignment marks from the students' database to predict performance at the end of the semester using three algorithms, ID3, C4.5 and CART, and showed that CART is the best algorithm for classification of the data.

IV. DATA MINING PROCESS

Knowing the factors behind the placement of students can help teachers and administrators take the necessary actions to improve the placement success percentage. Predicting the placement of a student requires a lot of parameters to be considered; prediction models that include personal, social, psychological and other environmental variables are necessary for effective prediction of student placement.

A. Data Preparations

The data set used in this study was obtained from the Institute of Engineering and Technology, VBS Purvanchal University, Jaunpur (Uttar Pradesh), for the 2008-2012 session. The initial size of the data is 65 records.

B. Data Selection and Transformation

In this step, only those fields required for data mining were selected. A few derived variables were also included, while some of the information for the variables was extracted directly from the database. All the predictor and response variables derived from the database are given in Table I for reference.

TABLE I: STUDENT RELATED VARIABLES

Variable    Description             Possible Values
Sex         Student's sex           {Male, Female}
MR          MCA result              {First ≥ 60%, Second ≥ 45% and <60%, Third ≥ 36% and <45%}
SEM         Seminar performance     {Poor, Average, Good}
LW          Lab work                {Yes, No}
CS          Communication skill     {Poor, Average, Good}
GB          Graduation background   {Art, Computer, Science}
Placement   Placement of student    {Yes, No}

The domain values for some of the variables were defined for the present investigation as follows:
• MR – Marks obtained in the MCA. It is split into three class values: First – ≥ 60%; Second – ≥ 45% and < 60%; Third – ≥ 36% and < 45%.
• SEM – Seminar performance. In each semester, seminars are organized to check the performance of students. Seminar performance is evaluated into three classes: Poor – presentation and communication skills are low; Average – either the presentation or the communication skill is fine; Good – both presentation and communication skills are fine.
• LW – Lab work. Lab work is divided into two classes: Yes – the student completed the lab work; No – the student did not complete the lab work.
• CS – Communication skill. Communication skill is divided into three classes: Poor – communication skill is low; Average – communication skill is up to the mark; Good – communication skill is fine.
• GB – Graduation background. This defines whether the student's graduation is in Art, Science or Computer.
• Placement – Whether the student was placed or not after completing his/her MCA. Possible values are Yes if the student was placed and No otherwise.

C. Implementation of Mining Model

Weka is open source software that implements a large collection of machine learning algorithms and is widely used in data mining applications. From the above data, a placement.arff file was created and loaded into the WEKA explorer. The Classify panel enables the user to apply classification and regression algorithms to the resulting dataset, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions or the model itself. The algorithms used for classification are Naive Bayes, Multilayer Perceptron (MLP) and J48. Under "Test options", 10-fold cross-validation is selected as our evaluation approach; since there is no separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. The resulting predictive model provides a way to predict whether or not a new student will be placed in an organization.

D. Results

To better understand the importance of the input variables, it is customary to analyse the impact of the input variables on students' placement success, in which the impact of each input variable of the model on the output variable is analysed. Tests were conducted for the assessment of the input variables using the Chi-square test, the Info Gain test and the Gain Ratio test. Different algorithms provide very different results, i.e. each of them accounts for the relevance of variables in a different way. The average value over all the algorithms is taken as the final variable ranking, instead of selecting one algorithm and trusting it alone. The results obtained with these values are shown in Table II.

TABLE II: RESULT OF TESTS AND AVERAGE RANK

Variable   Chi-squared   Info Gain   Gain Ratio   Average Rank

TABLE IV: TRAINING AND SIMULATION ERROR

Evaluation Criteria              NB       MLP      J48
Kappa statistic                  0.7234   0.6001   0.5076
Mean absolute error (MAE)        0.2338   0.2212   0.3156
Root mean squared error (RMSE)   0.3427   0.4234   0.453
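The Info Gain measure used for the variable ranking in Table II can be illustrated with a short sketch. The rows and attribute names are hypothetical stand-ins for the study's variables; Weka's InfoGainAttributeEval computes the same quantity over the full data set.

```python
from collections import Counter
from math import log2

# Information gain of attribute A over a set S of labelled rows:
#   gain(A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)
# where H is the entropy of the class labels and S_v is the subset with
# A = v. Rows here are hypothetical (GB, LW -> Placement).
ROWS = [
    ("Computer", "Yes", "Yes"),
    ("Computer", "Yes", "Yes"),
    ("Science",  "Yes", "Yes"),
    ("Science",  "No",  "No"),
    ("Art",      "No",  "No"),
    ("Art",      "Yes", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in {r[attr] for r in rows}:
        subset = [r[-1] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

gains = {name: info_gain(ROWS, i) for i, name in enumerate(("GB", "LW"))}
print(sorted(gains, key=gains.get, reverse=True))  # GB splits the classes better
```

Averaging the ranks produced by several such measures (Chi-square, Info Gain, Gain Ratio), as the paper does, guards against any single criterion's bias, e.g. plain information gain favouring many-valued attributes.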
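The Kappa statistic reported in Table IV measures agreement between predicted and actual placement beyond what chance alone would produce. A sketch of its computation from a binary confusion matrix follows; the matrix values are illustrative, not the study's results.

```python
# Cohen's kappa from a confusion matrix (rows: actual, columns: predicted):
#   kappa = (p_o - p_e) / (1 - p_e)
# where p_o is the observed agreement and p_e is the agreement expected
# by chance from the row and column marginals.

def kappa(cm):
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    expected = sum(
        (sum(cm[i]) / total) * (sum(row[i] for row in cm) / total)
        for i in range(len(cm))
    )
    return (observed - expected) / (1 - expected)

# Illustrative matrix: 40 placed students predicted placed, 5 missed;
# 10 not-placed students predicted placed, 45 correctly predicted not placed.
cm = [[40, 5],
      [10, 45]]
print(round(kappa(cm), 4))  # prints 0.7
```

A kappa near 0 indicates chance-level prediction, while values above roughly 0.6 are conventionally read as substantial agreement, which is why Table IV's 0.7234 favours the Naive Bayes model.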