


The 3rd International Conference on Small & Medium Business 2016
January 19 - 21, 2016, Nikko Saigon Hotel, Hochiminh, Vietnam

FEATURE SELECTION TECHNIQUES TO ANALYSE STUDENT ACADEMIC PERFORMANCE USING NAÏVE BAYES CLASSIFIER

C. Anuradha¹, T. Velmurugan²
¹Research Scholar, Bharathiar University, Coimbatore, India.
²Associate Professor, PG and Research Dept. of Computer Science, D.G. Vaishnav College, Chennai-600106, India.
¹[email protected]; ²[email protected]

Abstract: Data mining provides educational institutions with the capability to explore, visualize and analyze large amounts of data in order to reveal valuable patterns in students' learning behaviors. Turning raw data into useful information and knowledge also enables educational institutions to improve teaching and learning practices, and to facilitate decision making in educational settings. Educational data mining is thus becoming increasingly important, with a specific focus on exploiting the abundant data generated by various educational systems to enhance teaching, learning and decision making. In EDM, feature selection chooses a subset of input variables by eliminating irrelevant features. Feature selection algorithms have proven effective in enhancing learning efficiency, increasing predictive accuracy and reducing the complexity of learned results. The primary objective of this research work is to investigate the most relevant subset of features for achieving high predictive accuracy, by adopting the Correlation-based Feature Subset evaluation and Gain-Ratio attribute evaluation feature selection techniques. For classification, the Naïve Bayes classifier is implemented using the WEKA tool. The outcome shows improved predictive accuracy with a minimum number of attributes. The results also reveal that the selected features influence the classification process of the student performance model.

Keywords: Educational Data Mining (EDM), Classification algorithm, Naïve Bayes Algorithm, Feature Selection, Prediction.

I. INTRODUCTION

Nowadays the field of data analytics and data mining (DM) is taking on a new role: that of an enabler for educational institutions to improve their key performance indicators. The importance of data analytics is growing, and a new sub-field of study is in its infancy. This young field is called Educational Data Mining (EDM), and its main purpose is to analyze educational data using a number of different techniques. EDM integrates approaches from database systems, data warehousing, statistics, machine learning and other areas. In this work, an experiment is conducted with educational data: it starts from a description of the state of the art of EDM and continues with the development of a method for exploring data and predicting trends that can improve educational outcomes or address current problems to increase organizational performance. Educational Data Mining is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and with using those methods to better understand students and the settings in which they learn.

Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Large amounts of data are stored in educational databases, so different data mining techniques have been developed and used to retrieve the required data and to find hidden relationships. A variety of popular data mining tasks exist within educational data mining, e.g. classification, clustering, outlier detection, association rules and prediction. Data mining can be used in educational systems for: predicting drop-out students; relating students' university entrance examination results to their later success; predicting students' academic performance; discovering strongly related subjects in undergraduate syllabi; knowledge discovery on academic achievement; classifying students' performance in a computer programming course according to learning style; and investigating similarities and differences between colleges and schools. EDM develops methods and applies techniques from statistics, machine learning and data mining to analyze data collected during teaching and learning. EDM tests learning theories and informs educational practice. As a result, researchers try to determine the variables that are related to students' academic achievement and may affect the registration process. One of the most important challenges that higher education faces is recognizing the pattern of loyal students.


Effective feature selection techniques are required to support efficient classification algorithms. This research work attempts to foretell students' academic failure by reviewing various feature selection algorithms in combination with the Naïve Bayes classifier. The work is structured as follows. Section 2 reviews research that has been conducted in EDM. Section 3 describes the methods and materials of the domain of study. The process of building the model, including data collection and the tools used, is given in Section 4. Section 5 presents the experiments and the results obtained. Finally, the conclusion is given in Section 6.

II. RELATED WORKS

This section discusses some of the research carried out by various researchers in the same field. Humera Shaziya et al. present an approach, based on a Naïve Bayes classifier, to predict the performance of students in semester exams. The objective is to know what grades students may obtain in their end-of-semester results, which helps the educational institute, teachers and students, i.e., all the stakeholders involved in an education system. Students and teachers can take the necessary actions to improve the results of those students whose predicted result is not satisfactory. A training dataset of students is used to build the Naïve Bayes model, which is then applied to test data to predict the end-of-semester results; a number of attributes are considered to predict the grade of a student [1].

Tajunisha and Anjali discuss predicting student performance using MapReduce. The authors introduce the MapReduce concept to improve accuracy and reduce time complexity. A deadline constraint is also introduced, and based on it an extensional MapReduce Task Scheduling algorithm for Deadline constraints (MTSD) is proposed. It allows the user to specify a deadline for a job (a classification process in data mining) and tries to finish the job before that deadline. The proposed system achieves higher classification accuracy even on big data and also reduces time complexity [2]. Another study, by Mashael and Muna, focuses on predicting students' final GPA using decision trees [3]. The authors applied the J48 decision tree algorithm to discover classification rules, extracted useful knowledge, and identified the most important courses in the students' study plan based on their grades in the mandatory courses. Karthikeyan and Thangaraju proposed a work in which genetic algorithm and particle swarm optimization search techniques with correlation-based feature selection are used for evaluation, and a naïve Bayes classifier is used for classification. Accuracy and time are the outcomes of the classification model, and various measures like sensitivity, specificity, precision and recall are also calculated [4]. Lumbini and Pravin [5] propose an experiment that attempts to detect students' failure in order to improve their academic performance. They apply different approaches to resolve the problem of high dimensionality, using classification algorithms on an engineering-students data set.

In Predictive Analytics Using Data Mining Technique [6], Hina Gulati presents data mining work on predicting student dropout. The author also applied some feature selection algorithms; the tool used for feature selection and mining is WEKA. Another work, by Jai and David, discusses the analysis of influencing factors in predicting students' performance using MLP, a comparative study [7]. That paper focuses on analyzing the prediction accuracy of academic performance from influencing factors using the Multi-Layer Perceptron algorithm and comparing it with the prediction accuracy of other approaches. Anal and Devadatta discuss the application of feature selection methods in educational data mining. Different feature selection algorithms are applied to their data set, and the best results are obtained by the Correlation-Based Feature Selection algorithm with 8 features; classification algorithms may then be applied to this feature subset for predicting student grades [8]. In another work, the same authors discuss early prediction of students' performance using machine learning techniques. A set of attributes is first defined, then feature selection algorithms are applied to the data set to reduce the number of features. Five classes of machine learning algorithms are then applied to this data set, and the best results were obtained with the decision-tree class of algorithms [9].

III. MATERIALS AND METHODS

A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets with an evaluation measure that scores the different feature subsets. The simplest algorithm is to test each possible subset of features, finding the one that minimizes the error rate. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics that distinguish the three main categories of feature selection algorithms: wrappers, filters and embedded methods. Wrapper methods use a predictive model to score feature subsets. Filter methods use a proxy measure instead of the error rate to score a feature subset; this measure is chosen to be fast to compute while still capturing the usefulness of the feature set. Embedded


methods are a catch-all group of techniques which perform feature selection as part of the model construction process [10].

A. Correlation Based Feature Subset Selection
CFS is a correlation-based filter method [11]. It gives high scores to subsets that include features that are highly correlated with the class attribute but have low correlation with each other. Let S be an attribute subset with k attributes, let r_cf be the average correlation of the attributes to the class attribute, and let r_ff be the average intercorrelation between attributes. Then:

Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff)

B. Gain Ratio Attribute Evaluator
The Gain Ratio Attribute Evaluator is a simple individual attribute ranking mechanism. In this technique, each attribute is assigned a score defined as the attribute's information gain (the difference between the class entropy and the class entropy conditioned on the attribute) divided by the attribute's own entropy [12]:

GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute)

Classification is a data mining task that predicts group membership for data instances. In this research work, classification techniques are used to predict the class of the graduate student and to see how the other attributes affect performance. The classifier used in this study is the Naïve Bayes algorithm.

C. Naïve Bayes
The Naïve Bayes classifier technique is suited to cases where the dimensionality of the input is high. It is a simple algorithm, yet it often gives better output than more complex ones. This classifier is used to predict student dropout by calculating the probability of each input for a predictable state [13].

IV. EXPERIMENTAL DATA

The dataset is a collection of first-year students' information covering 5 undergraduate degree courses, collected from SSBSTAS College, Thiruvalluvar University, Tamilnadu, for the period 2013-2014. The student data set contains 257 records with 21 attributes, which include gender, category of admission, living location, family size, family type, annual income of the family, father's qualification and mother's qualification. The attributes referring to the students' pre-college characteristics include the student's grade in high school and grade in senior secondary school. The attributes describing other college features include the student's branch of study, place of stay, previous semester mark, class test performance, seminar performance, assignment, general proficiency, class attendance and performance in laboratory work. Table 1 shows the description of the attributes.
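The CFS merit and gain-ratio scores defined in Section III reduce to short computations. A minimal plain-Python sketch (an illustration of the two formulas, not WEKA's implementation; the toy class/attribute columns and the correlation values are made up for the example):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a list of symbols, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(cls, attr):
    """H(Class | Attribute): class entropy within each attribute value, weighted."""
    n = len(cls)
    h = 0.0
    for v in set(attr):
        subset = [c for c, a in zip(cls, attr) if a == v]
        h += (len(subset) / n) * entropy(subset)
    return h

def gain_ratio(cls, attr):
    """GainR = (H(Class) - H(Class | Attr)) / H(Attr), as used by the evaluator."""
    return (entropy(cls) - conditional_entropy(cls, attr)) / entropy(attr)

def cfs_merit(k, r_cf, r_ff):
    """Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff) for a k-attribute subset."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: class = pass/fail, one candidate attribute = attendance level
cls = ["pass", "pass", "fail", "pass", "fail", "fail"]
att = ["good", "good", "poor", "good", "poor", "good"]
print(round(gain_ratio(cls, att), 3))
print(round(cfs_merit(6, 0.4, 0.2), 3))   # k=6 mirrors the CFS subset size used later
```

The gain-ratio helper works on any pair of categorical columns, so the same code can rank each attribute of a data set individually, as the evaluator does.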

Table 1: Student Data Set Description

Variable | Description | Possible Values
Gender | Student's sex | {Male, Female}
Branch | Student's branch | {BCA, B.SC, B.COM, B.A}
Cat | Student's category | {BC, MBC, MSC, OC, SBC, SC}
HSG | Student's grade in High School | {O: 90%-100%, A: 80%-89%, B: 70%-79%, C: 60%-69%, D: 50%-59%, E: 35%-49%, FAIL: <35%}
SSG | Student's grade in Senior Secondary | {O: 90%-100%, A: 80%-89%, B: 70%-79%, C: 60%-69%, D: 50%-59%, E: 35%-49%, FAIL: <35%}
Medium | Medium of instruction | {Tamil, English, Others}
LLoc | Living location of student | {Village, Taluk, Rural, Town, District}
HOS | Student stays in hostel or not | {Yes, No}
FSize | Student's family size | {1, 2, 3, >3}
FType | Student's family type | {Joint, Individual}
FINC | Family annual income | {Poor, Medium, High}
FQual | Father's qualification | {No-education, Elementary, Secondary, UG, PG, Ph.D}
MQual | Mother's qualification | {No-education, Elementary, Secondary, UG, PG, Ph.D, NA}
PSM | Previous semester mark | {First: >60%, Second: >45% & <60%, Third: >36% & <45%, Fail: <36%}
CTG | Class test grade | {Poor, Average, Good}
SEM_P | Seminar performance | {Poor, Average, Good}
ASS | Assignment | {Yes, No}
GP | General proficiency | {Yes, No}
ATT | Attendance | {Poor, Average, Good}
LW | Lab work | {Yes, No}
ESM | End semester marks | {First: >60%, Second: >45% & <60%, Third: >36% & <45%, Fail: <36%}
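The experiments below load the student data from a CSV file. A hedged sketch of writing one record with the Table 1 attribute names (the sample values are invented examples drawn from the listed value sets, and the filename `students.csv` is an assumption, not from the paper):

```python
import csv

# Column order follows Table 1; the 21 attribute names match the data set description.
HEADER = ["Gender", "Branch", "Cat", "HSG", "SSG", "Medium", "LLoc", "HOS",
          "FSize", "FType", "FINC", "FQual", "MQual", "PSM", "CTG", "SEM_P",
          "ASS", "GP", "ATT", "LW", "ESM"]

# One hypothetical student record (illustration only, not real data)
sample = ["Female", "BCA", "BC", "A", "B", "English", "Town", "No",
          "3", "Joint", "Medium", "UG", "Secondary", "First", "Good", "Good",
          "Yes", "Yes", "Good", "Yes", "First"]

with open("students.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)   # WEKA's CSV loader reads attribute names from row 1
    writer.writerow(sample)
```

A file in this shape can be opened directly in the WEKA Explorer, which infers the nominal attribute values from the column contents.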

For the purpose of designing and evaluating our experiments, we have used WEKA. It is open-source software, freely available for mining data, and it implements a large collection of mining algorithms. It can accept data in various formats and also has format converters, so we converted the student dataset into a CSV file. Under "Test options", 10-fold cross-validation is selected as our evaluation process. The various performance metrics are discussed as follows.

The accuracy of the predictive model is calculated from the true positive rate, false positive rate, precision and recall values [14].

TP rate (true positive rate): a positive result that accurately reflects the tested-for activity. If the outcome from a prediction is p and the actual value is also p, it is called a true positive (TP).
TP rate = TP / P, where P = (TP + FN)

TN rate (true negative rate): occurs when both the prediction outcome and the actual value are n.
TN rate = TN / N, where N = (TN + FP)

FP rate (false positive rate): if the outcome from a prediction is p and the actual value is n, it is said to be a false positive (FP).
FP rate = FP / (FP + TN)

Precision: the fraction of retrieved instances that are relevant.
Precision = TP / (TP + FP)

Recall: the fraction of relevant instances that are retrieved.
Recall = TP / (TP + FN)

V. RESULTS AND DISCUSSION

The present investigation focuses on two feature selection techniques, namely cfsSubsetEval and GainRatioAttributeEval, which are among the most important and frequently used preprocessing steps in data mining. Using these attribute selection algorithms we can select, out of a huge number of student attributes, the best attributes affecting the students' performance. The results are then obtained with the Naïve Bayes classifier. Table 2 shows the results of applying the two feature selection algorithms.

Table 2: Best Selected Attributes

Algorithm | Attributes Selected
cfsSubsetEval | Branch, SSG, FINC, PSM, GP, ATT
GainRatioAttributeEval | Age, Branch, Cat, SSG, Medium, ATT, GP, FINC, FQual, MQual, HSG, SEM_P, LLoc

A. Results of cfsSubset Evaluator
In this experiment the Correlation-Based Feature Selection algorithm, with its 6 selected attributes, was used together with the Naïve Bayes classifier on the data set, and the results are presented in Table 3. Naïve Bayes correctly classifies about 84.2% of instances under 10-fold cross-validation. The true positive rate is high for the classes Second and First, whereas it is very low for the class Third. Fig. 1 shows the graphical representation of the classifier.

Figure 1: Result of CfsSubset Evaluator

B. Results of GainRatioAttributeEvaluator
The present study implements the Gain Ratio Attribute Evaluator with its 13 selected attributes. The result of the Naïve Bayes classifier is shown in Table 4. The classifier correctly classifies about 74.4% of instances under 10-fold cross-validation. The true positive rate is high for the class Second and very low for the class Third. Fig. 2 shows the graphical representation of the Naïve Bayes classification algorithm.

Figure 2: Result of Gain-Ratio Attribute Evaluator
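The per-class figures reported in Tables 3 and 4 follow the metric definitions given earlier. A minimal sketch computing them from binary (one-vs-rest) confusion-matrix counts — the counts below are invented for illustration and do not come from the paper's data:

```python
def binary_metrics(tp, fn, fp, tn):
    """Evaluation metrics, as defined in Section V, from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)            # recall / sensitivity
    fp_rate = fp / (fp + tn)
    tn_rate = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = 2 * precision * recall / (precision + recall)
    return {"TP Rate": tp_rate, "FP Rate": fp_rate, "TN Rate": tn_rate,
            "Precision": precision, "Recall": recall, "F-Measure": f_measure}

# Hypothetical counts for one class treated as "positive"
m = binary_metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Averaging these per-class values, weighted by class frequency, gives the "Weighted Avg." rows of Tables 3 and 4.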

Table 3: Classifier Result for CfsSubsetEvaluator

Naïve Bayes – 10 fold cross validation

Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
Second | 0.888 | 0.205 | 0.664 | 0.888 | 0.759 | 0.885
Fail | 0.333 | 0.034 | 0.429 | 0.333 | 0.375 | 0.823
First | 0.759 | 0.162 | 0.859 | 0.759 | 0.806 | 0.875
Distinction | 0.25 | 0.016 | 0.859 | 0.25 | 0.316 | 0.835
Third | 0 | 0 | 0 | 0 | 0 | 0.059
Weighted Avg. | 0.842 | 0.359 | 0.844 | 0.842 | 0.835 | 0.869

Table 4: Classifier Result for GainRatioAttributeEvaluator

Naïve Bayes – 10 fold cross validation

Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
Second | 0.873 | 0.25 | 0.683 | 0.873 | 0.767 | 0.859
Fail | 0 | 0.015 | 0 | 0 | 0 | 0.631
First | 0.764 | 0.155 | 0.848 | 0.764 | 0.804 | 0.886
Distinction | 0.143 | 0.015 | 0.25 | 0.143 | 0.182 | 0.825
Third | 0 | 0 | 0 | 0 | 0 | 0.053
Weighted Avg. | 0.744 | 0.179 | 0.72 | 0.744 | 0.726 | 0.867

Table 5: Overall Accuracy (TP Rate per class) of Feature Selection Algorithms with Naïve Bayes

Algorithm | Second | Fail | First | Distinction | Third | Weighted Avg.
cfsSEval | 0.888 | 0.333 | 0.759 | 0.25 | 0 | 0.842
GRAE | 0.873 | 0 | 0.764 | 0.143 | 0 | 0.744
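The Naïve Bayes classifier compared in Tables 3-5 admits a compact plain-Python sketch (an illustration only, not WEKA's implementation; the training records below are invented, with values drawn from the Table 1 categories):

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Tiny Naïve Bayes for categorical attributes, with Laplace smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.log_prior = {c: math.log(self.class_counts[c] / len(labels))
                          for c in self.classes}
        self.value_counts = defaultdict(Counter)   # (class, attr index) -> value counts
        self.attr_values = [set() for _ in rows[0]]
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                self.value_counts[(c, i)][v] += 1
                self.attr_values[i].add(v)
        return self

    def predict(self, row):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = self.log_prior[c]
            for i, v in enumerate(row):
                num = self.value_counts[(c, i)][v] + 1            # Laplace smoothing
                den = self.class_counts[c] + len(self.attr_values[i])
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Made-up training records over three Table 1-style attributes: ATT, PSM, GP
rows = [("Good", "First", "Yes"), ("Good", "First", "Yes"),
        ("Poor", "Fail", "No"), ("Average", "Second", "Yes"),
        ("Poor", "Third", "No"), ("Good", "Second", "Yes")]
labels = ["First", "First", "Fail", "Second", "Fail", "Second"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("Good", "First", "Yes")))
```

Feature selection plugs in directly: dropping attribute columns from `rows` before `fit` is the programmatic equivalent of keeping only the subsets listed in Table 2.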


C. Performance comparison between the Feature Selection Algorithms
The performance of the selected feature selection algorithms with the Naïve Bayes classifier is summarized in Table 5. The results reveal that the Correlation-Based Feature Subset Evaluator, with 6 attributes, performs very well in comparison with Gain Ratio, which uses 13 attributes. The overall accuracy of the CFS algorithm is about 84%, whereas Gain Ratio is less accurate at just 74%. The classification accuracy is very good for the classes Second and First. Further analysis of the prediction results shows that accuracy is low for the class Distinction and worst for the class Third.

Figure 3: Overall accuracy of Feature Selection Algorithms

VI. CONCLUSION

This research work presents a case study in educational data mining. The obtained results show that feature selection techniques can improve the accuracy and efficiency of the classification algorithm by removing irrelevant and redundant attributes; here they were used specifically to improve student performance prediction. The most relevant features were obtained using the GainRatio and CFS subset evaluators, and Naïve Bayes classifiers were applied on the selected features. From the results, it is concluded that the Correlation-Based Feature Subset Evaluator performs well with the Naïve Bayes classifier as compared with the Gain-Ratio Attribute Evaluator. In future, this work will extend the experiment with different data mining techniques, such as clustering, applied with other feature selection algorithms on larger data sets in the same educational field.

REFERENCES

[1] Humera Shaziya, Raniah Zaheer, Kavitha G., "Prediction of Students in Semester Exams using a Naïve Bayes Classifier", Int. Journal of Innovative Research in Science, Engineering and Technology, Vol. 4, Issue 10, 2015, pp. 9823-9829.
[2] Tajunisha N., Anjali M., "Predicting Student Performance Using MapReduce", Int. Journal of Engineering and Computer Science, Vol. 4, Issue 1, 2015, pp. 9971-9976.
[3] Mashael A. Al-Barrak, Muna Al-Razgan, "Predicting Students Final GPA Using Decision Trees: A Case Study", Int. Journal of Information and Education Technology, Vol. 6, No. 7, 2016, pp. 528-533.
[4] Karthikeyan T., Thangaraju P., "Genetic Algorithm based CFS and Naïve Bayes Algorithm to Enhance the Predictive Accuracy", Indian Journal of Science and Technology, Vol. 8, No. 26, 2015, pp. 1-8.
[5] Lumbini P. Khobragade, Pravin Mahadik, "Students Academic Failure Prediction Using Data Mining", Int. Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, Issue 11, 2015, pp. 290-298.
[6] Hina Gulati, "Predictive Analytics Using Data Mining Technique", 2nd International Conference on Computing for Sustainable Global Development, 2015, pp. 713-716.
[7] Jai Ruby, K. David, "Analysis of Influencing Factors in Predicting Students Performance Using MLP - A Comparative Study", Int. Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 2, 2015, pp. 1085-1092.
[8] Anal Acharya, Devadatta Sinha, "Application of Feature Selection Methods in Educational Data Mining", Int. Journal of Computer Applications, Vol. 103, No. 2, 2014, pp. 34-38.
[9] Anal Acharya, Devadatta Sinha, "Early Prediction of Students Performance using Machine Learning Techniques", Int. Journal of Computer Applications, Vol. 107, No. 1, 2014, pp. 37-43.
[10] Isabelle Guyon, André Elisseeff, "An Introduction to Variable and Feature Selection", The Journal of Machine Learning Research, Vol. 3, 2003, pp. 1157-1182.
[11] M. A. Hall, L. A. Smith, "Practical Feature Subset Selection for Machine Learning", Australian Computer Science Conference, Springer, 1998, pp. 181-191.
[12] Muhammad Naeem, "An Empirical Analysis and Performance Evaluation of Feature Selection Techniques for Belief Network Classification System", Int. Journal of Control and Automation, Vol. 8, No. 3, 2015, pp. 375-388.
[13] Mital Doshi, Setu K. Chaturvedi, "Correlation Based Feature Selection (CFS) Technique to Predict Student Performance", Int. Journal of Computer Networks & Communications, Vol. 6, No. 3, 2014, pp. 197-206.
[14] P. V. Praveen Sundar, "A Comparative Study for Predicting Students Academic Performance using Bayesian Network Classifiers", IOSR Journal of Engineering, Vol. 3, Issue 2, 2013, pp. 37-42.
