1 author:
Supriya Byreddy
RK University
All content following this page was uploaded by Supriya Byreddy on 05 April 2014.
Abstract:- Growing interest in both data mining and educational systems has made educational data mining an emerging
research community. The goal of institutions is to give quality education to their students. One way to achieve the
highest level of quality in a higher education system is to discover knowledge that predicts the enrolment of students
in a particular course. In our data-driven data mining model, the knowledge already exists in the data but is not in a
form that humans can understand. Data mining is treated as the process of transforming that knowledge into a
human-understandable format such as a rule, formula, or theorem. This article reviews the available literature on
educational data mining, classification methods, and the different feature selection techniques that can be applied to
a student dataset. The knowledge is hidden in the educational dataset and can be extracted through data mining
techniques.
Keywords: Data Mining, Education data mining, Knowledge discovery from data (KDD), Decision Tree,
Classification techniques, Attribute Selection techniques.
I. INTRODUCTION
Data mining is the iterative and interactive process of discovering valid, novel, useful, and understandable
knowledge (patterns, models, rules, etc.) in massive databases.
The key terms in this definition are:
Valid: generalize to the future
Novel: what we don't know
Useful: be able to take some action
Understandable: leading to insight
Iterative: takes multiple passes
Interactive: human in the loop
Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from
data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. [1]
Over the past few years, a large number of engineering institutes have opened in India. This has caused cut-throat
competition among institutes to attract students and get them to enroll on their campuses. Most of the institutes
operate in self-finance mode, so they are constantly short of funds. Providing quality education to students is one of
the most important responsibilities of any university or institution. Quality education does not mean that a high level
of knowledge is produced; it means that education is delivered to students efficiently, so that they learn without
difficulty. For this purpose, quality education includes features such as teaching methodology, continuous evaluation,
and the categorization of students into similar groups, so that the students in a group have similar objectives,
demographics, educational backgrounds, etc. [2]
Engineering degrees are mostly offered in different curriculum structures. Engineering students must fulfill
strict requirements in order to graduate and hold a degree in the engineering profession. Engineering students are
spread across a number of departments, mainly civil, electrical, mechanical, computer, electronics, communication,
information technology, chemical, mining, metallurgical, textile, and environmental engineering. Most engineering
institutes offer the first five or six of these as major courses.
This education is residential, and at the beginning students are affected by various factors related to their
academic path. Most of the core courses are the same for all students in the first year; they essentially comprise
Mathematics, Physics, and Chemistry. These courses are prerequisites for almost all major courses: they expose
students to the fundamental concepts required to pursue specialized theory in their further studies. Core courses
therefore play a decisive role in the performance of the students enrolled in this study.
Owing to the growing number of students and institutions, higher education institutions (HEIs) are becoming
more oriented toward performance and its measurement, and are accordingly setting goals and developing strategies to
achieve them. [5]
As shown in Fig. 1, educators and the academics responsible are in charge of designing, planning, building, and
maintaining the educational systems. Students use and interact with them. Starting from all the available information
about courses, students, usage, and interaction, different data mining techniques can be applied to discover useful
knowledge that helps to improve the e-learning process. The discovered knowledge can be used not only by providers
(educators) but also by the users themselves (students). The application of data mining in educational systems can
therefore be oriented to different actors, each with a particular point of view. [4]
The recent literature related to educational data mining (EDM) is presented. Educational data mining is an
emerging discipline that focuses on applying data mining tools and techniques to educationally related data. Researchers
within EDM focus on topics ranging from using data mining to improve institutional effectiveness to applying data
mining to improve student learning processes. Because the range of topics within educational data mining is wide, this
paper focuses exclusively on ways that data mining is used to improve student success and processes directly related to
student learning. For example, student success and retention, personalized recommender systems, and the evaluation of
student learning within course management systems (CMS) are all topics within the broad field of educational data mining.
A large number of engineering students fail during their engineering course. The paper is structured as
follows: Section II presents the KDD (Knowledge Discovery from Databases) process. Section III explains different
decision tree methods of classification, such as ID3, C4.5, CART, and ADT. Section IV presents different attribute
selection techniques for filtering the best attributes from a student database.
Knowledge discovery as a process is depicted in Figure 2. KDD consists of an iterative sequence of the following
steps: [16]
Komal et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(12),
December - 2013, pp. 628-635
1. Develop an understanding of the application domain and identify the goal.
2. Create a target dataset
Select a dataset, or focus on a subset of samples or variables, on which to make discoveries.
3. Data cleaning and preprocessing
Remove noise and outliers; collect the information necessary to model or account for noise; handle missing data;
account for time-sequence information.
4. Data reduction and projection
Find useful features to represent the data relative to the goal; apply dimensionality reduction or transformation to
reduce the number of variables; identify invariant representations.
5. Selection of an appropriate data mining task
Summarization, classification, regression, clustering, etc.
6. Selection of data mining algorithm(s)
Choose methods to search for patterns; decide which models and parameters may be appropriate; match the method to the
goal of the KDD process.
7. Data mining
Search for patterns of interest in one or more representational forms.
8. Interpretation and visualization
Interpret the mined patterns; visualize the extracted patterns and models; visualize the data given the extracted
models.
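The steps above can be sketched as a small pipeline. The following Python sketch is only an illustration under stated assumptions: the records, the field names (score, city, enrolled), the threshold of 60, and every function name are hypothetical and do not come from any dataset described in this paper. The "mining" step is deliberately trivial (majority class per feature value) so that the flow from target selection to readable rules stays visible.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
raw = [
    {"score": 78, "city": "A", "enrolled": "yes"},
    {"score": None, "city": "B", "enrolled": "no"},
    {"score": 82, "city": "A", "enrolled": "yes"},
    {"score": 40, "city": "C", "enrolled": "no"},
    {"score": 65, "city": "B", "enrolled": "yes"},
    {"score": 35, "city": "C", "enrolled": "no"},
]

def select_target(records, fields):
    """Step 2: keep only the variables relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Step 3: one simple missing-data strategy, drop incomplete records."""
    return [r for r in records if all(v is not None for v in r.values())]

def reduce_features(records, threshold=60):
    """Step 4: project the numeric score onto a two-valued feature."""
    return [{"high_score": r["score"] >= threshold, "enrolled": r["enrolled"]}
            for r in records]

def mine(records):
    """Steps 5-7: a trivial classifier, the majority class per feature value."""
    counts = {}
    for r in records:
        counts.setdefault(r["high_score"], Counter())[r["enrolled"]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def interpret(rules):
    """Step 8: render the mined patterns as human-readable rules."""
    return [f"IF high_score={value} THEN enrolled={label}"
            for value, label in sorted(rules.items())]

target = select_target(raw, ["score", "enrolled"])
cleaned = clean(target)
reduced = reduce_features(cleaned)
rules = mine(reduced)
for line in interpret(rules):
    print(line)
```

In practice each step would be revisited several times, which is what makes the process iterative and interactive rather than a single pass.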
Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the
role of inferred knowledge. Deciding whether a model reflects useful knowledge is part of the overall KDD
process, for which subjective human judgment is usually required.
The more common techniques in current data mining practice include the following.
1) Classification: assigns a data item to one of several predefined categorical classes.
2) Regression: maps a data item to a real-valued prediction variable.
3) Clustering: groups data items into clusters so that similarity within a cluster is maximized and similarity
between clusters is minimized.
4) Rule generation: extracts classification rules from the data.
5) Discovering association rules: describes association relationships among different attributes.
6) Summarization: provides a compact description of a subset of data.
7) Dependency modeling: describes the dependencies among variables.
C. CART
CART stands for Classification and Regression Trees, introduced by Breiman [8]. It is also based on Hunt's
algorithm. CART handles both categorical and continuous attributes when building a decision tree, and it handles
missing values.
CART uses the Gini index as its attribute selection measure. Unlike the ID3 and C4.5 algorithms, CART produces
binary splits and hence binary trees. Unlike the measures used by ID3 and C4.5, the Gini index does not rely on
probabilistic assumptions. CART uses cost-complexity pruning to remove unreliable branches from the decision tree and
so improve accuracy.
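To make the Gini-based binary split concrete, here is a minimal Python sketch. It is not Breiman's full CART procedure (no recursion, no pruning, no missing-value handling); it only computes the Gini impurity of a label set and searches for the single threshold on one numeric attribute that minimizes the weighted impurity of the two children. The marks and pass/fail labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: first-year marks and the final result.
marks = [35, 40, 55, 65, 78, 82]
result = ["fail", "fail", "fail", "pass", "pass", "pass"]

print(gini(result))
print(best_binary_split(marks, result))
```

A full tree builder would apply `best_binary_split` to every attribute, pick the attribute with the lowest weighted impurity, and recurse on the two child partitions.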
1. CfsSubsetEval
Synopsis
Evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature
along with the degree of redundancy between them.
Options:
Option | Description
locallyPredictive | Identify locally predictive attributes. Iteratively adds attributes with the highest correlation with the class as long as there is not already an attribute in the subset that has a higher correlation with the attribute in question.
Capabilities:
Capability | Supported
Class | Missing class values, Numeric class, Nominal class, Date class, Binary class
Attributes | Empty nominal attributes, Nominal attributes, Numeric attributes, Unary attributes, Date attributes, Binary attributes, Missing values
Min # of instances | 1
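The trade-off between predictive ability and redundancy can be illustrated with the merit heuristic from Hall's correlation-based feature selection, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff). This formula is not stated in this paper; it is supplied here as background, and the correlation values in the example are hypothetical. The sketch takes the averaged correlations as inputs rather than computing them from data.

```python
from math import sqrt

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-attribute subset under the CFS heuristic.

    mean_feature_class_corr:   average attribute-class correlation (r_cf)
    mean_feature_feature_corr: average attribute-attribute correlation (r_ff)
    """
    numerator = k * mean_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Hypothetical subsets of three attributes with equal predictive ability
# but different amounts of redundancy among the attributes.
low_redundancy = cfs_merit(3, 0.6, 0.1)
high_redundancy = cfs_merit(3, 0.6, 0.8)

print(low_redundancy, high_redundancy)
```

With the same attribute-class correlation, the subset whose attributes are less correlated with each other receives the higher merit, which is the behavior the synopsis describes.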
2. ChiSquaredAttributeEval
Synopsis
Evaluates an attribute by computing the value of the chi-squared statistic with respect to the class.
Options
Option | Description
binarizeNumericAttributes | Only binarize numeric attributes instead of properly discretizing them.
missingMerge | Distribute the counts for missing values. Counts are distributed across other values in proportion to their frequency. Otherwise, missing is treated as a separate value.
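As an illustration of the statistic itself, the following Python sketch computes Pearson's chi-squared value from the contingency table of one nominal attribute against the class. It is a plain textbook computation under the assumption of nominal values with no missing entries; it does not reproduce the evaluator's discretization or missing-value options, and the data are hypothetical.

```python
from collections import Counter

def chi_squared(attribute_values, class_values):
    """Pearson chi-squared statistic of an attribute with respect to the class."""
    n = len(class_values)
    observed = Counter(zip(attribute_values, class_values))
    attribute_totals = Counter(attribute_values)
    class_totals = Counter(class_values)
    statistic = 0.0
    for a, a_count in attribute_totals.items():
        for c, c_count in class_totals.items():
            expected = a_count * c_count / n  # count expected under independence
            statistic += (observed[(a, c)] - expected) ** 2 / expected
    return statistic

# Hypothetical data: an attribute that determines the class, and one that does not.
result = ["pass", "pass", "pass", "fail", "fail", "fail"]
attended_tutorials = ["yes", "yes", "yes", "no", "no", "no"]
shift = ["am", "pm", "am", "pm", "am", "pm"]

print(chi_squared(attended_tutorials, result))
print(chi_squared(shift, result))
```

A larger value indicates a stronger departure from independence between the attribute and the class, so attributes can be ranked by this score.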
3. ConsistencySubsetEval
Synopsis
Evaluates the worth of a subset of attributes by the level of consistency in the class values when the training
instances are projected onto the subset of attributes.
Capabilities
Capability | Supported
Class | Nominal class, Missing class values, Binary class
Attributes | Date attributes, Empty nominal attributes, Nominal attributes, Numeric attributes, Binary attributes, Missing values, Unary attributes
Min # of instances | 1
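One common way to quantify this level of consistency is to project the instances onto the subset, group them by the resulting value pattern, and count how many instances agree with the majority class of their group. The Python sketch below follows that definition; it is an assumption about the measure rather than a reproduction of the evaluator's code, and the rows and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

def consistency(instances, subset, class_key):
    """Fraction of instances that match the majority class of their projected pattern."""
    groups = defaultdict(Counter)
    for row in instances:
        pattern = tuple(row[name] for name in subset)  # projection onto the subset
        groups[pattern][row[class_key]] += 1
    majority_total = sum(max(counts.values()) for counts in groups.values())
    return majority_total / len(instances)

# Hypothetical student records.
rows = [
    {"branch": "CE", "hostel": "yes", "result": "pass"},
    {"branch": "CE", "hostel": "no", "result": "fail"},
    {"branch": "IT", "hostel": "yes", "result": "pass"},
    {"branch": "IT", "hostel": "no", "result": "pass"},
]

print(consistency(rows, ["branch"], "result"))
print(consistency(rows, ["branch", "hostel"], "result"))
```

A subset search would prefer the smallest subset whose consistency matches that of the full attribute set.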
4. FilteredAttributeEval
Synopsis
Class for running an arbitrary attribute evaluator on data that has been passed through an arbitrary filter (note: filters that
alter the order or number of attributes are not allowed). Like the evaluator, the structure of the filter is based exclusively
on the training data.
Options
Option | Description
attributeEvaluator | The attribute evaluator to be used.
filter | The filter to be used.
Capabilities
Capability | Supported
Class | Nominal class, Binary class
Attributes | Missing values, Date attributes, Unary attributes, Empty nominal attributes, Numeric attributes, Nominal attributes, Binary attributes, Relational attributes, String attributes
Min # of instances | 0
5. OneRAttributeEval
Synopsis
Evaluates an attribute by using the OneR classifier.
Options
Option | Description
evalUsingTrainingData | Use the training data to evaluate attributes rather than cross validation.
folds | Set the number of folds for cross validation.
minimumBucketSize | The minimum number of objects in a bucket (passed to OneR).
seed | Set the seed for use in cross validation.
Capabilities
Capability | Supported
Class | Binary class, Missing class values, Nominal class
Attributes | Date attributes, Nominal attributes, Empty nominal attributes, Missing values, Unary attributes, Numeric attributes, Binary attributes
Min # of instances | 1
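The idea behind a OneR-based score can be sketched in a few lines of Python: build a one-attribute rule that predicts the majority class for each attribute value, then measure how often the rule is right. This sketch scores the rule on the training data, which corresponds to the evalUsingTrainingData setting; cross validation, bucket sizes for numeric attributes, and seeding are left out. The function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(attribute_values, class_values):
    """Return (rule, training_accuracy) for a single-attribute majority rule."""
    by_value = defaultdict(Counter)
    for a, c in zip(attribute_values, class_values):
        by_value[a][c] += 1
    # The rule maps each attribute value to its most frequent class.
    rule = {a: counts.most_common(1)[0][0] for a, counts in by_value.items()}
    correct = sum(max(counts.values()) for counts in by_value.values())
    return rule, correct / len(class_values)

# Hypothetical data: attendance level and final result.
attendance = ["high", "high", "low", "low", "high", "low"]
result = ["pass", "pass", "fail", "fail", "fail", "fail"]

rule, accuracy = one_r(attendance, result)
print(rule, accuracy)
```

Ranking attributes then amounts to computing this accuracy for each attribute in turn and sorting in descending order.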
Options
Option | Description
filter | The filter to be used.
subsetEvaluator | The subset evaluator to be used.
Capabilities
Capability | Supported
Class | Nominal class, Binary class
Attributes | Relational attributes, Nominal attributes, Missing values, Binary attributes, Empty nominal attributes, Unary attributes, Numeric attributes, String attributes, Date attributes
Min # of instances | 0
7. GainRatioAttributeEval
Synopsis
Evaluates an attribute by measuring the gain ratio with respect to the class.
GainR (Class, Attribute) = (H (Class) - H (Class | Attribute)) / H (Attribute).
Options
Option | Description
missingMerge | Distribute counts for missing values. Counts are distributed over other values in proportion to their frequency. Otherwise, missing is treated as a separate value.
Capabilities
Capability | Supported
Class | Missing class values, Nominal class, Binary class
Attributes | Nominal attributes, Date attributes, Binary attributes, Empty nominal attributes, Numeric attributes, Missing values, Unary attributes
Min # of instances | 1
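The gain ratio formula can be worked through directly. The Python sketch below computes H(Class), H(Class | Attribute), and H(Attribute) for nominal values with no missing entries, then applies the formula as written; the helper names and the toy data are hypothetical. The second attribute is an identifier-like column, included to show how dividing by H(Attribute) penalizes attributes with many distinct values.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) in bits."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())

def conditional_entropy(class_values, attribute_values):
    """H(Class | Attribute): class entropy averaged over the attribute's values."""
    n = len(class_values)
    total = 0.0
    for a, count in Counter(attribute_values).items():
        subset = [c for c, x in zip(class_values, attribute_values) if x == a]
        total += count / n * entropy(subset)
    return total

def gain_ratio(class_values, attribute_values):
    """GainR = (H(Class) - H(Class | Attribute)) / H(Attribute)."""
    split_info = entropy(attribute_values)
    if split_info == 0:
        return 0.0  # a single-valued attribute carries no information
    gain = entropy(class_values) - conditional_entropy(class_values, attribute_values)
    return gain / split_info

# Hypothetical data.
result = ["pass", "pass", "fail", "fail"]
attendance = ["high", "high", "low", "low"]
roll_number = ["1", "2", "3", "4"]

print(gain_ratio(result, attendance))
print(gain_ratio(result, roll_number))
```

Both attributes separate the classes perfectly, but the identifier-like attribute has the larger H(Attribute) and therefore the lower gain ratio.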
8. InfoGainAttributeEval
Synopsis
Evaluates an attribute by measuring the information gain with respect to the class.
InfoGain (Class, Attribute) = H (Class) - H (Class | Attribute).
Options
Option | Description
binarizeNumericAttributes | Just binarize numeric attributes rather than properly discretizing them.
missingMerge | Distribute the counts for missing values. Counts are distributed over other values in proportion to their frequency. Otherwise, missing is treated as a separate value.
Capabilities
Capability | Supported
Class | Binary class, Nominal class, Missing class values
Attributes | Nominal attributes, Missing values, Numeric attributes, Unary attributes, Date attributes, Empty nominal attributes, Binary attributes
Min # of instances | 1
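An attribute evaluator of this kind is typically used to rank attributes. The following Python sketch applies the information gain formula to each attribute of a small table and sorts the attribute names by score. It assumes nominal values with no missing entries; the table, its column names, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) in bits."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())

def info_gain(class_values, attribute_values):
    """InfoGain = H(Class) - H(Class | Attribute)."""
    n = len(class_values)
    remainder = 0.0
    for a in set(attribute_values):
        subset = [c for c, x in zip(class_values, attribute_values) if x == a]
        remainder += len(subset) / n * entropy(subset)
    return entropy(class_values) - remainder

# Hypothetical student table, one list per attribute.
students = {
    "attendance": ["high", "high", "low", "low"],
    "hostel": ["yes", "no", "yes", "no"],
}
result = ["pass", "pass", "fail", "fail"]

scores = {name: info_gain(result, values) for name, values in students.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Here attendance fully determines the result and receives a gain of one bit, while hostel tells nothing about the result and receives zero.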
10. SymmetricalUncertAttributeEval
Synopsis
Evaluates an attribute by measuring the symmetrical uncertainty with respect to the class.
SymmU (Class, Attribute) = 2 * (H (Class) - H (Class | Attribute)) / (H (Class) + H (Attribute)).
Options
Option | Description
missingMerge | Distribute counts for missing values. Counts are distributed over other values in proportion to their frequency. Otherwise, missing is treated as a separate value.
Capabilities
Capability | Supported
Class | Missing class values, Nominal class, Binary class
Attributes | Nominal attributes, Date attributes, Binary attributes, Empty nominal attributes, Numeric attributes, Missing values, Unary attributes
Min # of instances | 1
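Symmetrical uncertainty normalizes the information gain by the sum of the two entropies, which keeps the score between 0 and 1 and makes it independent of argument order. The Python sketch below uses the identity H(Class) - H(Class | Attribute) = H(Class) + H(Attribute) - H(Class, Attribute) to compute the numerator from a joint entropy. It assumes nominal values with no missing entries, and the data and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) in bits."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SymmU = 2 * (H(X) - H(X | Y)) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    # H(X) - H(X | Y) equals H(X) + H(Y) - H(X, Y).
    mutual_information = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_information / (hx + hy)

# Hypothetical data.
result = ["pass", "pass", "fail", "fail"]
attendance = ["high", "high", "low", "low"]
hostel = ["yes", "no", "yes", "no"]

print(symmetrical_uncertainty(result, attendance))
print(symmetrical_uncertainty(result, hostel))
```

Swapping the two arguments leaves the score unchanged, which is the sense in which the measure is symmetrical.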
V. CONCLUSION
Data mining has the ability to uncover hidden patterns in large databases; community colleges and universities can
build models that predict, with a high degree of accuracy, the behavior of population clusters. By acting on these
predictive models, educational institutions can effectively address issues ranging from transfers and retention, to
marketing and alumni relations.
REFERENCES
[1] U. Fayyad and R. Uthurusamy, “Data mining and knowledge discovery in databases,” Commun. ACM, vol. 39,
pp. 24–27, 1996.
[2] Shiv Kumar Gupta, Sonal Gupta, and Ritu Vijay, “Prediction of student success that are going to enroll in the
higher technical education,” IJCSEITR, ISSN 2249-6831, vol. 3, issue 1, Mar. 2013, pp. 95–108.
[3] Richard A. Huebner (Norwich University), “A survey of educational data-mining research,” Research in Higher
Education Journal, 2012, pp. 1–13.
[4] C. Romero and S. Ventura, “Educational data mining: A survey from 1995 to 2005,” Expert Systems with
Applications, vol. 33, pp. 135–146, 2007.
[5] Zeljko Garaca, Maja Cukusic, and Mario Jadric, “Student dropout analysis with application of data mining
methods,” vol. 1, pp. 31–46, 2010.