
Proceedings of the International Conference on Electrical Engineering and Informatics (B-71)
Institut Teknologi Bandung, Indonesia, June 17-19, 2007

IMPLEMENTATION OF C4.5 ALGORITHM TO EVALUATE THE CANCELLATION POSSIBILITY OF NEW STUDENT APPLICANTS AT STMIK AMIKOM YOGYAKARTA
Kusrini1, Sri Hartati2

1 STMIK AMIKOM Yogyakarta, Jl. Ringroad Utara, Condong Catur, Sleman, Yogyakarta, Indonesia. Tel. +628157988801. Email: [email protected]

2 Gadjah Mada University, Faculty of Mathematics and Natural Sciences, Yogyakarta, Indonesia. Email: [email protected]

Abstract
Student application cancellation occurs frequently at STMIK AMIKOM Yogyakarta: a student candidate who has passed the admission test cancels his or her application by disregarding the next phase of the admission process (re-registration). This is detrimental to STMIK AMIKOM, because it keeps the number of new students below the desired capacity. If the possibility of registration cancellation can be detected early, the executive manager can take steps to keep the candidate in the admission process and thereby minimize the cancellation rate. This research detects the possibility of application withdrawal by recalling a previous experience suitable for solving the current problem; to make the case search and matching process easier, an indexing method is applied by building a decision tree. The decision tree is developed with the C4.5 algorithm, an improvement on its predecessor, the ID3 algorithm. The application was designed to be flexible: it allows the variables and training cases to be modified. As trial data, it used more than 1500 records of new student applicants for the 2006/2007 academic year at STMIK AMIKOM Yogyakarta.

1. Introduction
In 2006 at STMIK AMIKOM Yogyakarta, 1956 student candidates passed the admission test, but 499 of them cancelled their applications by disregarding re-registration; 25.5% of the potential students could not be retained by STMIK AMIKOM.
The cancellation rate should be minimized by the STMIK AMIKOM management, since incoming students are their source of operational and development finances.
If the possibility of a candidate's withdrawal can be detected early, the STMIK AMIKOM management is expected to be able to take action to make the candidate stay.
One technique for analyzing this possibility is to classify a set of candidate application data: whether or not a candidate is going to withdraw his or her application can be identified by looking up the candidate's classification. One well-known classification model is the decision tree.
The decision tree is categorized as a case indexing technique with an inductive approach in case-based reasoning. Case indexing refers to assigning indexes to cases for future retrieval and comparison. Inductive approaches are used to determine the case-base structure, which establishes the relative importance of features for discriminating among similar cases; the resulting hierarchical structure of the case base provides a reduced search space for the case retriever. This may, in turn, reduce query search time [6]. Other case indexing techniques are nearest-neighbor retrieval, knowledge-guided approaches and validated retrieval.
Plenty of algorithms have been developed to build decision trees, such as ID3, CART and C4.5 [5].
Our research builds an application to detect the possibility of application withdrawal by recalling a previous experience suitable for solving the current problem; to make the case search and matching process easier, an indexing method is applied before building the decision tree. The decision tree is developed with the C4.5 algorithm, an improvement on its predecessor, the ID3 algorithm. The application was designed to be flexible: it allows the variables and training cases to be modified. As trial data, it used more than 1500 records of new student applicants for the 2006/2007 academic year at STMIK AMIKOM Yogyakarta.

2. Theory Background
2.1 Case Based Reasoning
Case-based reasoning (CBR) is a problem-solving technique based on knowledge of previous experience [1].
The problem-solving life cycle in a CBR system consists essentially of the following four parts (see Fig. 1):
1. Retrieving previously experienced cases (e.g., problem-solution-outcome triples) whose problem is judged to be similar
2. Reusing the cases by copying or integrating the solutions from the cases retrieved
3. Revising or adapting the solution(s) retrieved in an attempt to solve the new problem
4. Retaining the new solution once it has been confirmed or validated
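The four phases can be pictured as a single control loop. The following Python skeleton is only an illustration of the cycle's shape, with hypothetical callables standing in for each phase; it is not an implementation from the paper.

def cbr_solve(case_base, new_problem, retrieve, reuse, revise, validate):
    """One pass through the CBR cycle: Retrieve, Reuse, Revise, Retain.
    retrieve/reuse/revise/validate are domain-specific callables."""
    similar = retrieve(case_base, new_problem)    # 1. Retrieve similar cases
    proposed = reuse(similar, new_problem)        # 2. Reuse their solutions
    solution = revise(proposed, new_problem)      # 3. Revise for the new problem
    if validate(solution, new_problem):           # 4. Retain once confirmed
        case_base.append((new_problem, solution))
    return solution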


Fig. 1. Case Based Reasoning Life Cycle [6]

2.2 Decision Tree
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable [5]. A decision tree may be painstakingly constructed by hand, in the manner of Linnaeus and the generations of taxonomists that followed him, or it may be grown automatically by applying any one of several decision tree algorithms to a model set of pre-classified data.
The target variable is usually categorical, and the decision tree model is used either to calculate the probability that a given record belongs to each of the categories or to classify the record by assigning it to the most likely class. Decision trees can also be used to estimate the value of a continuous variable, although other techniques are more suitable for that task [5].
Since the decision tree combines data exploration with modeling, it is a very good first step in the modeling process, even when it is positioned as the final model of some other technique.
Badriyah, T. (2006) built a classification utility based on decision trees for a decision support system. The algorithm used was ID3. The utility built in Badriyah's research succeeded in producing a decision tree and if-then rules to solve a problem in a decision support system [2].
The C4.5 algorithm is Quinlan's extension of his own ID3 algorithm for generating decision trees [3]. Just as with CART, the C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible. However, there are interesting differences between CART and C4.5 [5]:
- Unlike CART, the C4.5 algorithm is not restricted to binary splits. Whereas CART always produces a binary tree, C4.5 produces a tree of more variable shape.
- For categorical attributes, C4.5 by default produces a separate branch for each value of the categorical attribute. This may result in more "bushiness" than desired, since some values may have low frequency or may naturally be associated with other values.
- The C4.5 method for measuring node homogeneity is quite different from the CART method and is examined in detail below.

2.3 C4.5 Algorithm
In general, the steps of the C4.5 algorithm for building a decision tree are [4]:
- Choose an attribute for the root node
- Create a branch for each value of that attribute
- Split the cases according to the branches
- Repeat the process for each branch until all cases in the branch have the same class

The attribute chosen as the root is the one with the highest gain among all attributes. The gain is computed with formula (1) below [4]:

Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, Entropy(S_i)    (1)

with:
{S_1, ..., S_i, ..., S_n} = the partitions of S according to the values of attribute A
n = the number of partitions of S (i.e., the number of distinct values of attribute A)
|S_i| = the number of cases in partition S_i
|S| = the total number of cases in S

while the entropy is given by formula (2) below [4]:

Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i    (2)

with:
S = the case set
n = the number of classes in S
p_i = the proportion of cases in S that belong to class i
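To make formulas (1) and (2) concrete, here is a minimal Python sketch of both computations. The function and variable names are ours, chosen for illustration; a case is assumed to be a dict whose class label is stored under a designated result attribute.

import math
from collections import Counter

def entropy(cases, result_attr):
    """Formula (2): Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    total = len(cases)
    if total == 0:
        return 0.0
    class_counts = Counter(case[result_attr] for case in cases)
    return sum(-(n / total) * math.log2(n / total)
               for n in class_counts.values())

def gain(cases, attr, result_attr):
    """Formula (1): Gain(S, A) = Entropy(S) - sum(|Si|/|S| * Entropy(Si)),
    where the Si partition S according to the values of attribute A."""
    total = len(cases)
    partitions = {}
    for case in cases:
        partitions.setdefault(case[attr], []).append(case)
    weighted = sum(len(s) / total * entropy(s, result_attr)
                   for s in partitions.values())
    return entropy(cases, result_attr) - weighted

For example, gain(cases, 'school_grade', 'Cancelled') would measure how well the discretized school grade separates the two classes; here 'Cancelled' is only a hypothetical name for whichever attribute holds the cancel/stay outcome.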
3. Design
To implement the C4.5 algorithm for building the decision tree, we use several tables in a relational database:
- Data: {Student_id, Name, Religion, school_grade, ...}. It is used to store the candidate student data.
- Atribut_List: {Atribut_Name, Is_Result, Is_Active}. It is used to store the list of attributes used to build the decision tree. Is_Result indicates whether an attribute is the result (class) variable, while Is_Active indicates whether the attribute is currently in use.
- Data_Value: {Atribut_Name, Atribut_Value, Min_Value, Max_Value}. It is used to store the value definition of each attribute. For example, the student school grade is classified into several values: A for grades between 8 and 10, B for grades between 7 and 8, and C for grades under 7.


- Cases: {Case_Id, Atribut_Name[0], Atribut_Name[1], ..., Atribut_Name[n]}. It is used to store the case data. Cases are taken from the student data according to the Atribut_Name entries selected in the Atribut_List table.
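As an illustration of how the Data_Value ranges can be applied when cases are extracted from the Data table, the following Python sketch maps a raw school grade to its categorical value. The example rows, the helper name discretize, and the boundary handling (first matching range wins) are our assumptions for illustration, not the paper's actual implementation.

# Illustrative Data_Value rows for the school_grade attribute:
# (Atribut_Name, Atribut_Value, Min_Value, Max_Value)
DATA_VALUE = [
    ("school_grade", "A", 8.0, 10.0),
    ("school_grade", "B", 7.0, 8.0),
    ("school_grade", "C", 0.0, 7.0),
]

def discretize(attr_name, raw_value, data_value=DATA_VALUE):
    """Map a raw numeric value to its categorical Atribut_Value using
    the [Min_Value, Max_Value] ranges defined for the attribute.
    Boundary ties are resolved by first match in list order (an assumption)."""
    for name, label, lo, hi in data_value:
        if name == attr_name and lo <= raw_value <= hi:
            return label
    return None  # the value falls outside every defined range

# discretize("school_grade", 7.4) -> "B"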

Besides the tables explained above, we dynamically create two kinds of tables: Work: {Atribut_Name, Gain} and Detail_Work: {Atribut_Name, Atribut_Value, Case_Count, Case_Count_Result[0], Case_Count_Result[1], ..., Case_Count_Result[n], Entropy}. The Work table stores, for each attribute, the gain used to choose the attribute of the selected node, whereas the Detail_Work table stores the values of each attribute together with their case counts and entropies, from which the gain values stored in the Work table are computed.
Work and Detail_Work tables are created dynamically for each node, from the root down to the leaves of the decision tree.
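To make the node-selection step concrete, here is a hedged Python sketch of the recursive construction described in Section 2.3, mirroring what the Work table (gain per attribute) and Detail_Work table (per-value counts and entropies) hold at each node. It assumes the entropy() and gain() functions from the earlier sketch are in scope; the dictionary-based node representation and the majority-class fallback are our assumptions, not the paper's design.

from collections import Counter

def build_tree(cases, attrs, result_attr):
    """Recursive C4.5-style construction (Section 2.3): stop when all
    cases in the branch share one class, otherwise split on the
    attribute with the highest gain and recurse into each branch."""
    classes = {case[result_attr] for case in cases}
    if len(classes) == 1:
        return {"leaf": classes.pop()}          # pure branch -> leaf
    if not attrs:
        # no attributes left to split on: label with the majority class
        majority = Counter(c[result_attr] for c in cases).most_common(1)[0][0]
        return {"leaf": majority}
    # analogous to the Work table: gain of every candidate attribute
    gains = {a: gain(cases, a, result_attr) for a in attrs}
    best = max(gains, key=gains.get)
    # one branch per value of the chosen attribute, as C4.5 does by default
    partitions = {}
    for case in cases:
        partitions.setdefault(case[best], []).append(case)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    for value, subset in partitions.items():
        node["branches"][value] = build_tree(subset, remaining, result_attr)
    return node

Here each case is a dict such as {"school_grade": "B", ..., "Cancelled": "yes"}, with "Cancelled" again standing in for whichever attribute is flagged Is_Result in the Atribut_List table.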
The general steps of using our application are shown in Fig. 2 below.

Fig. 2. General steps of the application

4. Result
The interface of the application we created is shown in Fig. 3 below, and the resulting decision tree is displayed in the interface shown in Fig. 4.

Fig. 4. Decision Tree interface

The possibility that a candidate is going to withdraw his/her application can be determined by matching the candidate's data against the decision tree, following a route from the root to a leaf. The leaf obtained describes whether the candidate is likely to leave or to stay.
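A minimal sketch of this root-to-leaf matching, assuming the node dictionaries produced by the build_tree() sketch above; returning None for an attribute value never seen in training is our assumption, since the paper does not describe that case.

def classify(tree, candidate):
    """Walk from the root to a leaf by following the branch that
    matches the candidate's value for each decision attribute."""
    node = tree
    while "leaf" not in node:
        value = candidate.get(node["attribute"])
        branch = node["branches"].get(value)
        if branch is None:
            return None  # unseen value: no matching route (assumption)
        node = branch
    return node["leaf"]  # e.g., the cancel or stay class label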
The decision tree produced conforms to the case data given as input. In this application, the user is allowed to add, replace or delete cases. In addition, the variables used to build the decision tree can be modified or managed from the application.
From the 1956 records of new student applicants in STMIK AMIKOM Yogyakarta's 2006 admission period, we used 1500 records to train the application. The remaining data was used as new input, and the results were used to test the application's accuracy.

5. Conclusion
The application we have built can produce a decision tree that conforms to the variables and case data given by the user. The prediction accuracy of the application depends heavily on the variables chosen as the basis for building the decision tree.
As a further improvement, future research can explore which variable(s) produce the highest prediction accuracy.

References
(1) Armengol, E., Ontañón, S., and Plaza, E., Explaining Similarity in CBR, Artificial Intelligence Research Institute (IIIA-CSIC), Campus UAB, 08193 Bellaterra, Catalonia
(2) Badriyah, T., Rahmawati, R., Alat Bantu Klasifikasi dengan Pohon Keputusan untuk Sistem Pendukung Keputusan [A Classification Tool Using Decision Trees for Decision Support Systems], Proceedings: Seminar Nasional Aplikasi Teknologi Informasi 2006, Jurusan Teknik Informatika, Universitas Islam Indonesia, Yogyakarta (2006)
(3) Berry, Michael J.A., Linoff, Gordon S., Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, Second Edition, Wiley Publishing, Inc., Indianapolis, Indiana (2004)
(4) Craw, S., Case Based Reasoning, Lecture 3: CBR Case-Base Indexing, www.comp.rgu.ac.uk/staff/smc/teaching/cm3016/Lecture-3-cbr-indexing.ppt


(5) Larose, Daniel T., Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley and Sons, USA (2005)
(6) Pal, Sankar K., Shiu, Simon C.K., Foundations of Soft Case-Based Reasoning, John Wiley and Sons, USA (2004)
