Enrollment Prediction - Project
1 Introduction
Following World War II, a great need for higher education institutions arose
in the United States, and higher education leaders built institutions on a
"build it and they will come" basis. After World War II, enrollment in
public as well as private institutions soared (Greenberg, 2004); however, this
changed by the 1990s: a significant drop in enrollment left universities in
a marketplace of "hypercompetition," and institutions faced the unfamiliar
problem of receiving fewer applicants than they were used to receiving (Klein, 2001).
Today, higher education institutions face the problem of student retention,
which is related to graduation rates: colleges with higher freshman
retention rates tend to have higher four-year graduation rates. The
national average retention rate is close to 55%, in some colleges fewer than
20% of an incoming student cohort graduate (Druzdzel and Glymour, 1994), and
approximately 50% of students entering an engineering program leave before
graduation (Scalise et al., 2000).
Tinto (1982) reported national dropout rates and BA degree completion
rates over the past 100 years to be constant at 45 and 52 percent, respectively,
with the exception of the World War II period (see Figure 1 for the completion
rates from 1880 to 1980). Tillman and Burns at Valdosta State University
(VSU) projected lost revenues per 10 students who do not persist past their first
semester to be $326,811. Although the gap between private institutions and public
institutions in terms of first-year students returning for a second year is closing,
retention rates have been constant for a long period for both types of
institutions (ACT, 2007). The National Center for Public Policy and Higher Education
(NCPPHE) reported the U.S. average retention rate for the year 2002 to be
73.6% (NCPPHE, 2007). This problem is not limited to U.S. institutions;
it also affects institutions in many other countries, such as the U.K. and Belgium.
The U.K. national average freshman retention rate for the year 1996 was 75% (Lau,
2003), and Vandamme (2007) found that 60% of first-generation first-year
students in Belgium fail or drop out.
Figure 1: BA Degree Completion Rates for the period 1880 to 1980, where
Percent Completion is the Number of BAs Divided by the Number of First-time
Degree Enrollments Four Years Earlier (Tinto, 1982)
2 Research Objective
The research objectives of this project were:
• To build models to predict enrollment using the student admissions data
• To evaluate the models using cross-validation, win-loss tables and quartile
charts
3 Classifiers
3.1 Decision Trees
Decision trees are a collection of nodes, branches, and leaves. Each node
represents an attribute; the node is then split into branches and leaves. Decision
trees follow a "divide and conquer" approach: each node is divided, using a
purity criterion, until the data are classified well enough to meet a stopping
condition. The Gini index and the information gain ratio are two common purity
criteria; the Classification and Regression Tree (CART) algorithm uses the
Gini index, and the C4.5 algorithm uses the information gain ratio (Quinlan, 1986,
1996). The Gini index is given by Equation 1, and the entropy from which
information gain is computed is given by Equation 2.
$$I_G(i) = 1 - \sum_{j=1}^{m} f(i,j)^2 = \sum_{j \neq k} f(i,j)\, f(i,k) \quad (1)$$

$$I_E(i) = -\sum_{j=1}^{m} f(i,j) \log_2 f(i,j) \quad (2)$$
where m is the number of values an attribute can take, and f(i, j) is the
proportion of instances in node i that belong to the jth class.

Figure 2: CRISP-DM Model Version 1.0
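To make the two purity criteria concrete, here is a small Python sketch (our illustration, not part of the original study) that computes the Gini index of Equation 1 and the entropy of Equation 2 for the class labels at a node:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Information entropy: -sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A node holding 6 enrolled ("Y") and 4 not-enrolled ("N") applicants.
node = ["Y"] * 6 + ["N"] * 4
print(gini(node))     # 1 - (0.6^2 + 0.4^2) = 0.48
print(entropy(node))  # -(0.6*log2(0.6) + 0.4*log2(0.4)) = 0.971
```

A pure node scores zero on both measures; the split that most reduces the measure is preferred.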
Figure 3 is an example of constructing a decision tree using the Titanic data
and the Clementine software. Based on impurity, Clementine selected the attribute
sex (male or female) as the root node; then, for the attribute value sex =
male, Clementine created one more split on age (child or adult).
3.2 Rules
The construction of rules is quite similar to the construction of decision trees;
however, a rule learner takes each class in turn and seeks rules that cover all the
instances of that class while excluding the instances that do not belong to it.
Such algorithms are therefore called covering algorithms, and pseudocode for one
is given in Figure 4, reproduced from Witten and Frank (2005). A fictitious example
of a rule such a learner might produce is given in Figure 5.
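In the same spirit as that pseudocode, the sketch below gives a simplified PRISM-style covering learner in Python. It is an illustration under our own naming, not the exact algorithm of Figure 4: for one class, it repeatedly grows a rule by greedily adding the attribute-value test with the highest accuracy, then removes the covered instances.

```python
def learn_rules_for_class(instances, target, klass):
    """Simplified PRISM-style covering learner (illustration only):
    grow rules that cover only `klass`, removing covered instances."""
    rules = []
    remaining = list(instances)
    while any(x[target] == klass for x in remaining):
        conds = {}
        covered = remaining
        while any(x[target] != klass for x in covered):
            # Candidate tests: attribute=value pairs not yet in the rule.
            tests = {(a, v) for x in covered for a, v in x.items()
                     if a != target and a not in conds}
            if not tests:          # inconsistent data: give up on purity
                break
            def accuracy(t):
                hit = [x for x in covered if x[t[0]] == t[1]]
                return sum(x[target] == klass for x in hit) / len(hit)
            a, v = max(tests, key=accuracy)
            conds[a] = v
            covered = [x for x in covered if x[a] == v]
        rules.append(conds)
        remaining = [x for x in remaining if x not in covered]
    return rules

# Tiny hypothetical admissions table, in the spirit of this project.
apps = [
    {"FinancialAid": "Yes", "GPA": "High", "Enroll": "Yes"},
    {"FinancialAid": "Yes", "GPA": "Low",  "Enroll": "Yes"},
    {"FinancialAid": "No",  "GPA": "High", "Enroll": "No"},
    {"FinancialAid": "No",  "GPA": "Low",  "Enroll": "No"},
]
print(learn_rules_for_class(apps, "Enroll", "Yes"))
# [{'FinancialAid': 'Yes'}]
```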
Figure 3: Construction of Decision Tree by Clementine
IF FinancialAid = "Yes" AND HighSchoolGPA > 3.00
THEN Persistence = "Yes"

Figure 5: A Fictitious Example of a Rule
4 Data
The data for this research came from WVU's data warehouse. WVU used SunGard
Banner as its Enterprise Resource Planning (ERP) system to run student
operations. This system stores the data in a relational database, which a
unit in the Office of Information Technology (OIT) queried with SQL to
obtain the data in flat files. This flow of data is represented in Figure 6.
The extracted data contained 248 attributes with demographic and academic
information about the applicants. We performed some preprocessing on these data
to use them in the modeling process (a sketch of these steps is given after the list):
• All the data tables were joined to create a single table.
• Applications that were not accepted were removed, reducing the total number
of instances to 112,390.
• Domain knowledge and common sense were used to remove some attributes
(e-mail addresses, phone numbers, etc.).
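As an illustration only, the following pandas sketch mirrors these steps; the file, table, and column names (demographics.csv, AcceptedIndicator, and so on) are hypothetical stand-ins, not WVU's actual Banner schema.

```python
import pandas as pd

# Hypothetical extracts from the Banner flat files.
demographics = pd.read_csv("demographics.csv")   # one row per applicant
academics = pd.read_csv("academics.csv")

# Join all the data tables into a single table on a shared applicant key.
data = demographics.merge(academics, on="ApplicantID", how="inner")

# Keep only accepted applications.
data = data[data["AcceptedIndicator"] == "Y"]

# Drop attributes that domain knowledge says cannot generalize.
data = data.drop(columns=["EmailAddress", "PhoneNumber"])
```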
5 Experiment
5.1 Feature Subset Selection (FSS)
Feature subset selection is a method for selecting relevant attributes (or features)
from the full set of attributes as a means of dimensionality reduction. Although
some data mining techniques, such as decision trees, select relevant attributes
on their own, experiments have shown that their performance can still be improved
by prior feature selection (Witten and Frank, 2005, p. 288).
The two main approaches to feature (or attribute) selection are filters and
wrappers (Witten and Frank, 2005). A filter is an unsupervised attribute
selection method that conducts an independent assessment of the general
characteristics of the data; it is called a filter because the attributes are
filtered before the learning procedure starts. A wrapper is a supervised attribute
selection method that uses a data mining algorithm to evaluate the attributes;
it is called a wrapper because the learning method is wrapped inside the attribute
selection technique. Attribute selection methods can employ different search
algorithms, such as genetic algorithms, greedy stepwise search, and rank search.

Figure 7: Enrolled Indicator

Figure 9: Residency Indicator
For this research, we used Wrapper and InfoGain; Wrapper included the J48
tree learner and the Naive Bayes learner as part of the attribute selection process.
We used these FSS techniques to generate rankings of the attributes in order of
importance. We then used these rankings to add attributes to the dataset one at a
time and evaluate the changes in accuracy of three different learners: J48, Naive
Bayes, and Ridor. To avoid learning bias, we cross-validated each learning
procedure 10 times.
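The original experiment used Weka's attribute selection. As a rough modern analogue, the sketch below expresses the wrapper and filter ideas in scikit-learn terms, with DecisionTreeClassifier standing in for J48 and mutual information standing in for InfoGain; it assumes the hypothetical table from the preprocessing sketch, already numerically encoded, and does not reproduce the exact Weka configuration.

```python
from sklearn.feature_selection import SequentialFeatureSelector, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Continuing the hypothetical table from the preprocessing sketch.
X = data.drop(columns=["EnrolledIndicator"])   # attributes (already numeric)
y = data["EnrolledIndicator"]                  # class: enrolled or not

# Wrapper: a forward search that scores each candidate attribute subset
# by the 10-fold cross-validated accuracy of the learner itself.
wrapper = SequentialFeatureSelector(
    DecisionTreeClassifier(),                  # stand-in for J48
    n_features_to_select=2, direction="forward",
    scoring="accuracy", cv=10)
wrapper.fit(X, y)
print("wrapper picked:", list(X.columns[wrapper.get_support()]))

# Filter: rank attributes by mutual information with the class,
# independently of any learner (an InfoGain-style ranking).
scores = mutual_info_classif(X, y)
print("InfoGain-style ranking:", sorted(zip(scores, X.columns), reverse=True))
```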
5.2 xval
We ran a script called xval, which performed the following actions:
For this experiment, we set the value of <repeat> to 10, and we used the Nbins
and Fayyad-Irani discretizers. We used five learners: JRip, J48, AODE, Bayes,
and OneR.
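The xval script itself is not reproduced here; the following sketch only shows the general shape of such an experiment in scikit-learn terms, with 10 repeats of 10-fold cross-validation and an equal-width nbins discretizer (scikit-learn has no built-in Fayyad-Irani MDL discretizer, so that variant is omitted). The learner pairing and bin count are our assumptions, and X and y come from the previous sketch.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# <repeat> = 10: repeat stratified 10-fold cross-validation ten times.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)

for name, learner in [("tree (J48-like)", DecisionTreeClassifier()),
                      ("naive Bayes", GaussianNB())]:
    # Discretize with equal-width bins, then learn, inside each fold.
    model = make_pipeline(
        KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
        learner)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```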
6 Results
Using the rankings obtained from the FSS experiment, we added each attribute
to the dataset sequentially and ran the learning procedure to observe the changes
in accuracy. As shown in Figure 10, accuracy stayed between 83% and 84% for all
combinations after adding the variable FinancialAid Indicator.
Figure 10: Results Comparison for FSS, Accuracy, and Number of Attributes.
(a) Accuracy Using Wrapper and J48; (b) Accuracy Using Wrapper and Bayes
The dataset Data WRP NB J48 was created with the two attributes selected using
the wrapper, and the dataset Data IG was created with the seven attributes selected
using InfoGain, because the tree size was small with seven attributes (see Figure
11). Figure 12 shows the results obtained by using different learners on the
datasets created using Wrapper and InfoGain. Ridor with the nbins discretizer was
the best for the Data WRP NB J48 dataset (highlighted in Figure 12a), and J48
with the Fayyad-Irani discretizer was the best for the Data IG dataset (highlighted in
Figure 12b).
No significant difference was found between these two datasets by any of the
learners; however, statistically, by means of a t-test at the 95% confidence level,
J48 with Fayyad-Irani was the best and OneR with nbins was the worst, as shown
in the win-loss table (Figure 13a). The quartile charts show the margin of a win
or loss over the other learners, as shown in Figure 13b.

Figure 11: Number of Attributes vs. Tree Size using J48
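As an aside on how such a win-loss cell can be computed: a paired t-test over per-fold accuracies decides whether one learner beats another. The sketch below is a minimal illustration with made-up fold scores, not the study's actual numbers.

```python
from scipy.stats import ttest_rel

# Per-fold accuracies of two learners on the same folds (hypothetical).
scores_a = [0.84, 0.83, 0.85, 0.84, 0.83, 0.84, 0.85, 0.83, 0.84, 0.84]
scores_b = [0.81, 0.80, 0.82, 0.81, 0.80, 0.81, 0.82, 0.80, 0.81, 0.81]

t_stat, p_value = ttest_rel(scores_a, scores_b)
if p_value < 0.05:          # 95% confidence level
    winner = "A" if t_stat > 0 else "B"
    print(f"learner {winner} wins (p = {p_value:.4f})")
else:
    print("tie: no significant difference")
```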
7 Conclusions
Overall, financial aid was the most important factor attracting students to
enroll. Students enrolled at this institution if they received some form of financial
aid, regardless of their high school GPA and ACT/SAT scores. Therefore,
financial aid can be used as a controlling factor for increasing the quality of
incoming students.
Figure 12: (a) Results for Wrapper Dataset; (b) Results for InfoGain Dataset
Figure 13: (a) Win-Loss Table; (b) Quartile Chart
FinancialAidIndicator = N
| ApplicationStypCode = 0: N
| ApplicationStypCode = A: N
| ApplicationStypCode = B: Y
| ApplicationStypCode = C: N
| ApplicationStypCode = D: Y
| ApplicationStypCode = E: Y
FinancialAidIndicator = Y: Y
Number of Leaves : 7
Size of the tree : 9
Correctly Classified Instances 93448 83.1462%
Figure 14: J48 tree with two attributes and accuracy of 83.15%
EnrolledIndicator = Y
Except (FinancialAidIndicator = N) and (ApplicationStypCode = A) =>
EnrolledIndicator = N
Except (FinancialAidIndicator = N) and (ApplicationStypCode = C) =>
EnrolledIndicator = N
Total number of rules (incl. the default rule): 3
Correctly Classified Instances 93349 83.0581 %
Figure 15: Ridor rules with two attributes and accuracy of 83.06%
8 Future Work
Attributes such as distance from campus and first method of contact should be
created to see their effect. Although financial aid was the most significant
factor resulting in enrollment, the amount of financial aid offered should also
be included in the data, so that "bins" can be created on the amounts offered
and learners can then use those bins for classification, as sketched below.
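As a minimal sketch of that binning step, assuming a hypothetical FinancialAidAmount column and arbitrary dollar cut points of our own choosing:

```python
import pandas as pd

# Cut hypothetical aid amounts (in dollars) into labeled bins that a
# classifier can then use as a discrete attribute.
data["AidBin"] = pd.cut(
    data["FinancialAidAmount"],
    bins=[0, 2_500, 5_000, 10_000, float("inf")],
    labels=["low", "medium", "high", "full"],
    include_lowest=True)
```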
Even though financial aid helps in recruiting students, it does not necessarily
help in retaining them. In order to find the attributes that affect retention,
the enrolled indicator and a "persistence indicator" should be combined. Similar
experiments would then be necessary to find relationships between student
demographics, academic information, and retention.
9 Acknowledgments
The authors sincerely thank Dr. Tim Menzies, our faculty adviser, and Roberta
Dean, director of institutional research, both from West Virginia University, for
the effort, expertise, and help they offered us on this project.
References
ACT. ACT National Collegiate Retention and Persistence to Degree Rates, 2007.
http://www.act.org/research/policymakers/reports/retain.html.
C.M. Antons and E.N. Maltz. Expanding the role of institutional research at small
private universities: A case study in enrollment management using data mining.
New Directions for Institutional Research, 2006(131):69, 2006.
B.L. Bailey. Let the data talk: Developing models to explain IPEDS graduation rates.
New Directions for Institutional Research, 2006(131):101–115, 2006.
K. Barker, T. Trafalis, and T. R. Rhoads. Learning from student data. Systems and
Information Engineering Design Symposium, pages 79–86, 2004.
L. Chang. Applying data mining to predict college admissions yield: A case study.
New Directions for Institutional Research, 2006(131), 2006.
N. Delavari and M. R. Beikzadeh. A new analysis model for data mining processes in
higher educational systems, 2004.
P.W. Eykamp. Using data mining to explore which students use advanced placement
to reduce time to degree. New Directions for Institutional Research, 2006(131):83,
2006.
M. Greenberg. How the GI Bill changed higher education. June 18, 2004.
K. H. Im, T. H. Kim, S. Bae, and S. C. Park. Conceptual modeling with neural network
for giftedness identification and education. In Advances in Natural Computation,
volume 3611, page 530. Springer, 2005.
T.A. Klein. A fresh look at market segments in higher education. Planning for Higher
Education, 30(1):5, 2001.
J. Luan and A. M. Serban. Data mining and its application in higher education. In
Knowledge Management: Building a Competitive Advantage in Higher Education:
New Directions for Institutional Research. Jossey-Bass, 2002.
Y. Ma, B. Liu, C. K. Wong, P. S. Yu, and S. M. Lee. Targeting the right students
using data mining. In Conference on Knowledge Discovery and Data Mining, pages
457–464, Boston, Massachusetts, 2000. ACM Press New York, NY, USA.
S. Massa and P.P. Puliafito. An application of data mining to the problem of the
university students' dropout using Markov chains. In Principles of Data Mining and
Knowledge Discovery. Third European Conference, PKDD’99, pages 51–60, Prague,
Czech Republic, 1999.
NCPPHE. Retention rates - first-time college freshmen returning their second year
(ACT), 2007.
E.N. Ogor. Student academic performance monitoring and evaluation using data min-
ing techniques. Electronics, Robotics and Automotive Mechanics Conference, 2007.
CERMA 2007, pages 354–359, 2007.
Z.A. Pardos, N.T. Heffernan, B. Anderson, and C.L. Heffernan. Using fine grained
skill models to fit student performance with Bayesian networks. In 8th Interna-
tional Conference on Intelligent Tutoring Systems (ITS 2006), pages 5–12, Jhongli,
Taiwan, 2006.
A. Salazar, J. Gosalbez, I. Bosch, R. Miralles, and L. Vergara. A case study of knowl-
edge discovery on academic achievement, student desertion and student retention.
Information Technology: Research and Education, 2004. ITRE 2004. 2nd Interna-
tional Conference on, pages 150–154, 2004.
V. Tinto. Limits of Theory and Practice in Student Attrition. The Journal of Higher
Education, 53(6):687–700, 1982.
I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann Publishers, San Francisco, 2nd edition, 2005.
Chong Ho Yu, Samuel DiGangi, Angel Jannasch-Pennell, Wenjuo Lo, and Charles
Kaprolet. A data-mining approach to differentiate predictors of retention between
online and traditional students, 2007.