CUST THESIS - Student Evaluations
Recommender System
By
Muthara-Tul-Ain
Copyright 2017 by CUST Student
All rights reserved. Reproduction in whole or in part in any form requires the
prior written permission of Muthara-Tul-Ain (MS133033) or designated
representative
CAPITAL UNIVERSITY OF SCIENCE & TECHNOLOGY
ISLAMABAD
Islamabad Expressway, Kahuta Road, Zone-V, Islamabad
Phone: +92 51 111 555 666, Fax: 92 51 4486705
Email: [email protected], Website: http://www.cust.edu.pk
CERTIFICATE OF APPROVAL
Muthara-Tul-Ain
MS133033
________________________________
DEDICATED
TO
MY
RESPECTED
&
FOR
THEIR CONSTANT
ACKNOWLEDGMENT
All praise and exaltation is due to ALLAH (S.W.T), the creator and sustainer
of all seen and unseen worlds. First and foremost, I would like to express my
gratitude and thanksgiving to Him for providing me with the bounties and
blessings to complete this work. Secondly, I would like to express my sincerest
appreciation to my supervisor, Dr Nayyar Masood, for his direction,
assistance and guidance. I sincerely thank him for his support, encouragement
and technical advice in the research area. I am heartily thankful to him, as his
guidance through to the final level enabled me to develop an understanding of
the subject. He has taught me, both consciously and unconsciously, how good
experimental work is carried out. Sir, you will always be remembered in my
prayers.
I pray to ALLAH (S.W.T) that He may bestow true success upon me in all
fields in both worlds and shower His blessed knowledge upon me for the
betterment of all Muslims and the whole of mankind.
AAMEEN
Muthara-Tul-Ain
DECLARATION
It is declared that this is an original piece of my own work, except where
otherwise acknowledged in text and references. This work has not been
submitted in any form for another degree or diploma at any university or other
institution for tertiary education and shall not be submitted by me in future for
obtaining any degree from this or any other University or Institution.
Muthara-Tul-Ain
November, 2017
ABSTRACT
Mining data and extracting information from large databases has become an
active research area. The idea of extracting information with data mining
techniques emerged a couple of decades ago. Initially, researchers applied
classification and clustering techniques to partition a dataset and analyze its
intrinsic features, and then made predictions on the basis of those features. In
educational data mining, such predictions serve several purposes: predicting
the performance of students from the factors associated with them, and
recommending suitable courses and appropriate teachers to them. These
purposes derive from the area of student retention and attrition, and our
research pursues them within that area. We have identified existing factors that
are useful for predicting student performance, recommending the most suitable
teachers and helping students select courses. We applied classification
algorithms suited to the nature of the data, which were collected from the
Capital University of Science and Technology (CUST), Islamabad. In this
study, we attempt to predict the first-semester GPA on the basis of midterm
marks and previous academic grades. For the second semester, we predict the
CGPA of the same students from their complete preceding academic record
using a hybrid approach. The hybrid approach consists of a combination of the
factors evaluated against our research questions. Moreover, we try to improve
student performance by recommending suitable courses and the teachers whose
performance is comparatively better. The data were collected from CUST in
order to validate the existing factors in the local context (Pakistan). Using
classification algorithms such as Naïve Bayes and J48, we were able to build
the recommender system. We then measured which factors contributed most to
validating the existing factors in the local context. Finally, appropriate teacher
allocation was measured in two ways: statistically and by prediction. In the
statistical experiments, the average performance of teachers, the Z-test and the
ANOVA test were applied. In the prediction experiments, the performance of
one subject's teacher was computed with and without the other subject's
teacher-name attribute, and the overall performance of teachers was computed
for each subject independently. Our research may provide educational
institutions with a way to reduce the attrition rate and increase the retention
rate.
Table of Contents
Abstract .........................................................................................................................vii
Introduction ..................................................................................................................... 1
1.1 Background Of Research ......................................................................................... 2
1.2 Problem Statement ..................................................................................................... 6
1.3 Research Questions.................................................................................................... 6
1.4 Scope ......................................................................................................................... 6
1.5 Application Of Proposed Approach............................................................................ 6
1.6 Significance Of Solution ............................................................................................ 7
1.7 Organization Of Thesis .............................................................................................. 8
1.8 Definitions And Frequently Used Terms .................................................................... 8
Literature Review ............................................................................................................ 9
2.1 Educational Data Mining ......................................................................................... 10
2.2 Student Performance Prediction ............................................................................... 14
2.3 Commonly Used Approaches In EDM ...................................................................... 16
2.4 Commonly Used Attributes For Student Performance Prediction ............................ 19
2.5 Literature Review Summary .................................................................................... 21
Research Methodology .................................................................................................. 22
3.1 Data Collection And Pre-Processing ........................................................................ 24
3.2 Classification ........................................................................................................... 27
3.3 Algorithms And Techniques .................................................................................... 29
3.4 Evaluation Of Research Questions ........................................................................... 30
Results & Evaluations.................................................................................................... 35
4.2 Final Attribute Selection .......................................................................................... 38
4.3 Result Of Evaluation Of Research Questions ........................................................... 38
Conclusion And Future Work ........................................................................................ 67
5.1 Future Work ............................................................................................................ 70
References ..................................................................................................................... 71
List of Tables
Table 2.1: Literature Review Summary ........................................................................................... 21
Table 3.2: Summary Of Session-Wise Attrition............................................................................. 24
Table 3.3: Summary Of Session-Wise Attrition............................................................................. 25
Table 3.4: Selection Of Attributes ..................................................................................................... 28
Table 3.5: Attributes Selection To Dataset ...................................................................................... 31
Table 3.6: Pattern Of Combination Of Attributes .......................................................................... 32
Table 4.1: Occurrence Of Missing Values ....................................................................................... 36
Table 4.2: Handling Of Missing Values ........................................................................................... 37
Table 4.3: Attributes Table ................................................................................................................... 39
Table 4.4: Classification Of Attributes And Algorithms ............................................................. 40
Table 4.5: Ranked Attributes ................................................................................................................. 42
Table 4.6: GPA Prediction For The First Semester .............................................................................. 42
Table 4.7: CGPA Prediction For The Second Semester ................................................................. 44
Table 4.8: Pre-Qualification Vs Maths Courses ............................................................................. 47
Table 4.9: Average Of Cal-II Teachers ............................................................................................ 48
Table 4.10: Average Of Cal-I Teachers ............................................................................................ 49
Table 4.11: Overall Average Of Maths Teachers ......................................................................... 50
Table 4.12: Anova Analysis On Cal-1 .............................................................................................. 52
Table 4.13: Two Samples For Mean With Respect To Teacher “E”........................................ 53
Table 4.14: Two Samples For Means With Respect To Teacher “B” ..................................... 54
Table 4.15: Teacher B With Same Number Of Students ............................................................. 55
Table 4.16: Teacher B With Different Number Of Students ...................................................... 55
Table 4.17: Comparison Of All Programming Courses ............................................................... 56
Table 4.18: Comparison Of Actual And Predicted Grades Of Itc/Itp ...................................... 57
Table 4.19: Comparison Of Actual And Predicted Performance Of Oop/Cp
Teachers ............................................................................................................................................ 58
Table 4.20: Overall Performance Of Oop Teachers ...................................................................... 59
List of Figures
Figure 3.1: Data Flow Diagram Of Proposed Methodology ......................................................... 23
Figure 4.1: Formats Of Data................................................................................................................. 36
Figure 4.2: Comparison Of Local And Existing Attributes ............................................................ 41
Figure 4.3: GPA Prediction For The 1st Semester .......................................................................... 44
Figure 4.4: CGPA Prediction For The Second Semester................................................................ 45
Figure 4.5: Pre-Qualification Vs Maths Courses ........................................................................... 48
Figure 4.6: Average Of Cal-II Teachers .......................................................................................... 49
Figure 4.7: Overall Average Of Maths Teachers ........................................................................... 50
Figure 4.8: Comparison Of Itc/Itp Teacher ...................................................................................... 51
Figure 4.9: Comparison Of Actual And Predicted Grades Of Itc/Itp ....................................... 56
Figure 4.10: Comparison Of Actual And Predicted Performance Of Oop/Cp
Teachers ............................................................................................................................................ 58
Figure 4.11: Overall Performance Of Oop Teachers .................................................................... 59
Figure 4.12: Overall Performance Of Oop Teachers .................................................................... 60
List of Abbreviations
CUST: Capital University of Science and Technology
Cal-I: Calculus-I
Cal-II: Calculus-II
SEM: Semester
Chapter 1
INTRODUCTION
Researchers have been working in the area of Educational Data Mining (EDM)
for a decade. EDM has become a broad field in which researchers conduct
experiments to extract useful information from educational data for many
purposes. These purposes include identifying student attrition and retention
rates, predicting student performance, and building course and teacher
recommender systems. Researchers in EDM explore the effectiveness of
different types of variables for measuring and predicting student performance;
these variables include a student's age, academic record and biographical
details. EDM is a growing research field that applies data mining techniques to
educational systems (Romero, C., & Ventura, S. 2007). A previous study
(Romero, C., & Ventura, S. 2013) explored how data mining approaches can
improve student outcomes. According to its authors, the large volumes of data
held by institutions have problems associated with them, so data mining
approaches should not be applied to these data directly; instead, a knowledge
discovery process is followed. In educational institutions, data mining plays a
pivotal role once the datasets have been preprocessed. The most widely used
data mining approaches include classification, clustering, outlier detection,
association rule mining, pattern mining and text mining. Researchers try to
extract useful, novel and interesting information with the help of these
approaches (Romero, C., & Ventura, S. 2010). The EDM process converts raw
educational data into useful information.
In the field of EDM, researchers are active in three broad categories: personal
recommender systems and learning environments, course management
systems, and student attrition and retention. Our research covers all of these
areas in different contexts, such as identifying students with low academic
performance, building a prediction model that predicts student performance
from historical data, raising students' confidence, assisting them in choosing
courses, and building a model that assigns appropriate teachers to students.
c) Student attrition and retention
Student attrition means the number of students who leave their courses
without completing them. The reasons may include poor course choices,
consistently undesirable results, poor performance, financial instability and
immature course selection. Student retention means the number of students
who complete their courses and acquire a degree regardless of circumstances,
with good grades on the transcript. In educational institutions, the rate of
student attrition and retention has reached a considerable level (Stallone, M. N.
2011).
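The two definitions above can be turned into rates for a single cohort. The
following is a minimal Python sketch, assuming that attrition and retention are
both measured as fractions of the students originally enrolled; the function
name and the cohort figures are hypothetical and are not taken from the CUST
dataset.

```python
def attrition_and_retention(enrolled: int, completed: int) -> tuple[float, float]:
    """Return (attrition_rate, retention_rate) as fractions of the enrolled cohort.

    Assumption: every enrolled student either completes the degree (retained)
    or leaves without completing it (attrited).
    """
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    if not 0 <= completed <= enrolled:
        raise ValueError("completed must be between 0 and enrolled")
    retention = completed / enrolled
    return 1.0 - retention, retention


# Hypothetical cohort: 120 students admitted, 90 completed the degree.
attrition, retention = attrition_and_retention(enrolled=120, completed=90)
print(f"attrition={attrition:.2%}, retention={retention:.2%}")
```

Computing both rates from the same cohort keeps them complementary, so a
reduction in attrition shows up directly as an increase in retention.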
Student attrition and retention affect not only the performance of an
educational institution or department but also faculty positions, and they
eventually create financial problems for parents. To analyze the pattern of
student retention and attrition, one has to find its causes and then take steps to
resolve the problem. Our research topic falls under student attrition and
retention, which is explained in detail in the following subsection. For this
research, we collected a dataset from the Department of Computer Science,
Capital University of Science and Technology. We consider first- and
second-semester courses because the student attrition rate is high in these
semesters. These semesters are therefore useful for analyzing student attrition
and predicting student performance.
2016). EDM is a broad research field in which researchers explore many
problems, including student performance, attrition and retention, course
selection and teacher selection. The most widely covered research problems in
educational data mining are future grade prediction, performance prediction
and course enrollment recommender systems.
This research has several aspects. One aspect is to identify the number of
students who had to leave the institution because of poor academic
performance or financial problems. We then find the number of students,
belonging to the spring semesters of three years, who completed their degree.
This research is expected to help students improve their results in the courses
used in the experiments by recommending appropriate teachers and courses
based on their associated factors.
This helps students raise their grades and spares parents financial problems.
Another aspect under consideration is retaining the maximum percentage of
students in the university until they complete their degree requirements.
Ultimately, if students' grades are good, they are more likely to remain in the
institution. This research might reduce the attrition rate by providing a teacher
recommender system, a course recommender system and student performance
prediction. The outcome of our research is discussed in detail as follows.
are removed and then classification and clustering are applied. With such
techniques, data can be further divided into clusters and classes to reveal
useful hidden patterns (Kabakchieva, D, 2013).
With respect to student performance prediction, there are many cases, such as
predicting a student's next-term grade and predicting present performance from
assessments in current courses. To address this problem, researchers have used
student characteristics such as admission records, high school scores, SAT/ACT
scores and grades in previously completed courses (Elbadrawy, et al, 2016).
Further characteristics include class tests, seminar attendance and marks, and
assignment marks (Baradwaj, B. K., & Pal, S. 2012).
The factors used in this research have already been applied in countries other
than Pakistan, namely Canada, the USA, England and Nigeria. These factors
have been applied in colleges and universities of those countries and will now
be validated in the context of Pakistan.
RQ1. Are the existing identified factors for grade prediction valid in our
local context (Pakistan)?
RQ2. Which factors help in accurate prediction of students' first-semester
GPA?
RQ3. Is it possible to improve the performance of a student by allocating an
appropriate teacher for a subject?
1.4 Scope
The application of this research is versatile. The experiments were conducted
on a dataset collected from Capital University of Science and Technology,
Islamabad (Pakistan). The same factors can be applied to the data of
government institutions as well as schools and colleges.
To make the university management system efficient
This research might be highly beneficial for the university management system
in monitoring the performance of students belonging to different departments.
It will not only help predict and improve student performance but also support
parents economically.
Moreover, the contribution of this research makes it possible to propose a
teacher recommender system that recommends suitable teachers to students by
considering their current performance in the semester and their relative interest
in the courses, which might improve their grades.
1.7 Organization of thesis
This thesis comprises five chapters. The first chapter introduces the proposed
research. The second chapter is the literature review, which discusses the work
and contributions of earlier researchers in the chosen area and presents a
summary of the literature. The third chapter contains the methodology diagram
adopted for the experiments and describes how the experiments are performed.
The fourth chapter presents the results of the experiments obtained with the
selected tools. The last chapter contains the conclusion and future work, in
which the whole thesis is summarized.
Performance Prediction
A student's grade is used to indicate his or her performance in academia. It is
now possible to predict student performance with scientific methods based on
quantitative approaches (Bhardwaj, B. K, 2012) (Pal, S. 2012).
Recommender System
Usually recommender system is built to assign a particular entity to required
entity. In the field of data mining, researchers have proposed some
recommender systems in which teacher recommender system and course
recommender systems are involved (Bozo, J., 2010) (Alarcón, R.,2010)(
Iribarra, S., 2010).
Chapter 2
LITERATURE REVIEW
Online course learning systems grew out of e-commerce ventures. As these
ventures grew, data from web resources began to be collected and stored in
Excel, including customer, product and order information (Kokina, J, 2017).
E-commerce refers to doing business online through the internet. Various
websites serve this purpose, such as eBay, Alibaba and Amazon. It can be said
that storing data from such resources in Excel originated with e-commerce.
Predicting student performance by using data mining techniques to extract
information from the academic datasets of universities has become a
state-of-the-art research topic. Universities nowadays face challenges in
analyzing the performance of their students. Researchers therefore focus on
student profiles and characteristics to make university management aware of
student performance and overall academic results (Kabakchieva, D, 2013).
Another dimension of student performance is that student retention depends on
it. To reduce student retention problems in universities, researchers have
proposed different methods to predict students' performance in a future
semester from their performance in the previous one.
To predict next-term course grades, this study considers four parameters:
admission records, high school scores, SAT/ACT scores and grades in
previously completed courses. Based on these parameters, a recommender
system can be trained to predict student grades accurately in any educational
institution. The study also considers historical information about each course,
such as which teacher taught it and information about its contents. Many
researchers have used LMS and Moore to predict students' chances of success
and failure. The study uses regression-based methods, namely course-specific
regression (CSPR) and personalized linear multi-regression (PLMR), as well as
matrix factorization based methods, in which standard matrix factorization
(MF) is used for student grade prediction (Elbadrawy, A, et al, 2015).
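To illustrate the idea behind standard matrix factorization for grade
prediction, the following Python sketch fits latent student and course factors
to a handful of observed grades using stochastic gradient descent. It is a
generic illustration only, not the implementation of the cited study: the
function names, hyperparameters (`k`, `lr`, `reg`, `epochs`) and all grades are
hypothetical assumptions.

```python
import random


def factorize(grades, n_students, n_courses, k=2, lr=0.01, reg=0.02,
              epochs=2000, seed=0):
    """Fit student and course latent factors by SGD on observed grades.

    grades: list of (student_index, course_index, grade) triples.
    Returns (P, Q) such that a grade is approximated by dot(P[s], Q[c]).
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_students)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_courses)]
    for _ in range(epochs):
        for s, c, g in grades:
            err = g - sum(P[s][f] * Q[c][f] for f in range(k))
            for f in range(k):
                p, q = P[s][f], Q[c][f]
                # Gradient step on squared error with L2 regularization.
                P[s][f] += lr * (err * q - reg * p)
                Q[c][f] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, s, c):
    """Predicted grade of student s in course c."""
    return sum(p * q for p, q in zip(P[s], Q[c]))


# Hypothetical grades on a 4.0 scale: (student, course, grade).
# Pairs (1, 2), (2, 0) and (3, 1) are unobserved and can be predicted.
observed = [
    (0, 0, 4.0), (0, 1, 3.2), (0, 2, 3.6),
    (1, 0, 3.0), (1, 1, 2.4),
    (2, 1, 1.6), (2, 2, 1.8),
    (3, 0, 3.6), (3, 2, 3.24),
]
P, Q = factorize(observed, n_students=4, n_courses=3)
print("predicted grade for student 1 in course 2:", round(predict(P, Q, 1, 2), 2))
```

The regularization term keeps the factors small, which matters when each
student has only a few observed grades, as is typical for first- and
second-semester records.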
In this chapter, the background of educational data mining and its branches is
discussed in detail, along with student performance prediction and the data
mining approaches commonly used by researchers in the literature.
1 http://searchsqlserver.techtarget.com/definition/data-mining
2 http://www.educationaldatamining.org/
educational institutions stronger. Another purpose is to build students' careers
by covering every aspect, such as improving their grades where they are
lacking, boosting overall performance, supporting them financially, enabling
them to select the courses suggested by a course recommender system, and
assigning them appropriate teachers based on their interests in course
selection.
These are some common areas in which researchers apply different data mining
techniques. Our research covers the same areas using a dataset of Pakistani
students. Three factors are applied to the students; these factors are discussed
briefly in section 2.4. In the current section, we focus on the key areas of EDM
in which researchers are engaged.
educators to make their courses available on the internet. All the data on
Moodle is managed by the Moodle team. Moodle now enables educational
institutions, community colleges and schools to create online teaching systems
(Dougiamas, M., 2003). Unlike traditional classrooms, it delivers courses
online. Course and student data are stored in its database, which researchers
then mine to extract useful information. The performance of Moodle has risen
in virtual environments. Other online course management systems, such as web
portals and learning management systems, serve the same purpose nowadays.
With the use of the internet in classrooms and institutions, institutions have
launched their own websites or web pages through which students can register
for particular courses (Kaminski, J. 2005). Researchers have published many
articles on online course learning. In this domain, LMS and Moore have
contributed a lot: they offer different courses on their websites, and thousands
of students from all over the world have registered. In this way, a huge amount
of data is collected from LMS and Moore, and researchers have analyzed these
datasets to evaluate the performance of enrolled students (Kizilcec, R. F, 2017)
(Pérez-Sanagustín, M., 2017) (Maldonado, J. J. 2017). Self-regulated learning
(SRL) is a term used by these researchers. According to them, students with
strong SRL are better at planning, managing and controlling their learning than
students with weak SRL. In this regard, MOOCs support learners with
different levels of SRL.
institutions are trying to find the factors that ultimately cause student attrition
(Azarcon Jr et al, 2014). After analyzing those factors, educational institutions
need to make strategic adjustments to improve student retention. In Gaviria,
Colombia, people's circumstances are closely tied to social mobility, which
affects students' academic performance and may increase the student attrition
rate in educational institutions. Researchers have identified three drivers of
student performance analytics: the first is the volume of data collected from
learning management systems and student information systems, the second is
e-learning, and the third is political concerns (Guarín, C. E. L., 2015)
(Guzmán, E. L., 2015) (González, F. A. 2015). Data mining has been applied in
different areas of education, and researchers are exploring these areas along
various dimensions.
Another supplement for web-based student learning is the Massive Open Online
Course (MOOC). Researchers have identified a high student dropout rate in
online course learning and management systems. Yang highlighted this
problem in his research, showing that students drop out of MOOCs at a
considerable rate. To control the student attrition problem in these courses, the
researchers proposed a model that helps determine the influential factors
causing student dropout. The research proposes predictors that capture factors
related to students' behavior and social position in the forum discussions in
which they participate (Yang, D.et al, 2013).
The problem of student attrition and retention is not new for educational
institutions. It has been highlighted by researchers from the fields of data
mining and information visualization, and it has become a common research
problem. Researchers took note of the problem when attrition reached 50% in
the colleges of Ontario (Drea, C. 2004). To reduce attrition rates, institutions
should focus on student retention. Researchers have analyzed the factors that
cause student attrition, and Drea addressed both elements, attrition and
retention. To explain student persistence in institutions, two theories have been
formulated: the student integration model by Tinto and the student attrition
model by Bean. Both models work in a similar way (Cabrera, A. F. 1993).
There are numerous reasons why student attrition has become a research
problem. Among them, personal disappointments, financial setbacks, and the
lowering of career and life goals are significant. Scientists have therefore
studied retention and persistence to resolve this societal problem (Ramist,
L.1981). The present research focuses on student performance and also
examines whether a teacher's methodology has a positive impact on student
grades and whether it reduces student attrition.
students degrade to a marked level, department management should take note
and find out the factors that affect student performance. Considering this
problem, researchers have evaluated their studies against the parameters that
affect overall student performance, data mining techniques and data mining
tools (Kaur, G, 2016) (Singh, W, 2016). These parameters include
psychological, personal and environmental factors. For the experiments, they
used Naïve Bayes and J48 techniques through WEKA.
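As an illustration of how a Naïve Bayes classifier predicts a student outcome
from categorical attributes, the following Python sketch implements the
classifier with Laplace smoothing using only the standard library. It does not
use the WEKA API and is not the cited authors' setup; the attribute names
(midterm band, attendance band), the function names and the six training
records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values = defaultdict(set)              # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict_naive_bayes(model, row):
    """Return the label with the highest log-posterior for the given row."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, v in enumerate(row):
            count = feature_counts[(i, label)][v]
            # Laplace smoothing avoids zero probabilities for unseen values.
            score += math.log((count + 1) / (n + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical records: (midterm band, attendance band) -> outcome.
rows = [("high", "good"), ("high", "poor"), ("low", "poor"),
        ("low", "good"), ("high", "good"), ("low", "poor")]
labels = ["pass", "pass", "fail", "fail", "pass", "fail"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("high", "good")))
print(predict_naive_bayes(model, ("low", "poor")))
```

A decision tree learner such as J48 would be trained on the same kind of
attribute table; Naïve Bayes is shown here only because it fits in a few lines.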
(Alarcón, R., 2010) (Iribarra, S., 2010). Online course enrollment systems have
spread globally because of the internet. Several websites try to provide the
best service they can but still lack some elements. Researchers have therefore
worked on this problem and proposed solutions. To make every course
creatively learnable, researchers have focused on recommending teachers
according to the course. The new recommender system is known as A3. With
this recommender system, not only is the right teacher associated with the
relevant course contents, but the best course contents also become available on
the site over time (Tewari, A. S, 2015).
To improve the performance of weak students in the class, the collected data of
1100 students was transformed in Weka using data mining techniques. The
researchers used software tools such as Weka, Clementine and RapidMiner in this
work. Different classifiers such as Naïve Bayes, C4.5, neural networks and
random forest were used, together with the ensemble methods AdaBoost, bagging
and boosting. It was found that some factors have the same effect in both
countries (Iran and India) while others differ. Moreover, male students were
found to suffer from stress more than female students. The performance of male
students was better in Mathematics and the formal sciences, whereas female
students performed better in Literature and the mnemonic sciences (Oskouei, R.
J., 2014) (Askari, M., 2014).
In this field of research, there is a wide variety of benchmarks used to
evaluate the performance and accuracy of experiments conducted with machine
learning approaches. Different researchers have used various types of
educational datasets, and each dataset is unique in its attributes (Garcia-Saiz,
D., 2011) (Zorrilla, M. E., 2011). Therefore, in that study the researchers
proposed a meta-algorithm to preprocess the dataset. Various data mining models
were studied with the help of the meta-algorithm to find the most accurate one.
In the study, some factors that affect the grades of students in Iran and India
were observed. These factors, namely gender, family background, the education
level of parents and lifestyle, were gathered by asking the students questions
(Oskouei, R. J., & Askari, M., 2014). To evaluate the performance of students
and improve the management of educational institutions, researchers have
introduced pre-university characteristics of students, which are known as the
student's profile and include place of secondary school, final secondary
education score, total admission score, and the score achieved in those exams.
In this research, data from the University of National and World Economy
(Bulgaria) was collected and data mining techniques were applied to the
dataset: Naïve Bayes and Bayes Net, the nearest neighbor algorithm, and two
rule learners, OneR and JRip (Kabakchieva, D. 2013).
c) Pattern mining
As discussed earlier, considerable knowledge has been gathered about online
teaching and learning systems. Through the internet, universities are now a
major source of student-teacher interaction and of learning and training. This
technology has made a place for itself in the field of educational data mining,
where knowledge diffusion has become a research focus for researchers in data
mining, web mining and graph algorithms. The data of online teaching systems
serves researchers who are keen to extract knowledge from big datasets.
Researchers are analyzing patterns in the online learning behavior of students,
and to draw outcomes from these datasets they use machine learning algorithms
and various data mining approaches. In one study, researchers used 19,934
server logs to identify the behavior of students in Taiwan who had registered
for online courses, and with the help of this data they drew conclusions after
also predicting the students' performance (Hung, J. L., & Zhang, K. 2008). In
addition, data mining techniques proved helpful for course developers, online
trainers, and instructional designers. In their study, they used the WEKA and
KNIME tools to perform descriptive and artificial intelligence analysis.
Moreover, for data visualization and statistical analysis, they used
of the mentioned approaches ultimately serve to represent the results after
being applied to the data. The concept of data visualization is derived from
visual reasoning (Inoue, S. et al., 2017).
d) Text mining
In the past decade, a number of tools such as WEKA, RapidMiner, R, KEEL, and
SNAPP have been used to extract text (useful information) from datasets in the
field of educational data mining (Baker, R. S., & Inventado, P. S. 2014). In
other research, data mining is also known as knowledge discovery from databases
(Anand, S. S. et al., 1996). In this process, database techniques are combined
with mathematical and artificial intelligence techniques.
While reviewing papers from the literature, many data mining approaches were
found that have been applied to the academic datasets of different educational
institutions for various purposes. These approaches have been applied to
attributes chosen after analyzing the underlying factors. Those factors are
briefly discussed in the following passage.
I. Demographic attributes
general weighted average, School's Radial Distance, and school ownership are
taken (Abaya, S. A., 2013) (Gerardo, B. D., 2013).
Another factor found during the literature survey is pre-university attributes.
These attributes have a considerable impact on a student's performance in any
educational institution. Pre-university attributes are useful for mapping the
student's interests in a course recommendation system. Pre-university
attributes include Secondary School Grade, Higher Secondary Grade, SAT Score,
Pre-college, Pre-board, and Pre-program. Researchers have also used the
historical data of students to find the attrition rate of students from
academia through cross validation with classification and Naïve Bayes methods
(Guarín, C. E. L., 2015) (Guzmán, E. L., 2015) (González, F. A. 2015).
repeat. Moreover, students' academic records can be improved and the efficiency
of department management can be raised.
Reference | Techniques used | Tools used | Attributes (Pre-university / Demographic / Institutional)
(Guarín, C. E. L., Guzmán, E. L., González, F. A. 2015) | Decision Trees, Bayesian model, Classification model | 10 fold cross validation, Cost sensitive | √
(Abaya, S. A. & Gerardo, B. D., 2013) | Classification through C4.5 | Irecruit Application in UNIX | √
(Kovacic, Z. 2010) | Clustering, Feature selection through cross validation, and CART classification | CHAID Model, Gain chart | √
(Kabakchieva, D. 2013) | CRISP-DM, Neural networking, Nearest neighbor classifier, Rule learner, Decision tree classifier | WEKA | √
(Al-Barrak, M. A., & Al-Razgan, M. 2016) | Data Visualization, Classification | E-learning Web Miner, WEKA | √
(Baradwaj, B. K., & Pal, S. 2012) | Classification and Decision trees | Manual | √
(Elbadrawy et al, 2016) | Matrix factorization based methods, regression based methods | Recommender system based personal analytics | √
(Bydžovská, H. 2013) | EDM, SNA and Collaborative filtering | WEKA and R | √
(Kaur, G., & Singh, W. 2016) | Naïve Bayes and J48 Decision trees | WEKA | √
(Kokina, J., Pachamanova, D., & Corbett, A., 2017) | Predictive Modeling, Data Visualization | Excel and Tableau | √
Chapter 3
RESEARCH METHODOLOGY
This chapter presents the methodology adopted to address the research
questions presented in chapter 1. The focus of the research is accurate grade
prediction for the critical first-semester courses of BS CS students at CUST.
Prediction will help in taking appropriate measures to control the attrition
rate and hence will benefit students, parents and the university. With the
same objective, appropriate faculty members are also recommended based on the
results of the previous semester. The factors under consideration for the
proposed research are classified into three categories: Demographic,
Pre-university and Institutional. Such research in the field of educational
data mining has previously been conducted abroad. Here, the experiments have
been conducted in the local context, Pakistan. For this purpose, all the data
have been collected from Capital University of Science and Technology,
Islamabad, Pakistan. After data collection and pre-processing, data mining
techniques such as SVM, linear regression and non-linear regression were
selected in WEKA.
This study will be useful not only in improving the overall performance of
students but also in reducing the attrition rate. Such measures are now easier
to take on the basis of early prediction. The field of data mining has emerged
in connection with natural language processing, artificial intelligence, visual
data analytics, and social data analysis (Romero, C., & Ventura, S. 2013)
(Zhang, Y. et al., 2010). This early prediction is computed using the term marks
and mid-term marks of the students, along with their pre-university factors
(Intermediate/O-levels marks and Matriculation marks) and demographic factors
(gender and city). This methodology is discussed in the current chapter. The
related work of this research has been discussed in detail in chapter 2.
Chapter 3 contains a detailed discussion of the proposed methodology along with
the data flow diagram given in Figure 3.1.
Figure 3.1: Data Flow Diagram of Proposed Methodology
To conduct the experiments, the data was first collected from CUST (Capital
University of Science and Technology), Islamabad, Pakistan. The dataset
contains the data of six semesters of the BSCS program. On the basis of the
adopted factors, the GPA of the 1st semester as well as the CGPA of the 2nd
semester have been predicted. In addition, the performance of teachers in their
respective courses has been measured through ANOVA analysis. The experiments
have been conducted in WEKA. Each research question has been answered with the
help of the proposed methodology and experiments.
Spring 2014 74
Spring 2016 62
Spring 2017 58
Total 667
However, according to earlier research (Junaid, 2017), the most crucial courses
with respect to attrition or students' performance are ITC/CP and Cal-I. The
results of only these two courses are therefore considered, and the remaining
three are excluded.
The collected data covers first-semester students over seven semesters, from
Spring 2014 (Term 141) to Spring 2017 (Term 171), and second-semester students
over six semesters, from Fall 2014 (Term 143) to Spring 2017 (Term 171). The
courses of the first semester, as mentioned earlier, are Cal-I and ITC/CP, and
the second-semester courses are Cal-II and CP/OOP, taken by each new spring
intake that starts in semester 1. The aim of this research is to identify the
factors that help retain students until the end of their semesters as well as
the factors that affect their performance. Table 3.2 shows the information
about the students comprehensively.
Term | 1st Sem | 2nd Sem | 3rd Sem | 4th Sem | 5th Sem | 6th Sem | 7th Sem | 8th Sem
161 | 94% | 6% | 0% | 0% | 0% | 0% | 0% | 0%
163 | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
The dataset contains the institutional factors (Registration Number, Name,
GPA, and internal marks such as midterm and term work marks) and overall
information about the students, such as registration number, GPA, registered
courses, and Matriculation and Intermediate results. To conduct the
experiments, we have used demographic and pre-university factors of the
students in addition to institutional factors. The dataset includes the data of
terms 141, 143, 151, 153, 161, 163, and 171. The term codes of the spring
semesters are associated with the registration numbers of the students; for
example, term 141 means Spring 2014. Similarly, the data of Spring 2015 and
Spring 2016 was collected, giving a three-year record of the students. The
dataset file is converted into CSV and ARFF format and imported into WEKA for
pre-processing, after which further experiments are conducted.
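The term codes can be read mechanically. The following is a minimal sketch, assuming (from the examples given in the text: 141 = Spring 2014, 143 = Fall 2014, 171 = Spring 2017) that the first two digits give the year and the last digit gives the season. The function name decode_term and the season mapping are illustrative assumptions, not part of the original tooling.

```python
# Hypothetical helper: decode a CUST-style term code such as 141.
# Assumption: first two digits = year (20xx), last digit 1 = Spring, 3 = Fall.
SEASONS = {1: "Spring", 3: "Fall"}


def decode_term(code: int) -> str:
    """Return a readable label such as 'Spring 2014' for a term code."""
    year = 2000 + code // 10
    season = SEASONS[code % 10]  # raises KeyError for unknown season digits
    return f"{season} {year}"


# All term codes listed in the dataset description.
labels = [decode_term(c) for c in (141, 143, 151, 153, 161, 163, 171)]
```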
3 https://measuringu.com/handle-missing-data/
Noisy Data Handling
During data collection, irrelevant data is referred to as noise. While the
data of BSCS students from Spring 2014 to Spring 2016 was being collected, data
belonging to Software Engineering and Bioinformatics was also present and was
eliminated through filters in WEKA. Noisy data also occurred in the form of
errors such as GPA = “−0.4” and Intermediate or Matriculation marks = “-455,
544.8”, which were handled during pre-processing.
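The kind of check described above can be sketched as follows. This is only an illustration: the thesis performed the filtering in WEKA, and the record layout, the function name is_noisy and the 0.0 to 4.0 GPA scale are assumptions. Only the two rules that can be read off the examples (GPA out of range, negative marks) are encoded.

```python
# Hypothetical noise check for one student record.
# Assumptions: GPA is on a 0.0-4.0 scale and marks can never be negative.
def is_noisy(record, gpa_range=(0.0, 4.0)):
    """Return True if the record has an out-of-range GPA or negative marks."""
    low, high = gpa_range
    gpa_ok = low <= record["gpa"] <= high
    marks_ok = all(m >= 0 for m in record["marks"])
    return not (gpa_ok and marks_ok)


records = [
    {"gpa": -0.4, "marks": [700, 650]},   # invalid GPA
    {"gpa": 3.1, "marks": [-455, 800]},   # invalid marks
    {"gpa": 3.1, "marks": [820, 790]},    # clean record
]
clean = [r for r in records if not is_noisy(r)]
```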
Attribute Selection
After the missing data handling and outlier detection phases, the final
attributes on which the overall experiments and results depend were selected.
These attributes, listed in table 3.1, belong to all three types of factors:
demographic, pre-university, and institutional. One more factor, teacher
performance, has been considered to answer our research question. Once the
attributes were finalized and data pre-processing was complete, classification
algorithms were applied and evaluated to answer the research questions. The
adopted classifiers are discussed in detail in this chapter.
3.2 Classification
Before applying the algorithms, the factors used earlier by foreign
researchers were grouped. These factors were compared against the factors used
by researchers in the local context (Pakistan). Moreover, we have expanded our
research by forming different combinations of attributes and comparing them.
Comprehensive details of these factors are given in table 3.4.
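Table 3.3: Selection of Attributes
Sr.No. | Factors | Attribute Name
1 | DEMOGRAPHIC | Gender
2 | DEMOGRAPHIC | Age
3 | DEMOGRAPHIC | Residence
4 | DEMOGRAPHIC | Location
5 | DEMOGRAPHIC | Race
6 | DEMOGRAPHIC | Father’s qualification
7 | DEMOGRAPHIC | Father’s occupation
8 | PRE-COLLEGE | Secondary school grade
9 | PRE-COLLEGE | Higher secondary grade
10 | PRE-COLLEGE | Pre-college
11 | PRE-COLLEGE | Pre-program
12 | PRE-COLLEGE | SAT score
13 | INSTITUTIONAL | GPA (Term1, Term2, Term3, Term4)
14 | INSTITUTIONAL | CGPA
15 | INSTITUTIONAL | Financial status
16 | INSTITUTIONAL | Total credit hours taken
17 | INSTITUTIONAL | Total courses taken
18 | INSTITUTIONAL | Initial Major
19 | INSTITUTIONAL | Current Major
20 | INSTITUTIONAL | Current enrollment status
21 | TEACHER | Teacher methodology
22 | TEACHER | Instructor grading
23 | TEACHER | Instructor feedback
24 | TEACHER | Instructor teaching methodology
25 | TEACHER | Teacher name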
3.3 Algorithms and Techniques
To evaluate our research, two classification algorithms, Naïve Bayes and J48,
have been compared. These classifiers were chosen for several reasons; for
example, both support categorical and nominal attributes. For this purpose, the
WEKA 3.8 tool has been used. The tool is open source, available on the web, and
is generally used for machine learning algorithms, classification, and training
and testing on data. The data can then also be visualized in the form of
graphs. In the present research, the formed combinations of attributes have
been evaluated to answer our research questions. The data was converted into
CSV format for import into WEKA, and the results from various filters and
classifiers were accumulated.
a) Naïve Bayes
Naïve Bayes is a comparatively fast classification algorithm. It works quickly
on huge datasets by using Bayes' theorem of probability, which is generally
used to predict the class of an unknown data item [4]. The algorithm relies on
the assumption that the features of an item contribute independently to the
probability of its class, and it labels an item whose features are known but
whose name is unknown. For example, a fruit is labeled as an apple if it is
round, red in color, and 3 inches in diameter. Each of these features raises
the probability that the fruit is an apple.
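The fruit example can be turned into a small worked sketch. The code below is an illustrative categorical Naïve Bayes with add-one (Laplace) smoothing, not WEKA's implementation; the toy fruit data and the function names train_nb and predict_nb are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)           # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values


def predict_nb(model, row):
    """Return the class with the highest log posterior under independence."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)  # log prior
        for i, v in enumerate(row):
            # Add-one smoothing so unseen values do not zero out the product.
            score += math.log((feat_counts[(y, i)][v] + 1) / (n_y + len(values[i])))
        if score > best_score:
            best, best_score = y, score
    return best


# Toy data (shape, color, size); purely illustrative.
rows = [("round", "red", "3in"), ("round", "red", "3in"),
        ("long", "yellow", "6in"), ("round", "orange", "3in")]
labels = ["apple", "apple", "banana", "orange"]
model = train_nb(rows, labels)
prediction = predict_nb(model, ("round", "red", "3in"))
```

Working in log space avoids numeric underflow when many features are multiplied together.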
b) J48
The J48 decision tree, WEKA's implementation of the C4.5 algorithm, is used to
predict the target variable of a new dataset. If a dataset contains predictors
(independent variables) and a target (dependent variable), this algorithm is
applied to derive the target variable of a new dataset [5].
c) Linear Regression
4 https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
5 http://data-mining.business-intelligence.uoc.edu/home/j48-decision-tree
The second main thing that linear regression does is identify the variables
that are significant predictors of the dependent variable. Finally, the
regression equation is used, which helps to determine the set of predictors
used to predict the outcome. In this research, these algorithms are used to
compare the trends and patterns of the factors with other approaches such as
non-linear regression and SMO.
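As an illustration of the regression equation mentioned above, the following minimal sketch fits a single-predictor line y = a + b·x by ordinary least squares. It is not the WEKA procedure used in the experiments; the function name fit_line and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: a predictor (e.g. mid-term marks) and an outcome.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = intercept + slope * 5  # prediction for a new predictor value
```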
RQ1: Whether existing identified factors for grade prediction are valid in
our local context (Pakistan)?
Research question 1 examines whether factors that researchers have established
in foreign contexts are valid in the local context. What distinguishes research
in the local context from the foreign context is that norms, traditions and
culture vary from country to country. It therefore needs to be evaluated
whether the existing factors that work in the colleges and universities of
Canada, the USA, England, Nigeria, etc. are applicable in Pakistan. After
pre-processing the whole dataset, classification algorithms were applied to
analyze the behavior of demographic, pre-university and institutional factors.
According to the results, these factors are also valid in Pakistan because the
differences in accuracy are small. Detailed results are discussed in chapter 4.
For our research, data was collected from term 141 to term 171, covering 667
students. Three types of existing attributes were then taken to predict the
performance of students. The first type is demographic (Gender, Age, Residence,
Location, Race, Father’s Qualification), of which only the Gender and Age of
the student were considered. The second type is Institutional (GPA (Term1,
Term2, Term3, Term4), CGPA, Financial status, Total credit hours taken, Total
courses taken, Initial Major, Current Major, Current enrollment status, Teacher
methodology); here, Midterm and Term Marks were considered for the prediction
of the final CGPA.
During pre-processing, the total number of students who could not continue
their program and left the course incomplete was determined. Of the 667
students, 7 left the course and 660 remained. Two types of errors occurred in
this dataset: missing data and incomplete data, both of which were corrected in
the pre-processing phase.
Table 3.5 shows the attributes selected from the dataset for the experiments.
Sr. No. | Gender | City | Matric | F.Sc | Mid Eng-1 | Mid Cal-1 | Mid ITC | Mid Physics | Mid Pak-Studies | GPA
RQ2: Which factors help in accurate prediction of student’s GPA of first
semester?
Some specific attributes from all three factors (demographic, institutional
and pre-university) have been selected. Combinations of those attributes were
constructed and filters were then applied to each. In total there are 21 unique
combinations of attributes, presented in table 3.6.
Sr. No. | ATTRIBUTES
1 | Demographic, Pre-Qualification, Institutional (Result all subject of 1st semester)
2 | Demographic, Pre-Qualification, Institutional (Result Eng-1, Cal-1, ITC subject of 1st semester)
3 | Demographic, Pre-Qualification, Institutional (Result Eng-1, Cal-1 subject of 1st semester)
4 | Demographic, Pre-Qualification, Institutional (Result Eng-1 subject of 1st semester)
5 | Demographic, Pre-Qualification, Institutional (Cal-1 subject of 1st semester)
6 | Demographic, Pre-Qualification, Institutional (Result ITC subject of 1st semester)
7 | Demographic, Pre-Qualification, Institutional (Result Cal-1, ITC subject of 1st semester)
8 | Demographic, Pre-Qualification, Institutional (Result Eng-1, ITC subject of 1st semester)
9 | Demographic, Pre-Qualification, Institutional (GPA)
10 | Demographic, Pre-Qualification
11 | Demographic
12 | Pre-Qualification
13 | Pre-Qualification, Institutional (Result all subject of 1st semester)
14 | Demographic, Institutional (Result all subject of 1st semester)
15 | Institutional (Result Eng-1, Cal-1, ITC subject of 1st semester)
16 | Institutional (Result Eng-1 subject of 1st semester)
17 | Institutional (Result Cal-1 subject of 1st semester)
18 | Institutional (Result ITC subject of 1st semester)
19 | Institutional (Result Eng-1, Cal-1 subject of 1st semester)
20 | Institutional (Result Cal-1, ITC subject of 1st semester)
21 | Institutional (Result Eng-1, ITC subject of 1st semester)
RQ3. Is it possible to improve the performance of a student by
allocating an appropriate teacher for a subject?
An observation commonly made during the research period is that the personality
of the teacher affects the performance of the students. If the background of
the teacher is already known with the help of a prediction system, it may
become feasible for the department to recommend the most suitable teacher to
the student according to his level of interest. This can boost the performance
of the student. To answer research question 3, experiments have been performed
using two approaches: statistical experiments and predictive experiments.
case of experiments [6].
c) Z-score data analysis
Unlike ANOVA (analysis of variance), which is usually applied to three or more
means, a Z-test compares two means. A Z-test is a type of hypothesis test,
which is a way to find out whether the results obtained from a test are valid
or need to be repeated. For example, if someone claimed to have found a new
drug to cure cancer, one would want to be sure the claim was probably true.
Similarly, in our research, we have applied the Z-test to compare the
performance of teachers with respect to their subject. The Z-test indicates
how likely it is that the obtained results are true. A Z-test is generally
used when the data is approximately normally distributed and arranged in
pairs [7].
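A two-sample Z-test of the kind described can be sketched as below. This is an illustrative computation, not the exact procedure used in the thesis: the function name two_sample_z and the class averages are hypothetical, and sample variances are used in place of known population variances, which is only reasonable for large, approximately normal samples.

```python
import math
from statistics import mean, variance


def two_sample_z(a, b):
    """Return (z, two-sided p-value) for the difference between two means."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z|/sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical class averages for two teachers (tiny sample, illustration only).
teacher_x = [70, 72, 74, 76, 78]
teacher_y = [60, 62, 64, 66, 68]
z, p_value = two_sample_z(teacher_x, teacher_y)
```

A small p-value suggests that the difference between the two teachers' averages is unlikely to be due to chance alone.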
6 http://www.statisticssolutions.com/manova-analysis-anova/
7 http://www.statisticshowto.com/z-test/
Chapter 4
Data collection and pre-processing are the initial steps towards the analysis.
For this purpose, the demographic data, which contains the student’s gender and
city, was collected from the registrar office of CUST. With the cooperation of
the administrative resources of CUST, the data related to pre-university and
institutional factors was also collected. The institutional and pre-university
factors contain data about mid-term marks, term marks, GPA, intermediate marks
and matriculation marks. The university portal provided access to this student
data so that a sufficient dataset could be gathered for our experiments.
The gathered data came in three formats: PDF, Word and Excel. For
pre-processing, the data from all these files first had to be brought into a
single Excel file. Figure 4.1 represents the file formats and the transformation
of all files into a single one. The purpose of the Excel file is to import the
data into WEKA and apply filters for pre-processing. The final version of the
dataset in Excel form is then converted into CSV format and imported into WEKA.
The filters that have been applied are discussed in the subsections of section
4.1.
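WEKA can read CSV directly, but its native format is ARFF. The sketch below shows, under simplifying assumptions, what such a conversion involves: a header line per attribute followed by the data rows. The function name csv_to_arff, the column names and the sample rows are hypothetical; attribute names containing spaces and missing values are not handled here.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal):
    """Convert CSV text to a minimal ARFF document.

    Columns named in `nominal` become nominal attributes listing their
    observed values; all other columns are declared numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if name in nominal:
            values = sorted({row[i] for row in data})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"] + [",".join(row) for row in data]
    return "\n".join(lines)


sample = "Gender,GPA\nMale,3.2\nFemale,2.8\n"
arff = csv_to_arff(sample, "students", nominal={"Gender"})
```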
Sr. No. | Gender | Teacher | ITC | Mid Term | Term Work | Weighted Total
4 | Male
Data mining methods differ in the way they treat missing values. Normally,
they ignore the missing values, exclude the records that contain missing
values, replace missing values with the mean, or infer missing values from
existing values. Missing value replacement policies include a number of
strategies, such as ignoring the records with missing values. The following
example shows the occurrence of missing values in the dataset. Missing data was
handled by taking the average of the last three records of students and filling
in the missing value, which yields a complete dataset. This dataset has then
been used in the experiments to evaluate all research questions.
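One possible reading of this replacement policy is sketched below: a missing value in a column is replaced by the mean of the three most recent preceding values in that column. This interpretation, the function name fill_with_last_three and the sample marks are assumptions made for illustration.

```python
def fill_with_last_three(values):
    """Replace None entries with the mean of up to three preceding values.

    A missing value with no earlier values to average is left as None.
    """
    filled = []
    for value in values:
        if value is None:
            recent = [v for v in filled if v is not None][-3:]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


# Hypothetical mid-term marks with one missing entry.
mid_term = fill_with_last_three([10, 20, 30, None, 50])
```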
Sr. No. | Gender | Teacher | ITC | Mid Term | Term Work | Final Exam | Weighted Total
Attribute Selection
Sampling
Data sampling is a statistical analysis technique used to select, manipulate
and analyze a representative subset of data points in order to identify
patterns and trends in the larger dataset being examined. For example, the data
of terms 141, 143, 151, 153, 161 and 163 has been used as training data and
term 171 as test data, because the results of the corresponding students are
the ones to be predicted.
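The term-based split described above can be expressed in a few lines. This is a minimal sketch; the record layout (a dictionary with a "term" key) and the function name split_by_term are assumptions for illustration.

```python
# Terms used for training and testing, as stated in the text.
TRAIN_TERMS = {141, 143, 151, 153, 161, 163}
TEST_TERMS = {171}


def split_by_term(records):
    """Split student records into training and test sets by term code."""
    train = [r for r in records if r["term"] in TRAIN_TERMS]
    test = [r for r in records if r["term"] in TEST_TERMS]
    return train, test


# Hypothetical records.
records = [{"term": 141, "gpa": 3.0}, {"term": 163, "gpa": 2.5},
           {"term": 171, "gpa": 3.4}]
train, test = split_by_term(records)
```

Splitting by term rather than at random keeps the most recent intake entirely unseen during training, which mirrors how the predictions would be used in practice.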
RQ1: Whether existing identified factors for grade prediction are valid in
our local context (Pakistan)?
As the existing factors and their attributes have been stated in the sections
above, we have formed combinations of local factors with the existing factors.
We found a slight variation between both types.
Sr. No. | Gender | City | Matric | FSc | Mid Eng-1 | Mid Cal-1 | Mid ITC | Mid Physics | Mid Pak-Studies | GPA
As input, we imported into WEKA the CSV file that contains information about
each attribute for every student. The input data contains the data of students
of the BSCS department from term 141 to term 163. The dataset of CUST (Capital
University of Science and Technology) has been used for this purpose. Among
those factors, attributes such as Previous Course Marks/Grade, GPA, SSG, HSSG,
Gender and City have been used. Afterwards, we applied the well-known
classification algorithms Naïve Bayes and J48 to the dataset.
- | Student Demographic, Pre-Qualification, 1st semester 5 Subjects Grade | CGPA Grade | 69.12% | 72.39%
Data mining approach for predicting student performance (2012) | Student Demographic, High School Background, Scholarship, Social network interaction | CGPA Grade | 76% | 73%
- | Student Demographic, Pre-Qualification | CGPA Grade | 49.90% | 63.19%
Predicting Student Performance: A Statistical and Data Mining Approach (2013) | Student Demographic, High School Background | CGPA Grade | 50% | 65%
results. The inferences show slight variations, which are notable with respect
to our research. The existing attributes are shown in black in the summary
table, and the attributes in red are local attributes.
The summarized results are shown in figure 4.2, in which bars of different
colors on the x-axis show the attributes present and the y-axis shows the
accuracy.
Our findings indicate that the existing identified factors for grade prediction
are comparable with the local factors because of the low variation in their
accuracies. According to the results, Previous Grade, Internal Marks and HSSC
were found to be effective for grade prediction.
Our second research question concerns extracting the factors that help in
accurate prediction of the GPA of 1st-semester students. To answer this
research question, we have approached it in the following two ways.
1. Based on Mid term marks
Ranked Attributes
0.2277 Cal-I MID
0.1903 Cal-II MID
0.1778 ITC MID
0.1684 Physics MID
0.1548 Pak-Studies MID
0.1514 FSC
0.1006 City
0.0992 Matric
0.0791 Gender
According to the results, CAL-I MID and CAL-II MID rank highest. The scores
then gradually decrease from ITC MID to GENDER. Based on these scores, we may
infer the importance of each attribute from its information gain ranking over
all the instances.
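For readers unfamiliar with the ranking criterion, the sketch below shows how information gain is computed for one categorical attribute: the entropy of the class labels minus the weighted entropy that remains after splitting on the attribute. It is an illustration only; the toy data is invented, numeric attributes such as marks would first have to be discretized, and WEKA's own evaluator may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy example: a discretized mid-term mark separates the grades perfectly,
# while gender carries no information about them.
grades = ["pass", "pass", "fail", "fail"]
gain_mid = information_gain(["high", "high", "low", "low"], grades)
gain_gender = information_gain(["M", "F", "M", "F"], grades)
```

Attributes are then ranked by this value, highest first.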
1- First Semester GPA Prediction
We have expanded our answer to research question 2. With the help of “GPA
Prediction for the first semester”, we were able to evaluate our results in
WEKA. We covered the list of attributes given in table 4.6, in which the
combinations of attributes are defined in column 1. There are four combinations
of attributes, which belong to the three factors: Pre-Qualification,
Demographic and Institutional. The performance of J48 is considerably better
than that of Naïve Bayes, as shown in the table and presented in figure 4.3.
Demographic, Pre-Qualification, Institutional (Internal marks only Midterm all subject of 1st semester) | GPA Grade | 66.46% | 79.55%
Demographic, Pre-Qualification, Institutional (Internal marks only Midterm Eng-1, Cal-1, ITC subject of 1st semester) | GPA Grade | 69.33% | 78.12%
Figure 4.3: GPA Prediction for the 1st semester
2- CGPA of second Semester
To predict the CGPA of the 2nd semester, we defined the set of experiments
summarized in table 4.7. We merged all the grades of the previous semester
with the adopted factors. These combinations and their accuracies with Naïve
Bayes and J48 are presented in table 4.4.
Attributes | Class Label | Naïve Bayes | J48
Demographic, Pre-Qualification, Institutional (Subject Grade Cal-1 subject of 1st semester) | CGPA Grade | 67.08% | 74.44%
Demographic, Pre-Qualification, Institutional (Subject Grade ITC subject of 1st semester) | CGPA Grade | 68.10% | 74.44%
These tabular results are presented in figure 4.4. As with the 1st-semester
GPA prediction, we used the same classifiers for the CGPA prediction of the 2nd
semester. The only change lies in the combinations of attributes. According to
the accuracy of each combination, the combination Demographic,
Pre-Qualification, Institutional (Subject Grade Cal-1, ITC subject of 1st
semester) had high accuracy. When the J48 classifier was applied, however, the
combinations Demographic, Pre-Qualification, Institutional (Subject Grade Cal-1
subject of 1st semester) and Demographic, Pre-Qualification, Institutional
(Subject Grade ITC subject of 1st semester) performed better, by a minor
difference, than the other two combinations, namely Demographic,
Pre-Qualification, Institutional (Subject Grade Cal-1, ITC subject of 1st
semester) and Demographic, Pre-Qualification, Institutional (Subject Grade
Eng-1, Cal-1 subject of 1st semester).
It was then found that the first-semester subjects, i.e. Cal-1, Eng-1 and
ITC, have high information gain and play an effective role in predicting
the grades of students accurately.
The dataset shows that the attrition rate is high in the first semester.
Since the purpose of the research is to reduce the attrition rate, we
have considered these subjects to predict the attrition rate of students
as well as their performance.
The results state that demographic attributes do not play a significant
role in the grade prediction of students.
In addition, internal marks such as mid-terms are used for early grade
prediction, which has high accuracy and also helps to minimize the
attrition rate.
The experiments have been conducted subject-wise for individual course grade
prediction. On the basis of course grades, the accumulated teacher performance
has been computed. Furthermore, the following calculations have been performed
for the Cal-I teachers:
The Maths courses and Pre-qualification have been compared in this phase of
the experiment. The teachers are labeled alphabetically, and the overall
average of the students is given against every teacher in table 4.8.
E 58.96428571 73 72.75
The results show that the performance of teachers B, D and E is better on the
Pre-qualification combination. With respect to Cal-I, the performance of
teachers B, G and H remained better. Using the Cal-II attribute, the
performance of teachers B, C and D was better than that of the other teachers.
These results are shown graphically in figure 4.5.
Figure 4.5: Pre-Qualification vs Maths Courses
Average of Cal-II Teachers
In this case, we predicted the performance of the teachers who teach Cal-II and
compared it with their actual performance in Cal-II. According to the results,
the outcomes were unexpected: there is only a minor variation between actual
Cal-II and predicted Cal-II. The results are given in table 4.9 and presented
in figure 4.6.
Figure 4.6: Average of Cal-II teachers
Average of Cal-I teachers
The average of the Cal-I teachers was compared using the actual and
predicted values. An additional attribute, “Pre-qualification”, was used in
this case. The results are given in table 4.10 and presented in figure 4.7.
Figure 4.7: Average of Cal-I teachers
Figure 4.8: Overall Average of Maths Teachers
After computing the results for Cal-I and Cal-II, we found that teacher “E”
performs better and can be recommended to teach Cal-I and Cal-II. The results
for the other subjects taught in the 1st semester will be computed in the
same way.
First, we computed the performance of every teacher of each subject taught in
the first and second semesters. The results were interpreted for every
subject with respect to its teacher. They show the average of all students in
Cal-I and Cal-II against the name of the Cal-I teacher. After the computation
for Cal-I and Cal-II, it became clear that teacher “E” performs best and can
be recommended to teach Cal-I and Cal-II, followed by teacher “B”; teachers
“B” and “H” also perform better than all the remaining teachers.
significant. To verify these results, we applied ANOVA (Analysis of Variance)
to the teachers' averages in this dataset. In this way, the results indicate
the performance of all Cal-I teachers on a group basis.
SUMMARY
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        872.7731   7    124.6819   9.273853   3.14E-06   2.312741
Total                 1302.996   39
In this case F > F crit (9.273853 > 2.312741). We therefore reject the null
hypothesis and conclude that there are significant differences between the
teachers, because the 8 teacher groups (df = 7) do not all have the same mean.
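The F statistic in the ANOVA table above can be cross-checked from the reported sums of squares and degrees of freedom. The sketch below assumes the standard one-way ANOVA identities (within-group values obtained as total minus between-group); the function name `f_from_anova_table` is hypothetical, and the critical value is copied from the table rather than recomputed.

```python
def f_from_anova_table(ss_between, df_between, ss_total, df_total):
    """Recompute the one-way ANOVA F statistic from a summary table."""
    ss_within = ss_total - ss_between   # within-group sum of squares
    df_within = df_total - df_between   # within-group degrees of freedom
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within

# Values taken from the Cal-I ANOVA table above.
f_cal1 = f_from_anova_table(872.7731, 7, 1302.996, 39)
F_CRIT = 2.312741  # critical value as reported in the table

reject_null = f_cal1 > F_CRIT
print(f_cal1, reject_null)
```

The recomputed value agrees with the reported F of 9.273853 up to rounding of the tabulated inputs, so the decision to reject the null hypothesis follows directly from the table.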
Z-Score Data Analysis
Next, the Z-score test is applied to evaluate the performance of every pair
of teachers subject-wise. Generally, the Z-test is applied to two populations
in order to compare them. In this case, we compare the performance of the
teachers of the same course in order to recommend a teacher to the student.
Each Z-test is classified with respect to the subject, and the Z-score
results are first considered for Cal-I. All the teachers who teach Cal-I are
indicated with alphabetic characters, i.e. A, B, C, D, E, and F. Pairs such
as teacher A with each of the other Cal-I teachers are formed for the
experiment. Moreover, for the comparison, the teachers' data are classified
into two cases: same number of students and different number of students.
Table 4.13: Two Samples for Mean with respect to teacher “E”
Z-score data analysis was applied to Cal-I teacher “E” paired with teacher
“B” on the basis of a different number of students. Since Z is smaller than
the two-tailed Z-critical value, the performance difference for teacher “E”
is not significant.
Then teacher “B” was paired with teacher “A” on the basis of the same number
of students. Comparing Z with the Z-critical value, the test indicates a
significant performance of teacher “A”. The results are shown in table 19.
The Z-test was applied in both cases, and the results were compared to see
whether the performance of the teachers remains the same or varies. In this
way, the results indicate the performance of every Cal-I teacher on the basis
of combinations. The Z-test results are interpreted with the following rule:
if Z < Z-critical (two-tail), the performance difference is not significant;
if Z > Z-critical (two-tail), the performance difference is significant.
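The decision rule stated above can be sketched as a two-sample Z-test for means. The sketch assumes known (or large-sample) variances and a significance level of 0.05, which yields the two-tailed critical value 1.959964 that appears in the tables; it compares the absolute value of Z with the critical value. The function name and the sample statistics are hypothetical, not thesis data.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, var1, n1, mean2, var2, n2, alpha=0.05):
    """Two-sample Z-test for means; returns (z, z_critical, significant)."""
    z = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    return z, z_critical, abs(z) > z_critical

# Hypothetical class statistics for two teachers of the same course.
z, z_critical, significant = two_sample_z(78.0, 100.0, 40, 70.0, 120.0, 30)
print(round(z, 4), round(z_critical, 6), significant)
```

Because the variances are divided by each sample size separately, the same function covers both cases considered in this section, the same and a different number of students per teacher.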
Table 4.14: Two samples for means with respect to teacher “B”
Here, Z-score data analysis was applied to Cal-I teacher “B” paired with
teacher “A” on the basis of a different number of students, who enrolled in
different terms from term 141 to term 171. In this case, the two-tailed
Z-critical value is greater than Z, which shows that the performance of
teacher “B” is not significant.
Then teacher “B” was paired with teacher “A” on the basis of the same number
of students, and the Z-test was applied. The results show that the Z-critical
value is greater than Z, which the analysis reports as a significant
performance of teacher “A”.
Teacher B with different and same number of Students
Z-score data analysis was applied to Cal-I teacher “B” paired with all other
teachers on the basis of a different number of students. Where the two-tailed
Z-critical value is greater than Z, we conclude that the performance of
teacher “B” is not significant with respect to the paired teacher among “A”,
“C”, “D”, “F”, “G”, “H” and “E”.
Table 4.15: Teacher B with different number of Students
Teacher   Z          Z-critical (two-tail)
A         3.687091   1.959964
C         1.235961   1.959964
D         3.480026   1.959964
F         2.829936   1.959964
G         3.218075   1.959964
H         1.61421    1.959964
E         1.271677   1.959964
Z-score data analysis was applied to Cal-I teacher “B” paired with all other
teachers on the basis of the same number of students. Comparing the
two-tailed Z-critical value with Z, we conclude that the performance of
teacher “B” is not significant with respect to teachers “A”, “C”, “D” and
“F”, and significant with respect to teachers “G” and “H”. The ANOVA test was
applied to Cal-I only. The results for the rest of the courses were computed
similarly.
Pre-Qualification vs Programming Courses
Table 4.17 reports the results of all programming courses taught in the 1st
and 2nd semesters. The course averages are compared with the
Pre-Qualification.
computed. The performance of the teachers changes with respect to the course
averages.
F 59.72643733 69.13777079 77
Figure 4.10: Comparison of Actual and Predicted Grades of ITC/ITP
Then the actual and predicted OOP/CP grades were compared for every teacher.
The results show a slight variation, as stated in table 4.19 and presented in
figure 4.11.
Figure 4.11: Comparison of Actual and Predicted Performance of
OOP/CP Teachers
The overall performance of the OOP teachers is presented in table 4.20 and
in figure 4.12.
The results show that the performance of teachers “E” and “F” is considerably
higher than that of the other teachers. Therefore, based on these results, we
may recommend these two teachers to teach the courses in the respective
semesters.
Figure 4.12: Overall performance of OOP Teachers
These results show the average of all students in ITC/ITP and CP/OOP against
the name of the ITC/ITP teacher. From the computed results for ITC/ITP and
CP/OOP, we found that teacher “F” performs better and can be recommended to
teach ITC/ITP and CP/OOP.
ANOVA is used for grouped data, since each subject is taught by many
teachers. The average prediction shows that the teachers' average performance
differs significantly. To verify this result, we applied ANOVA to this
dataset.
If F > F crit, we reject the null hypothesis. This is the case here
(8.542703 > 1.952212), so we reject the null hypothesis: the means of the 12
teacher groups (df = 11) are not all equal, and at least one of the means is
different. Hence we conclude that there are significant differences between
the teachers. However, ANOVA does not tell us where the difference lies.
Therefore, a T-test or Z-score test is applied to each pair of means.
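Because ANOVA only shows that at least one mean differs, every pair of teachers has to be tested separately. A minimal sketch of enumerating those pairs is given below. It assumes 12 teacher groups, as the between-groups degrees of freedom (11) in table 4.21 suggest; the labels A to L are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for 12 teacher groups (df between groups = 11).
teachers = [chr(ord("A") + i) for i in range(12)]

# Every unordered pair of teachers is one pairwise Z-test or T-test.
pairs = list(combinations(teachers, 2))
print(len(pairs), pairs[:3])
```

The number of pairwise tests grows quadratically with the number of teachers, which is why the thesis tabulates the comparisons for one reference teacher at a time.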
Table 4.21: ANOVA Data Analysis on ITC/ITP
SUMMARY
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        1075.12    11   97.73818   8.542703   7.87E-09   1.952212
Within Groups         686.4678   60   11.44113
Total                 1761.588   71
have applied the Z-test to the combinations of teachers who teach ITC/ITP.
The combinations were formed in the same way as for the teachers of Cal-I and
Cal-II. In table ..., the Z-test is applied to each pair and tested in two
cases: the first compares the performance of teachers on the same number of
students, and the second compares it on a different number of students.
Z-score data analysis for ITC/ITP was applied to the pair of teacher “F” and
teacher “A” on the basis of a different number of students. The results show
a significant performance difference between teacher “F” and teacher “A”.
               Variable 1   Variable 2
Mean           77           61.48148
Observations   67           27
Z              4.116767
Z-score data analysis was also applied to the pair of ITC/ITP teacher “F” and
teacher “A” on the basis of the same number of students, and the results
again show a significant performance difference between teacher “F” and
teacher “A”.
Table 4.236: Z-Score Data Analysis Teacher F with A same number of
Students
               Variable 1   Variable 2
Observations   27           27
Z              3.359441
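The two Z values reported for teacher “F” against teacher “A” can also be expressed as two-tailed p-values, which makes the strength of the evidence easier to read. The sketch below assumes a standard normal reference distribution; the helper name `two_tailed_p` is hypothetical, and the p-values are not reported in the thesis.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value of a Z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Z values reported above for teacher F vs. teacher A.
p_different_n = two_tailed_p(4.116767)  # different number of students
p_same_n = two_tailed_p(3.359441)       # same number of students

print(p_different_n, p_same_n)
```

Both p-values fall well below 0.05, which is consistent with the significant differences stated for this pair in both cases.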
As before, Z-score data analysis for ITC/ITP was computed for teacher “F”
paired with the other teachers of the same subject on the basis of a
different number of students. According to the summarized results, the
performance difference between teacher “F” and each of the other teachers is
significant.
The same analysis was repeated on the basis of the same number of students.
According to the summarized results, the performance difference between
teacher “F” and each of the other teachers is again significant.
Table 4.24: Z-Score Data Analysis Teacher F with other different number
of Students
Table 4.25: Z-Score Data Analysis Teacher F with other same number of
Students
Summary
To examine whether students' performance or grades depend on the teacher's
performance, we took an example to demonstrate. We selected the course Cal-I
to justify our analysis. For this task, we considered the courses Cal-I and
Cal-II, and ITC/ITP and CP/OOP. Using different approaches, and on the basis
of these results, we arrived at the following assumptions.
Cal-I and ITC/ITP are taught by different teachers. We obtained the
teacher-wise average marks of students from session 141 to 163 in Cal-I and
ITC/ITP using five different experiments: (1) the average of the actual Cal-I
marks based on the Cal-I teacher; (2) prediction of the Cal-I marks excluding
the teacher name, followed by the teacher-wise average; (3) prediction of the
Cal-I marks including the teacher name, followed by the teacher-wise average;
(4) the average of the actual Cal-II marks based on the Cal-I teacher; and
(5) prediction of the Cal-II marks, followed by the average of these marks
based on the Cal-I teacher. Our predicted results were either higher than the
actual results or slightly lower than them. The results show that all
teachers have different averages in these subjects and that the differences
in teacher performance in these subjects are significant.
To verify these results, we used ANOVA. The performance of the teachers in
Cal-I and ITC/ITP differs, and the differences between the teachers' averages
are significant. A teacher whose performance is better might be lenient in
giving grades to the students, and a teacher whose performance is
comparatively low might be strict in giving grades. The Z-score test
confirmed these results and showed significant results in Cal-I and ITC/ITP
for each teacher. Based on these results, we recommend teacher “E” as a very
suitable teacher for Cal-I and teacher “F” for ITC/ITP, as they can improve
the performance of students.
with their teachers. With the help of the recommender system, a university
department may be able to identify students' performance and help them to
improve it. To evaluate our prediction model and recommender system, we
answered the research questions. We made the following strong observations
during the course of our experiments.
Chapter 5
This thesis is organized into five chapters. The first chapter consists of
the introduction of the topic and the purpose of this research. The second
chapter contains the literature review, in which the related work of other
researchers is presented in an understandable manner. The third chapter
contains the methodology, and chapter 4 contains a detailed discussion of the
experiments and results.
In the first chapter, we briefly explained the background of our research
topic along with its scope, significance and applications. Moreover, we
discussed the research questions, which we constructed after critically
reviewing the literature. We explored various data mining techniques that
have been proposed and used by many researchers. The area of data mining is
quite versatile. Data mining means exploring the intrinsic information in
large datasets. These datasets might belong to educational institutions, the
internet or any organization. With the help of mining techniques such as
classification, we can analyze data for prediction and measure the accuracy.
Researchers have worked extensively on data mining in education, using
students' academic records for specific purposes, most commonly grade
prediction and the evaluation of students' and teachers' performance.
In this regard, we obtained existing factors from the literature, such as
institutional and demographic factors. These factors are called existing
because they have already been applied in western universities. We applied
these factors in our local context. We collected data from the Capital
University of Science and Technology, Pakistan. We used the data of the BSCS
program from six terms of spring, which covers the data of 3 years. In this
dataset, we used the data of first-semester students, and their final grade
was predicted based on their midterm marks, term marks, gender, age, Matric
grades and Inter grades. These attributes cover the demographic,
institutional and pre-university attributes of the students. In addition,
researchers have used only two factors: demographic and pre-university. In
our experiments, we considered another, institutional factor, in which
midterm and term marks are combined to predict the students' final GPA/grade.
This is how we proposed a prediction model on the basis of the attributes
associated with students.
RQ1. Are the existing identified factors for grade prediction valid in our
local context (Pakistan)?
The answer to this research question is yes, because we obtained considerable
accuracy from all data analysis measures. These analyses were applied to
categorical and nominal data. The results in both categories are the same,
which shows the validity of the existing factors in the local context. All
three factors contributed well to increasing the accuracy when the data were
supplied to the WEKA tool in ARFF format and the classification algorithms
SMO and linear regression were applied.
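For illustration, the sketch below shows how student records could be written in the ARFF text format that WEKA reads (a relation name, attribute declarations and a data section). The attribute names, nominal value sets and sample rows are hypothetical and only mirror the kinds of attributes discussed in this thesis; the helper `to_arff` is not part of WEKA.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

# Hypothetical attributes: institutional, demographic and pre-university.
attributes = [
    ("midterm", "numeric"),
    ("gender", "{M,F}"),
    ("matric_grade", "numeric"),
    ("inter_grade", "numeric"),
    ("final_grade", "{A,B,C,D,F}"),
]
rows = [(24, "F", 85, 78, "A"), (12, "M", 60, 55, "C")]

arff_text = to_arff("first_semester", attributes, rows)
print(arff_text)
```

Numeric attributes suit regression-style algorithms, while the nominal class attribute in the last column is what a classifier would be asked to predict.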
performance with the pairing of each teacher belonging to that particular
course. In this way, we were able to identify the teachers whose performance
is better than that of the other teachers of the same course. Then we
computed the overall average performance of each teacher based on the overall
average grade of the students whom they taught.
With the help of prediction experiments, we predicted grades both including
and excluding the name of the teacher who taught the course. Then the
predicted and actual results were computed and compared. In this way, we
obtained the entire set of results, in which our analysis shows that the
predicted results were close to the actual ones at some points and higher at
others. At the same time, the predicted results were at times slightly lower
than the actual ones, as can be seen in chapter 4.
REFERENCES
Azarcon Jr, D. E., Gallardo, C. D., Anacin, C. G., & Velasco, E. (2014).
Attrition and retention in higher education institution: A conjoint analysis
of consumer behavior in higher education. Asia Pacific Journal of
Education, Arts and SCIENCE, 1(5), 107-118.
Cabrera, A. F., Nora, A., & Castaneda, M. B. (1993). College persistence:
Structural equations modeling test of an integrated model of student
retention. The journal of higher education, 64(2), 123-139.
Elbadrawy, A., Polyzou, A., Ren, Z., Sweeney, M., Karypis, G., &
Rangwala, H. (2016). Predicting Student Performance Using Personalized
Analytics. Computer, 49(4), 61-69.
Hung, J. L., & Zhang, K. (2008). Revealing online learning behaviors and
activity patterns and making predictions with data mining techniques in
online teaching. MERLOT Journal of Online Learning and Teaching.
Inoue, S., Rodgers, P. A., Tennant, A., & Spencer, N. (2017). Reducing
Information to Stimulate Design Imagination. In Design Computing and
Cognition'16 (pp. 3-21). Springer, Cham.
Kabakchieva, D. (2013). Predicting student performance by using data
mining methods for classification. Cybernetics and Information
Technologies, 13(1), 61-72.
Kokina, J., Pachamanova, D., & Corbett, A. (2017). The role of data
visualization and analytics in performance management: Guiding
entrepreneurial growth decisions. Journal of Accounting Education.
Kovacic, Z. (2010). Early prediction of student success: Mining students'
enrolment data.
Merceron, A., & Yacef, K. (2005, May). Educational Data Mining: a Case
Study. In AIED (pp. 467-474).
Different Case Studies). Computer Engineering and Applications
Journal, 3(2), 79-88.
Polyzou, A., & Karypis, G. (2016). Grade prediction with models specific
to students and courses. International Journal of Data Science and
Analytics, 2(3-4), 159-171.
Ramist, L. (1981). College student attrition and retention.
Romero, C., & Ventura, S. (2007). Educational data mining: A survey from
1995 to 2005. Expert systems with applications, 33(1), 135-146.
Romero, C., & Ventura, S. (2010). Educational data mining: a review of the
state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part
C (Applications and Reviews), 40(6), 601-618.
Sachin, R. B., & Vijay, M. S. (2012, January). A survey and future vision
of data mining in educational field. In Advanced Computing &
Communication Technologies (ACCT), 2012 Second International
Conference on (pp. 96-100). IEEE.
Tewari, A. S., Saroj, A., & Barman, A. G. (2015). e-Learning
Recommender System for Teachers using Opinion Mining. In Information
Science and Applications (pp. 1021-1029). Springer Berlin Heidelberg.
Yang, D., Sinha, T., Adamson, D., & Rosé, C. P. (2013, December). Turn
on, tune in, drop out: Anticipating student dropouts in massive open online
courses. In Proceedings of the 2013 NIPS Data-driven education
workshop (Vol. 11, p. 14).
Zhang, Y., Oussena, S., Clark, T., & Hyensook, K. (2010). Using data
mining to improve student retention in HE: a case study.
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., &
Cunningham, S. J. (1999). Weka: Practical machine learning tools and
techniques with Java implementations.
Student Performance Prediction and
Teacher Recommender System
By
Muthara-Tul-Ain
Muthara-Tul-Ain
November, 2017
ABSTRACT
Mining data and extracting information from huge databases has become an
interesting research area. The idea of extracting information with the help
of data mining techniques emerged a couple of decades ago. Initially,
researchers applied classification and clustering techniques to partition a
dataset and analyze its intrinsic features. On the basis of such features,
they make reasonable predictions. In the field of educational data mining,
these predictions serve many purposes, such as predicting the performance of
students on the basis of factors associated with them and recommending
suitable courses and appropriate teachers to them. These purposes are derived
from the area of student retention and attrition, and our research aims to
achieve them in that context. Moreover, we have identified existing factors
that are beneficial for predicting the performance of students, recommending
the most suitable teachers to them and helping them to select courses. We
applied classification algorithms according to the nature of the data, which
were collected from the Capital University of Science and Technology (CUST),
Islamabad. In this study, we attempted to predict the GPA of the first
semester on the basis of midterm marks and previous academic grades. For the
second semester, we predicted the CGPA of the same students by using their
complete preceding academic record with the help of a hybrid approach. The
hybrid approach consists of a combination of factors that were evaluated
against our research questions. Moreover, we tried to improve students'
performance by recommending suitable courses and those teachers whose
performance is comparatively better than that of others. The data were
collected from CUST in order to validate the existing factors in the local
context (Pakistan). On the basis of classification algorithms such as Naïve
Bayes and J48, we were able to build the recommender system. Then the factors
that contributed well to validating the existing factors in the local context
were measured. Finally, appropriate teacher allocation was measured in two
ways: statistical and prediction-based. In the statistical experiments, the
average performance of teachers, the Z-test and the ANOVA test were applied.
In the prediction experiments, one subject's teacher with the other subject
teacher's name attribute, one subject's teacher without the other subject
teacher's name attribute, and the overall performance of teachers were
computed for each subject independently. With the help of our research, we
may have provided educational institutions with a way to reduce attrition and
increase the retention rate.
Table of Contents
Abstract
Introduction
1.1 Background Of Research
1.2 Problem Statement
1.3 Research Questions
1.4 Scope
1.5 Application Of Proposed Approach
1.6 Significance Of Solution
1.7 Organization Of Thesis
1.8 Definitions And Frequently Used Terms
Literature Review
2.1 Educational Data Mining
2.2 Student Performance Prediction
2.3 Commonly Used Approaches In EDM
2.5 Commonly Used Attributes For Student Performance Prediction
2.5 Literature Review Summary
Research Methodology
3.1 Data Collection And Pre-Processing
3.2 Classification
3.3 Algorithms And Techniques
3.4 Evaluation Of Research Questions
Results & Evaluations
4.2 Final Attribute Selection
4.3 Result Of Evaluation Of Research Questions
Conclusion And Future Work
5.1 Future Work
References
List of Tables
Table 2.1: Literature Review Summary
Table 3.2: Summary Of Session-Wise Attrition
Table 3.3: Summary Of Session-Wise Attrition
Table 3.4: Selection Of Attributes
Table 3.5: Attributes Selection To Dataset
Table 3.6: Pattern Of Combination Of Attributes
Table 4.1: Occurrence Of Missing Values
Table 4.2: Handling Of Missing Values
Table 4.3: Attributes Table
Table 4.4: Classification Of Attributes And Algorithms
Table 4.5: Ranked Attributes
Table 4.6: GPA Prediction For The First Semester
Table 4.7: CGPA Prediction For The Second Semester
Table 4.8: Pre-Qualification Vs Maths Courses
Table 4.9: Average Of Cal-II Teachers
Table 4.10: Average Of Cal-I Teachers
Table 4.11: Overall Average Of Maths Teachers
Table 4.12: ANOVA Analysis On Cal-I
Table 4.13: Two Samples For Mean With Respect To Teacher “E”
Table 4.14: Two Samples For Means With Respect To Teacher “B”
Table 4.15: Teacher B With Same Number Of Students
Table 4.16: Teacher B With Different Number Of Students
Table 4.17: Comparison Of All Programming Courses
Table 4.18: Comparison Of Actual And Predicted Grades Of ITC/ITP
Table 4.19: Comparison Of Actual And Predicted Performance Of OOP/CP Teachers
Table 4.20: Overall Performance Of OOP Teachers
List of Figures
Figure 3.1: Data Flow Diagram Of Proposed Methodology
Figure 4.1: Formats Of Data
Figure 4.2: Comparison Of Local And Existing Attributes
Figure 4.3: GPA Prediction For The 1st Semester
Figure 4.4: CGPA Prediction For The Second Semester
Figure 4.5: Pre-Qualification Vs Maths Courses
Figure 4.6: Average Of Cal-II Teachers
Figure 4.7: Overall Average Of Maths Teachers
Figure 4.8: Comparison Of ITC/ITP Teacher
Figure 4.9: Comparison Of Actual And Predicted Grades Of ITC/ITP
Figure 4.10: Comparison Of Actual And Predicted Performance Of OOP/CP Teachers
Figure 4.11: Overall Performance Of OOP Teachers
Figure 4.12: Overall Performance Of OOP Teachers
List of Abbreviations
CUST: Capital University of Science and Technology
Cal-I: Calculus-I
Cal-II: Calculus-II
SEM: Semester
Chapter 1
INTRODUCTION
Researchers have been working in the area of educational data mining for a
decade. Educational Data Mining (EDM) has become a broad field in which
researchers extract useful information from educational data for many
purposes. These purposes include identifying student attrition and retention
rates, predicting students’ performance, and building course and teacher
recommender systems. Within EDM, researchers explore the role of different
types of variables in measuring and predicting the performance of students.
These variables include the student’s age, academic record, and biographical
details. EDM is a growing research field that applies data mining techniques
to educational systems (Romero, C., & Ventura, S. 2007). In a previous study,
(Romero, C., & Ventura, S. 2013) explored how data mining approaches can
improve student outcomes. According to them, the large volumes of data held
by institutions have problems associated with them, so data mining approaches
should not be applied to this data directly. Instead, a knowledge discovery
process is followed, and data mining is applied to the datasets after
preprocessing. The most widely used data mining approaches include
classification, clustering, outlier detection, association rule mining,
pattern mining, and text mining. Researchers use these approaches to extract
useful, novel, and interesting information (Romero, C., & Ventura, S. 2010).
The EDM process converts raw educational data into useful information.
In the field of EDM, researchers are active in three broad categories:
personal recommender systems and learning environments, course management
systems, and student attrition and retention. Our research covers these areas
in different contexts: identifying students with low academic performance,
building a model that predicts students’ performance from their historical
data, raising students’ confidence, assisting them in choosing their courses,
and building a model that assigns appropriate teachers to students.
c) Student attrition and retention
Student attrition means the number of students who leave their courses
without completing them. The reasons may include a poor or immature choice of
courses, consistently undesirable results, poor performance, and financial
instability. Student retention means the number of students who complete
their courses and acquire a degree with good grades regardless of their
circumstances. In educational institutions, the rates of student attrition
and retention have reached a considerable level (Stallone, M. N. 2011).
Student attrition and retention affect not only the performance of the
educational institution or department but also faculty positions, and they
eventually create financial problems for parents. To analyze the pattern of
student retention and attrition, one has to find the underlying causes and
then take steps to resolve them. Our research topic falls under student
attrition and retention, which is explained in detail in the following
subsection. For this research, a dataset was collected from the Department of
Computer Science, Capital University of Science and Technology. First- and
second-semester courses are considered because the student attrition rate is
high in these semesters. Data from these semesters is therefore useful for
analyzing attrition and predicting student performance.
2016). EDM is a broad research field in which researchers are exploring many
problems, including student performance, attrition and retention, course
selection, and teacher selection. The most widely covered research problems
in educational data mining are future grade prediction, performance
prediction, and course enrollment recommender systems.
This research is based on several aspects. One aspect is to identify the
number of students who had to leave the institution because of poor academic
performance or financial problems. The number of students from the spring
semesters of three years who completed their degree has then been determined.
This research is expected to help students improve their results in the
courses selected for the experiments by recommending appropriate teachers and
courses based on their associated factors.
This helps students raise their grades and spares parents financial problems.
Another aspect under consideration is retaining the maximum percentage of
students in the university until they fulfill their degree requirements.
Ultimately, if students’ grades are good, their retention in the institution
becomes stronger. This research might reduce the attrition rate by providing
a teacher recommender system, a course recommender system, and student
performance prediction. The outcome of our research is discussed in detail as
follows.
are removed and then classification and clustering are applied. By using such
techniques, data can be further divided into clusters and classes to analyze
useful and hidden patterns from it (Kabakchieva, D, 2013).
With respect to student performance prediction, there are several cases, such
as predicting a student’s next-term grades and assessing present performance
based on assessments in current courses. To address this problem, researchers
have used student characteristics such as admission records, high school
scores, SAT/ACT scores, and grades of previously completed courses
(Elbadrawy, et al, 2016). Other characteristics include class tests, seminar
attendance and marks, and assignment marks (Baradwaj, B. K., & Pal, S. 2012).
The factors used in this research have already been applied in countries
other than Pakistan, namely Canada, the USA, England, and Nigeria. These
factors have been applied in the colleges and universities of those countries
and will now be validated in the context of Pakistan.
RQ1. Are the existing identified factors for grade prediction valid in our
local context (Pakistan)?
RQ2. Which factors help in accurate prediction of students’ first-semester
GPA?
RQ3. Is it possible to improve the performance of a student by allocating an
appropriate teacher for a subject?
1.4 Scope
The application of this research is versatile. The experiments have been
conducted on a dataset collected from Capital University of Science and
Technology, Islamabad (Pakistan). The same factors can be applied to the data
of government institutions as well as schools and colleges.
To make the university management system efficient
This research might be highly beneficial for the university management system
in keeping a check on the performance of students belonging to different
departments. This will not only help to predict and improve students’
performance but also support parents economically.
Moreover, the contribution of this research makes it possible to propose a
teacher recommender system that recommends suitable teachers to students by
considering their current performance in the semester and their relative
interest in the courses, which might improve their grades.
1.7 Organization of thesis
This thesis comprises five chapters. The first chapter introduces the
proposed research. The second chapter is the literature review, which
discusses the work and contributions of earlier researchers in the chosen
area and presents an overall summary of the literature. The third chapter
presents the methodology adopted for the experiments and explains how the
experiments are performed. The results of the experiments obtained with the
selected tools are presented in chapter four. The last chapter contains the
conclusion and future work, together with a summary of the whole thesis.
Performance Prediction
A student’s grade is used to indicate his or her performance in academia.
Researchers have used quantitative, scientific methods to predict student
performance (Bhardwaj, B. K, 2012) (Pal, S. 2012).
Recommender System
A recommender system is usually built to match a particular entity to the
entity that requires it. In the field of data mining, researchers have
proposed recommender systems such as teacher recommender systems and course
recommender systems (Bozo, J., 2010) (Alarcón, R., 2010) (Iribarra, S., 2010).
Chapter 2
LITERATURE REVIEW
The initiative of online course learning systems is based on e-commerce
ventures. With the growth of such ventures, data from web resources began to
be collected and stored in Excel, including customer, product, and order
information (Kokina, J, 2017). E-commerce refers to conducting business
online through the internet. Various websites work for this purpose, such as
eBay, Alibaba, and Amazon. It can be said that storing data from such
resources in Excel originated with e-commerce. Predicting students’
performance by using data mining techniques to extract information from the
academic datasets of universities has become state-of-the-art research in the
scientific community. Universities nowadays face challenges in analyzing the
performance of their students. That is why researchers are focusing on
students’ profiles and characteristics to make university management aware of
students’ performance and overall academic results (Kabakchieva, D, 2013).
Another dimension of student performance is the dependence of student
retention upon it. To minimize retention problems in universities, different
researchers have proposed different methods to predict students’ performance
in a future semester based on their performance in the previous one.
To predict next-term course grades, four parameters have been considered in
this study: admission records, high school scores, SAT/ACT scores, and grades
of previously completed courses. Based on these parameters, a recommender
system can be trained to predict students’ grades accurately in any
educational institution. Historical information about the course has also
been considered, such as which teacher taught the course and information
about the course contents. Many researchers have used LMS and Moore to
predict the chances of success and failure of students. In this research,
regression-based methods such as course specific regression (CSPR) and
personalized linear multi regression (PLMR) have been used. Matrix
factorization based methods, in particular standard matrix factorization
(MF), have also been used for the grade prediction of students (Elbadrawy, A,
et al, 2015).
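The matrix factorization idea mentioned above can be illustrated with a small sketch. This is not the implementation used in the cited work: it assumes a common biased factorization trained with stochastic gradient descent, and the student indices, course indices, grade values, function names, and hyperparameters are all hypothetical.

```python
import math
import random

def train_mf(grades, n_students, n_courses, k=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit grade ~ mu + b_s + b_c + p_s . q_c by stochastic gradient descent.

    grades: list of (student_index, course_index, grade) triples (hypothetical data).
    """
    rng = random.Random(seed)
    mu = sum(g for _, _, g in grades) / len(grades)      # global mean grade
    bs = [0.0] * n_students                              # per-student bias
    bc = [0.0] * n_courses                               # per-course bias
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_students)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_courses)]
    for _ in range(epochs):
        for s, c, g in grades:
            err = g - (mu + bs[s] + bc[c] + sum(P[s][f] * Q[c][f] for f in range(k)))
            bs[s] += lr * (err - reg * bs[s])
            bc[c] += lr * (err - reg * bc[c])
            for f in range(k):
                ps, qc = P[s][f], Q[c][f]
                P[s][f] += lr * (err * qc - reg * ps)
                Q[c][f] += lr * (err * ps - reg * qc)
    return mu, bs, bc, P, Q

def predict(model, s, c):
    """Predicted grade of student s in course c."""
    mu, bs, bc, P, Q = model
    return mu + bs[s] + bc[c] + sum(p * q for p, q in zip(P[s], Q[c]))

def rmse(model, grades):
    """Root mean squared error of the model on the given triples."""
    return math.sqrt(sum((g - predict(model, s, c)) ** 2 for s, c, g in grades) / len(grades))

# Hypothetical grade points (assumed 0-4 scale) for 4 students and 3 courses.
observed = [(0, 0, 3.7), (0, 1, 3.3), (1, 0, 2.0), (1, 2, 2.3),
            (2, 1, 3.0), (2, 2, 3.3), (3, 0, 1.7), (3, 1, 2.0)]
model = train_mf(observed, n_students=4, n_courses=3)
print(round(predict(model, 3, 2), 2))  # estimate for a student-course pair not yet observed
```

The student and course biases capture the fact that some students score higher overall and some courses are graded more strictly; the latent factors model the remaining student-course interaction.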
In this chapter, the background of educational data mining and its branches
is discussed in detail, along with student performance prediction and the
data mining approaches commonly used by researchers in the literature.
1 http://searchsqlserver.techtarget.com/definition/data-mining
2 http://www.educationaldatamining.org/
educational institutions stronger. Another purpose is to build the student’s
career by covering every aspect of it, such as improving weak grades,
boosting overall performance, supporting students financially, enabling them
to select the courses suggested by a course recommender system, and assigning
them appropriate teachers based on their interests in course selection.
However, these are some common areas in which researchers are working with
different data mining techniques. Our research covers the same areas using a
dataset of Pakistani students. Three factors are applied to the students;
these factors are discussed briefly in section 2.4. In the current section,
our focus remains on the key areas of EDM in which researchers are engaged.
educators to make their courses available on the internet. All the data on
Moodle is managed by the Moodle team. Moodle now enables educational
institutions, community colleges, and schools to create online teaching
systems (Dougiamas, M., 2003). Unlike traditional classrooms, it delivers
courses online. Course and student data is stored in its database, which
researchers then mine to extract useful information. The performance of
Moodle has risen in virtual environments. Other online course management
systems, such as web portals and learning management systems, serve the same
purpose nowadays.
With the use of the internet in classrooms and institutions, institutions
have launched their own websites or web pages through which students can
register for particular courses (Kaminski, J. 2005). Researchers have
published many articles in the area of online course learning. In this
domain, LMS and Moore have contributed a lot. They offer different courses on
their websites, and thousands of students from all over the world have
registered. In this way, large datasets are collected from LMS and Moore, and
researchers have analyzed these datasets to evaluate the performance of the
enrolled students (Kizilcec, R. F, 2017) (Pérez-Sanagustín, M., 2017)
(Maldonado, J. J. 2017). Self-regulated learning (SRL) is a term used by
these researchers. According to them, students with strong SRL are better at
planning, managing, and controlling their learning than students with weak
SRL. In this regard, MOOCs have been providing support for learners at
different levels of SRL.
institutions are looking for the factors that ultimately cause student
attrition (Azarcon Jr et al, 2014). After analyzing those factors, it is
important for educational institutions to make strategic adjustments to
improve student retention. In Gaviria, Colombia, social mobility is closely
linked to the academic performance of students, and the student attrition
rate in educational institutions may increase as a result. Researchers have
identified three drivers of student performance analytics. The first driver
is the volume of data collected from learning management systems and student
information systems, the second is e-learning, and the third is political
concerns (Guarín, C. E. L., 2015) (Guzmán, E. L., 2015) (González, F. A.
2015). The application of data mining in education has emerged in different
areas, and researchers are exploring these areas along various dimensions.
Another supplement to web-based student learning is the Massive Open Online
Course (MOOC). Researchers have identified a high student dropout rate in
online course learning and management systems. Yang highlighted this problem
in his research, showing that students drop out of MOOCs at a considerable
rate. To control student attrition from Coursera classes, the researchers
proposed a model that helps determine the influential factors causing student
dropout. The predictors proposed in this research capture factors related to
students’ behavior and their social position in the forum discussions in
which they participate (Yang, D. et al, 2013).
The problem of student attrition and retention is not new for educational
institutions. Researchers from the fields of data mining and information
visualization have highlighted it, and it has become a very common research
problem. Researchers took note of the problem when the attrition rate reached
50% in the colleges of Ontario (Drea, C. 2004). To reduce attrition rates,
institutions should focus on student retention. Researchers have analyzed the
factors that cause student attrition, and Drea addressed both elements,
student attrition and retention. Two theories of student persistence in
institutions have been formulated in this context: the student integration
model by Tinto and the student attrition model by Bean. Both models work in a
similar way (Cabrera, A. F. 1993).
There are numerous reasons why student attrition has become a research
problem. Among them, personal disappointments, financial setbacks, and the
lowering of career and life goals are considerable. Therefore, scientists
have studied retention and persistence to resolve this societal problem
(Ramist, L. 1981). This research focuses on student performance and also
examines whether teaching methodology has a positive impact on student grades
and reduces the student attrition rate.
students degrade to a remarkable level, the department management should take
note and find out the factors that affect the performance of students.
Considering this problem, researchers have evaluated their studies against
the parameters that affect the overall performance of students, data mining
techniques, and data mining tools (Kaur, G, 2016) (Singh, W, 2016). These
parameters include psychological, personal, and environmental factors. For
the experiments, they used the Naïve Bayes and J48 techniques in WEKA.
(Alarcón, R., 2010) (Iribarra, S., 2010). Online course enrollment systems
have spread globally because of the internet. Amid this change in technology,
several web pages and websites try to provide the best service they can but
still miss some elements. Researchers have therefore worked on this problem
and proposed solutions. To make every course engaging to learn, researchers
have focused on recommending teachers according to the course. The new
recommender system is known as A3. With the help of this recommender system,
not only will a suitable teacher be associated with the relevant course
contents, but the best contents of the course will also become available on
the site after some time (Tewari, A. S, 2015).
To improve the performance of weak students in class with the help of data
mining techniques, the collected data of 1100 students was transformed in
Weka. Researchers used freeware such as Weka, Clementine, and RapidMiner in
this work. Different classifiers were used, such as Naïve Bayes, C4.5, neural
networks, and random forest, together with the ensemble methods AdaBoost,
bagging, and boosting. It was found that some factors have the same effect in
both countries and some have different effects. Moreover, it was found that
male students suffer more from stress than female students. Male students
performed better in mathematics and formal sciences, whereas female students
performed better in literature and mnemonic sciences (Oskouei, R. J., 2014)
(Askari, M. 2014). In this field of research, a wide variety of benchmarks is
used to evaluate the performance and accuracy of experiments conducted with
machine learning approaches. Different researchers have used various types of
educational datasets, and each dataset is unique in its attributes
(Garcia-Saiz, D., 2011) (Zorrilla, M. E, 2011). Therefore, in that study the
researchers proposed a meta-algorithm to preprocess the dataset. Various data
mining models were studied to find the most accurate one with the help of the
meta-algorithm.
In the study, some factors that affect the grades of students in Iran and
India were observed. These factors, namely gender, family background,
parents’ education level, and lifestyle, were collected by asking the
students questions (Oskouei, R. J., & Askari, M, 2014). To evaluate the
performance of students and improve the management of educational
institutions, researchers have introduced pre-university characteristics of
students, namely the student’s profile, the place of secondary school, the
final secondary education score, the total admission score, and the score
achieved in the exams. In this research, data from the University of National
and World Economy (Bulgaria) was collected and data mining techniques were
applied. These techniques were Naïve Bayes and Bayes net, the nearest
neighbor algorithm, and the two rule learners OneR and JRip (Kabakchieva, D.
2013).
c) Pattern mining
As discussed earlier, knowledge has been gathered about online teaching and
learning systems. Through the internet, universities are now a major source
of student-teacher interaction, learning, and training. This technology has
found a place in educational data mining, where knowledge diffusion has
become a research focus for researchers in data mining, web mining, and graph
algorithms. Data from online teaching systems serves researchers who are keen
to extract knowledge from big datasets. Researchers analyze patterns in
students’ online learning behavior and use machine learning algorithms and
various data mining approaches to draw outcomes from these datasets. In one
study, researchers used 19,934 server logs to identify the behavior of
students in Taiwan who had registered for online courses, and they drew
conclusions from this data after predicting the students’ performance as well
(Hung, J. L., & Zhang, K. 2008). In addition, data mining techniques proved
helpful for course developers, online trainers, and instructional designers.
In their study, they used the WEKA and KNIME tools to perform descriptive and
artificial intelligence analysis. Moreover, for data visualization and
statistical analysis, they used
of the mentioned approaches ultimately serve to represent the results after
being applied to the data. The concept of data visualization has been derived
from visual reasoning (Inoue, S et al, 2017).
d) Text mining
In the past decade, a number of tools such as WEKA, RapidMiner, R, KEEL, and
SNAPP have been used to extract text (useful information) from datasets in
the field of educational data mining (Baker, R. S., & Inventado, P. S. 2014).
In other research, data mining is also known as knowledge discovery from
databases (Anand, S. S. et al, 1996). In this process, database techniques
are combined with mathematical and artificial intelligence techniques.
While reviewing papers from the literature, we found many data mining
approaches that have been applied to the academic datasets of different
educational institutions for various purposes. These approaches have been
applied to attributes selected after analyzing the factors. Those factors are
briefly discussed in the following passage.
I. Demographic attributes
general weighted average, School’s Radial Distance, and school ownership are
taken (Abaya, S. A., 2013) (Gerardo, B. D., 2013)
Another factor found during the literature survey is pre-university
attributes. These attributes have a considerable impact on a student’s
performance in any educational institution. Pre-university attributes are
useful for mapping a student’s interests in a course recommendation system.
They include Secondary School Grade, Higher Secondary Grade, SAT Score,
Pre-college, Pre-board, and Pre-program. Researchers have also used the
historical data of students to find the attrition rate through
cross-validation with classification and Naïve Bayes methods (Guarín, C. E.
L., 2015) (Guzmán, E. L., 2015) (González, F. A. 2015).
repeat. Moreover, student’s academic record can be improved and efficiency of
department management can be raised.
Reference | Techniques | Tools | Used attributes (Pre-university / Demographic / Institutional)
(Guarín, C. E. L., Guzmán, E. L., González, F. A. 2015) | Decision Trees, Bayesian model, Classification model | 10 fold cross validation, Cost sensitive | √
(Abaya, S. A. & Gerardo, B. D., 2013) | Classification through C4.5 | Irecruit Application in UNIX | √
(Kovacic, Z. 2010) | Clustering, Feature selection through cross validation, and CART classification | CHAID Model, Gain chart | √
(Kabakchieva, D. 2013) | CRISP-DM, Neural networking, Nearest neighbor classifier, Rule learner, Decision tree classifier | WEKA | √
(Al-Barrak, M. A., & Al-Razgan, M. 2016) | Data Visualization, Classification | E-learning Web Miner, WEKA | √
(Baradwaj, B. K., & Pal, S. 2012) | Classification and Decision trees | Manual | √
(Elbadrawy et al, 2016) | Matrix factorization based methods, regression based methods | Recommender system based personal analytics | √
(Bydžovská, H. 2013) | EDM, SNA and Collaborative filtering | WEKA and R | √
(Kaur, G., & Singh, W. 2016) | Naïve Bayes and J48 Decision trees | WEKA | √
(Kokina, J., Pachamanova, D., & Corbett, A., 2017) | Predictive Modeling, Data Visualization | Excel and Tableau | √
Chapter 3
RESEARCH METHODOLOGY
This chapter presents the methodology adopted to address the research
questions presented in chapter 1. The focus of the research is accurate grade
prediction for critical first-semester courses of BS CS students at CUST.
Prediction will help in taking appropriate measures to control the attrition
rate and will hence be beneficial for students, parents, and the university.
With the same objective, appropriate faculty members are also recommended
based on the results of the previous semester. The factors under
consideration are classified into three categories: demographic,
pre-university, and institutional. Such research in the field of educational
data mining has been conducted abroad earlier; here the experiments are
conducted in the local context, Pakistan. For this purpose, all the data was
collected from Capital University of Science and Technology, Islamabad,
Pakistan. After data collection and pre-processing, data mining techniques
such as SVM, linear regression, and non-linear regression were selected in
WEKA.
This study will be useful not only in improving the overall performance of
students but also in reducing the attrition rate, which becomes feasible on
the basis of early prediction. The field of data mining has emerged with
links to natural language processing, artificial intelligence, visual data
analytics, and social data analysis (Romero, C., & Ventura, S. 2013) (Zhang,
Y. et al, 2010). The early prediction is computed from the term marks and
mid-term marks of the students, along with their pre-university factors
(Intermediate/O-levels marks and Matriculation marks) and demographic factors
(gender and city). The methodology is discussed in the current chapter, and
the related work has been discussed in detail in chapter 2. Chapter 3
contains a detailed discussion of the proposed methodology along with the
data flow diagram given in Figure 3.1.
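To make the early-prediction step concrete, the following sketch fits an ordinary least squares model that maps marks to a GPA. It is an illustration only and not the WEKA configuration used in the experiments: the function names, the two chosen features, and all mark and GPA values are hypothetical, and the regression is solved with plain normal equations.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                         # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_linear_regression(rows, targets):
    """Return [intercept, w1, w2, ...] that minimize the squared error."""
    X = [[1.0] + list(r) for r in rows]                  # prepend intercept column
    d = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(d)]
    return solve_linear_system(XtX, Xty)

def predict_gpa(weights, features):
    """Apply the fitted weights to one feature vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

# Hypothetical records: (mid-term marks, term marks) -> first-semester GPA.
marks = [(10, 15), (12, 18), (15, 14), (18, 20), (8, 10)]
gpa = [2.25, 2.60, 2.70, 3.30, 1.80]
weights = fit_linear_regression(marks, gpa)
print(round(predict_gpa(weights, (14, 16)), 2))
```

Pre-university marks could be added as further numeric columns in the same way; nominal attributes such as gender or city would first need a numeric encoding, which this sketch leaves out.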
Figure 3.1: Data Flow Diagram of Proposed Methodology
To conduct the experiments, the data was first collected from CUST (Capital
University of Science and Technology), Islamabad, Pakistan. The dataset
contains six semesters of data from the BSCS program. On the basis of the
adopted factors, the GPA of the 1st semester and the CGPA of the 2nd semester
have been predicted. In addition, the performance of teachers in their
respective courses has been measured through ANOVA. The experiments were
conducted in WEKA. Each research question has been answered with the help of
the proposed methodology and experiments.
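The ANOVA comparison of teachers mentioned above can be sketched as follows. This is a minimal illustration of the one-way ANOVA F statistic, assuming that students' grade points are grouped by the teacher of a course; the teacher labels and grade values are hypothetical and are not taken from the CUST dataset.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical grade points of students taught by three teachers of one course.
grades_by_teacher = {
    "Teacher A": [1.0, 1.5, 2.0],
    "Teacher B": [1.5, 2.0, 2.5],
    "Teacher C": [3.0, 3.5, 4.0],
}
f_stat = one_way_anova_f(list(grades_by_teacher.values()))
print(f_stat)
```

A large F value indicates that the differences between the teachers' group means are large relative to the variation among students of the same teacher; judging significance additionally requires comparing F against the F distribution with the two degrees of freedom.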
Spring 2014 74
Spring 2016 62
Spring 2017 58
Total 667
However, according to earlier research (Junaid. 2017), the most crucial
courses with respect to attrition or students’ performance are ITP/CP and
Cal-I. Therefore, only the results of these two courses are considered, and
the remaining three are excluded.
The collected data relates to first-semester students of seven semesters,
from Spring 2014 (Term 141) to Spring 2017 (Term 171), and to second-semester
students of six semesters, from Fall 2014 (Term 143) to Spring 2017 (Term
173). As mentioned earlier, the first-semester courses are Cal-I and ITC/CP,
and the second-semester courses are Cal-II and CP/OOP, taken from each new
spring intake that belongs to semester no. 1. The aim of this research is to
identify the factors that help keep students retained until the end of their
semesters, as well as the factors that affect the performance of the
students. Table 3.2 shows the information of the students comprehensively.
Term | 1st Sem | 2nd Sem | 3rd Sem | 4th Sem | 5th Sem | 6th Sem | 7th Sem | 8th Sem
161 | 94% | 6% | 0% | 0% | 0% | 0% | 0% | 0%
163 | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0%
The dataset contains institutional factors (registration number, name, GPA,
and internal marks such as mid-term and term-work marks) and overall
information about the students, such as registered courses and Matriculation
and Intermediate results. To conduct the experiments, we have used
demographic and pre-university factors of the students in addition to
institutional factors. The dataset includes the data of terms 141, 143, 151,
153, 161, 163, and 171. The terms of the spring semesters are associated with
the registration numbers of the students; for example, term 141 means Spring
2014. Similarly, the data of Spring 2015 and Spring 2016 was collected,
giving a three-year record of the students. The dataset file is converted
into CSV and ARFF format and imported into WEKA for pre-processing, after
which the experiments are conducted.
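The term numbering described above can be decoded mechanically. The sketch below assumes that the first two digits give the year and the last digit gives the season, with 1 for Spring and 3 for Fall as in the terms named in this chapter; the function name and the restriction to years in the 2000s are assumptions made for illustration.

```python
# Season digits as they appear in the terms named in the text (141 = Spring, 143 = Fall).
SEASON_BY_DIGIT = {"1": "Spring", "3": "Fall"}

def decode_term(term_code):
    """Split a term code such as 141 into (season, year)."""
    code = str(term_code)
    if len(code) != 3 or not code.isdigit() or code[2] not in SEASON_BY_DIGIT:
        raise ValueError(f"unrecognized term code: {term_code!r}")
    year = 2000 + int(code[:2])      # assumption: all terms fall in the 2000s
    return SEASON_BY_DIGIT[code[2]], year

for term in (141, 143, 151, 153, 161, 163, 171):
    print(term, decode_term(term))
```

Decoding the term once during pre-processing makes it straightforward to group records by intake before they are exported to CSV or ARFF.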
3 https://measuringu.com/handle-missing-data/
Noisy Data Handling
Irrelevant data encountered during data collection is called noise. Because
the data of BSCS students from Spring 2014 to Spring 2016 was collected,
records belonging to Software Engineering and Bioinformatics were also
present and were eliminated through filters in WEKA. Noisy data also occurred
in the form of errors such as GPA = “−0.4” and Intermediate or Matriculation
marks = “-455, 544.8”, which were handled during pre-processing of the data.
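The kind of noise handling described above can be sketched as a pair of simple checks. The thesis performed this step with WEKA filters; the code below is only an illustration in which the field names, the program code "BSSE", the 4.0 GPA scale, and the maximum of 1100 marks are assumptions. It flags range violations only, so whether a fractional value such as 544.8 counts as noise depends on rules that are not stated here.

```python
GPA_RANGE = (0.0, 4.0)        # assumed 4.0 grading scale
MAX_MARKS = 1100              # hypothetical total for Matriculation/Intermediate marks

def is_noisy(record):
    """Return True when a numeric field lies outside its plausible range."""
    gpa_ok = GPA_RANGE[0] <= record["gpa"] <= GPA_RANGE[1]
    marks_ok = all(0 <= record[key] <= MAX_MARKS
                   for key in ("matric_marks", "inter_marks"))
    return not (gpa_ok and marks_ok)

def keep_program(records, program="BSCS"):
    """Drop records of other programs (e.g. Software Engineering, Bioinformatics)."""
    return [r for r in records if r["program"] == program]

# Hypothetical records; the second and third mimic the errors quoted in the text.
records = [
    {"program": "BSCS", "gpa": 3.1, "matric_marks": 890, "inter_marks": 810},
    {"program": "BSCS", "gpa": -0.4, "matric_marks": 850, "inter_marks": 790},
    {"program": "BSCS", "gpa": 2.7, "matric_marks": -455, "inter_marks": 760},
    {"program": "BSSE", "gpa": 3.4, "matric_marks": 900, "inter_marks": 870},
]
clean = [r for r in keep_program(records) if not is_noisy(r)]
print(len(clean))
```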
Attribute Selection
After the missing data handling and outlier detection phases, the final
attributes on which the overall experiments and results depend were selected.
These attributes, listed in table 3.1, belong to all three types of factors:
demographic, pre-university, and institutional. One more factor, teacher
performance, has been considered to answer our research question. After data
pre-processing was complete, classification algorithms were applied. The
attributes were then finalized for the experiments that answer the research
questions. In the current section, we discuss the adopted classifiers in
detail.
3.2 Classification
Before applying the algorithms, the factors used earlier by foreign
researchers were grouped. These factors were compared against the factors
used by researchers in the local context (Pakistan). Moreover, we extended
our research by forming different combinations of attributes and comparing
them. Comprehensive details of these factors are given in table 3.4.
Table 3.3: Selection of Attributes
Sr.No.  Factors         Attribute Name
1       DEMOGRAPHIC     Gender
2                       Age
3                       Residence
4                       Location
5                       Race
6                       Father’s qualification
7                       Father’s occupation
8       PRE-COLLEGE     Secondary school grade
9                       Higher secondary grade
10                      Pre-college
11                      Pre-program
12                      SAT score
13      INSTITUTIONAL   GPA (Term1, Term2, Term3, Term4)
14                      CGPA
15                      Financial status
16                      Total credit hours taken
17                      Total courses taken
18                      Initial Major
19                      Current Major
20                      Current enrollment status
21      TEACHER         Teacher methodology
22                      Instructor grading
23                      Instructor feedback
24                      Instructor teaching methodology
25                      Teacher name
3.3 Algorithms and Techniques
To evaluate our research, two classification algorithms, Naïve Bayes and J48,
have been compared. These classifiers were chosen because they support
categorical and nominal data. For this purpose the WEKA 3.8 tool has been
used. The tool is open source, available on the web, and is generally used for
machine learning algorithms, classification, and training and testing of data.
The data can then also be visualized in the form of graphs. In the present
research, the formed combinations of attributes have been evaluated to answer
our research questions. For this purpose the data has been converted into CSV
format for import into WEKA, and the results from various filters and
classifiers have been accumulated.
a) Naïve Bayes
The Naïve Bayes algorithm is comparatively fast in terms of classification. It
works quickly on large datasets because it applies Bayes’ theorem of
probability under the assumption that the features are independent of each
other given the class. Bayes’ theorem is generally used to predict the class
of unseen data4. The algorithm labels an item whose features are known but
whose class is unknown. For example, a fruit may be labeled as an apple if it
is round, red in color, and about 3 inches in diameter. Each of these features
raises the probability that the fruit is an apple.
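The fruit example can be made concrete with a small sketch. The code below is
a minimal categorical Naïve Bayes written from scratch with add-one (Laplace)
smoothing; it is not the WEKA implementation used in this work, and the
training rows are made-up toy data.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> values
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the label with the highest smoothed posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # P(value | class) with add-one smoothing
            score *= (count + 1) / (n_label + len(values_seen[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: (shape, color, size)
rows = [
    ("round", "red", "3in"), ("round", "red", "3in"),
    ("round", "green", "3in"), ("long", "yellow", "7in"),
    ("long", "yellow", "6in"), ("round", "orange", "3in"),
]
labels = ["apple", "apple", "apple", "banana", "banana", "orange"]
model = train_naive_bayes(rows, labels)
fruit = predict(model, ("round", "red", "3in"))
```

Each feature contributes one multiplicative factor to the score of a class,
which is the independence assumption that gives the algorithm its name.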
b) J48
The J48 decision tree (WEKA’s implementation of the C4.5 algorithm) is used to
predict the target variable of new data. If a dataset contains predictors
(independent variables) and a target (dependent variable), the algorithm
learns a tree from the predictors and applies it to predict the target
variable of new records5.
c) Linear Regression
4
https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
5
http://data-mining.business-intelligence.uoc.edu/home/j48-decision-tree
The second main thing that linear regression does is identify the variables
that are significant predictors of the dependent variable. Finally, the
regression equation is used to determine the set of predictors that predict
the outcome. In this research, these algorithms are used to compare the trends
and patterns of the factors with other approaches such as non-linear
regression and SMO.
RQ1: Whether existing identified factors for grade prediction are valid in
our local context (Pakistan)?
Research question 1 examines whether the factors that researchers have studied
in foreign contexts are also valid in the local context. The main reason for
distinguishing the local context from the foreign one is that norms,
traditions, and culture vary from country to country. Therefore, it needs to
be evaluated whether the existing factors that work in the colleges and
universities of Canada, the USA, England, Nigeria, etc. are applicable in
Pakistan. After pre-processing of the whole data, classification algorithms
were applied to analyze the behavior of demographic, pre-university, and
institutional factors. According to the results, these factors are also valid
in Pakistan because the accuracies differ only slightly. Detailed results are
discussed in chapter 4.
For our research, data was collected from term 141 to term 171 and covers 667
students. Three types of existing attributes were taken to predict the
performance of students. The first type is demographic: Gender, Age,
Residence, Location, Race, and Father’s Qualification; of these, we have
considered only the Gender and Age of the student. The second type is
institutional: GPA (Term1, Term2, Term3, Term4), CGPA, Financial status, Total
credit hours taken, Total courses taken, Initial Major, Current Major, Current
enrollment status, and Teacher methodology; of these, Midterm marks and Term
marks were considered for the prediction of the final CGPA.
During pre-processing, we identified the students who could not continue their
program and left the course incomplete: 7 of the 667 students had left, so 660
remained after pre-processing. Two types of errors occurred in this dataset,
missing data and incomplete data, and both were corrected in the
pre-processing phase.
Table 3.5 shows the attributes selected from the dataset for the experiments.
[Table 3.5 column headers: Sr. No., Gender, City, Matric, F.Sc, Mid Eng-1,
Mid Cal-1, Mid ITC, Mid Physics, Studies, GPA]
RQ2: Which factors help in accurate prediction of student’s GPA of first
semester?
Specific attributes from all three factors (demographic, institutional, and
pre-university) have been selected. Combinations of those attributes were
constructed, and filters were then applied to each. In total there are 21
unique combinations of attributes, presented in table 3.6.
Sr. No.  ATTRIBUTES
1        Demographic, Pre-Qualification, Institutional (Result all subject of 1st semester)
2        Demographic, Pre-Qualification, Institutional (Result Eng-1, Cal-1, ITC subject of 1st semester)
3        Demographic, Pre-Qualification, Institutional (Result Eng-1, Cal-1 subject of 1st semester)
4        Demographic, Pre-Qualification, Institutional (Result Eng-1 subject of 1st semester)
5        Demographic, Pre-Qualification, Institutional (Cal-1 subject of 1st semester)
6        Demographic, Pre-Qualification, Institutional (Result ITC subject of 1st semester)
7        Demographic, Pre-Qualification, Institutional (Result Cal-1, ITC subject of 1st semester)
8        Demographic, Pre-Qualification, Institutional (Result Eng-1, ITC subject of 1st semester)
9        Demographic, Pre-Qualification, Institutional (GPA)
10       Demographic, Pre-Qualification
11       Demographic
12       Pre-Qualification
13       Pre-Qualification, Institutional (Result all subject of 1st semester)
14       Demographic, Institutional (Result all subject of 1st semester)
15       Institutional (Result Eng-1, Cal-1, ITC subject of 1st semester)
16       Institutional (Result Eng-1 subject of 1st semester)
17       Institutional (Result Cal-1 subject of 1st semester)
18       Institutional (Result ITC subject of 1st semester)
19       Institutional (Result Eng-1, Cal-1 subject of 1st semester)
20       Institutional (Result Cal-1, ITC subject of 1st semester)
21       Institutional (Result Eng-1, ITC subject of 1st semester)
RQ3. Is it possible to improve the performance of a student by
allocating an appropriate teacher for a subject?
It is commonly observed that the personality of a teacher affects the
performance of the students. If the background of the teacher is already known
with the help of a prediction system, it may become feasible for the
department to recommend the most suitable teacher to the student according to
his level of interest. This can boost the performance of the student. To
answer research question 3, experiments have been performed using two
approaches: statistical experiments and predictive experiments.
case of experiments6.
c) Z-score data analysis
Unlike ANOVA (analysis of variance), which is usually applied to three or more
means, a Z-test compares two means. A Z-test is a type of hypothesis test,
that is, a way to find out whether the results obtained from a test are valid
or need to be repeated. For example, if someone claimed to have found a new
drug to cure cancer, one would want to be sure that the claim was probably
true. Similarly, in our research, we have applied the Z-test to compare the
performance of teachers with respect to their subject. The Z-test assesses the
likelihood that the obtained results are true. A Z-test is generally used when
the data is approximately normally distributed and is compared in pairs7.
6
http://www.statisticssolutions.com/manova-analysis-anova/
7
http://www.statisticshowto.com/z-test/
Chapter 4
Data collection and pre-processing are the initial steps of the analysis. For
this purpose, the demographic data, which contains each student’s gender and
city, has been collected from the registrar office of CUST. With the
co-operation of the administrative resources of CUST, the data related to
pre-university and institutional factors has also been collected. The
institutional and pre-university factors contain mid-term marks, term marks,
GPA, intermediate marks, and matriculation marks. The university portal
provided access to this student data so that a sufficient dataset could be
gathered for our experiments.
The gathered data came in three formats: PDF, Word, and Excel. For
pre-processing, the dataset was first brought into a single Excel file.
Figure 4.1 represents the formats of the files and the transformation of all
files into a single one. The final version of the dataset in Excel is then
converted into CSV format and imported into WEKA, where filters are applied
for pre-processing. The filters that have been applied are discussed in the
subsections of section 4.1.
[Table: sample of the collected records. Column headers: Sr. No., Gender,
Teacher, ITC, Mid Term, Term Work, Weighted, Total. Surviving row fragment:
4, Male]
Data mining methods differ in the way they treat missing values. Typically,
they ignore the missing values, exclude the records that contain them, replace
them with the mean, or infer them from existing values. Missing-value
replacement policies therefore include a number of strategies, such as
ignoring the records with missing values. The following example shows the
occurrence of missing values in the dataset. In our case, each missing value
was filled with the average of the last three records of the student, which
yields a complete dataset. This dataset has then been used in the experiments
to evaluate all research questions.
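The replacement policy described above can be sketched as follows. This is a
minimal illustration under an assumption: we read “last three records” as the
three most recent known values that precede the gap in the same student’s
sequence of marks. The function name and the sample values are hypothetical.

```python
def fill_with_recent_mean(values, window=3):
    """Replace each None with the mean of up to `window` preceding values.

    Values filled earlier in the sequence count as known for later gaps.
    A gap with no earlier value is left as None.
    """
    filled = []
    for value in values:
        if value is None:
            known = [v for v in filled if v is not None]
            recent = known[-window:]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


# Hypothetical marks of one student; the last entry is missing.
marks = fill_with_recent_mean([60, 70, 80, None])
```

Here the missing fourth value becomes the mean of 60, 70, and 80.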
[Table: example records with missing values. Column headers: Sr. No., Gender,
Teacher, ITC, Mid Term, Term Work, Final Exam, Weighted, Total]
Attribute Selection
Sampling
Data sampling is a statistical analysis technique used to select, manipulate,
and analyze a representative subset of data points in order to identify
patterns and trends in the larger data set being examined. In our case, the
data of terms 141, 143, 151, 153, 161, and 163 has been used as training data
and term 171 as test data, because the results of the students in that term
are to be predicted.
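The term-based split can be expressed in a few lines. The sketch below is
illustrative only; the record layout and the key name "term" are assumptions,
while the term codes are those stated above.

```python
TRAIN_TERMS = {141, 143, 151, 153, 161, 163}
TEST_TERMS = {171}


def split_by_term(records, term_key="term"):
    """Split records into training and test sets by their term code."""
    train = [r for r in records if r[term_key] in TRAIN_TERMS]
    test = [r for r in records if r[term_key] in TEST_TERMS]
    return train, test


# Hypothetical records; only the term codes come from the text.
records = [{"term": 141}, {"term": 163}, {"term": 171}, {"term": 151}]
train, test = split_by_term(records)
```

Splitting by term rather than at random keeps the evaluation close to the
intended use, in which past terms are used to predict a later one.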
RQ1: Whether existing identified factors for grade prediction are valid in
our local context (Pakistan)?
As the existing factors and their attributes have been stated in the sections
above, we have formed combinations of local factors with the existing
factors. We found a slight variation between both types
[Table: input attributes. Column headers: Sr. No., Gender, City, Matric, FSc,
Mid Eng-1, Mid Cal-1, Mid ITC, Mid Physics, Mid Pak-Studies, GPA]
As input, we imported into WEKA the .CSV file that contains the value of each
attribute for every student. The input data covers the students of the BSCS
department from term 141 to term 163. The dataset of CUST (Capital University
of Science and Technology) has been used for this purpose. Among the factors,
attributes such as Previous Course Marks/Grade, GPA, SSG, HSSG, Gender, and
City have been used. Afterwards, we applied the well-known classification
algorithms Naïve Bayes and J48 to the dataset.
                                  Student Demographic,            CGPA Grade   69.12%   72.39%
                                  Pre-Qualification, 1st
                                  semester 5 Subjects Grade
Data mining approach for          Student Demographic, High       CGPA Grade   76%      73%
predicting student performance    School Background,
(2012)                            Scholarship, Social network
                                  interaction
                                  Student Demographic,            CGPA Grade   49.90%   63.19%
                                  Pre-Qualification
Predicting Student Performance:   Student Demographic, High       CGPA Grade   50%      65%
A Statistical and Data Mining     School Background
Approach (2013)
results. The inferences show slight variations, which is notable with respect
to our research. In the summary table, the existing attributes are shown in
black and the local attributes in red.
The summarized results are shown in figure 4.2, in which the bars of different
colors on the x-axis show the attribute sets and the y-axis shows the
accuracy.
Our findings indicate that the existing identified factors for grade
prediction are comparable with the local factors because of the low variation
in their accuracies. According to the results, Previous Grade, Internal Marks,
and HSSC were found to be effective for grade prediction.
Our second research question concerns the extraction of those factors that
help in accurate prediction of the GPA of 1st-semester students. To answer it,
we have split the question into the following two parts.
1. Based on Mid term marks
Ranked Attributes
0.2277 Cal-I MID
0.1903 Cal-II MID
0.1778 ITC MID
0.1684 Physics MID
0.1548 Pak-Studies MID
0.1514 FSC
0.1006 City
0.0992 Matric
0.0791 Gender
According to the results, CAL-I MID and CAL-II MID have the highest scores.
The scores then gradually decrease from ITC MID down to GENDER. Based on these
scores, we can infer the relative importance of the attributes from their
information-gain ranking over all the instances.
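To show what such a ranking score measures, the sketch below computes the
information gain of a nominal attribute with respect to a class label. It is
an illustration written from the standard definition (entropy of the class
minus the weighted entropy after splitting on the attribute), not the WEKA
evaluator that produced the ranking above; numeric attributes such as marks
would first have to be discretized, and the sample values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy left after the split."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up example: four students with grade classes A or B.
grades = ["A", "A", "B", "B"]
gain_mid = information_gain(["high", "high", "low", "low"], grades)
gain_gender = information_gain(["m", "f", "m", "f"], grades)
```

In this toy case the mid-term attribute separates the grades perfectly and
receives the maximum gain, while the gender attribute carries no information
about the grade and receives a gain of zero.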
1- First Semester GPA Prediction
We have expanded our answer to research question 2. With the “GPA Prediction
for the first semester” experiments, we were able to evaluate our results
through WEKA. We covered the attribute combinations listed in table 4.6, in
which each combination is defined in column 1. There are four combinations of
attributes, which belong to the three factors: Pre-Qualification, Demographic,
and Institutional. The performance of J48 is considerably better than that of
Naïve Bayes, as shown in the table and presented in figure 4.3.
Attributes                                   Class Label   Naïve Bayes   J48
Demographic, Pre-Qualification,              GPA Grade     66.46%        79.55%
Institutional (Internal marks only
Midterm all subject of 1st semester)
Demographic, Pre-Qualification,              GPA Grade     69.33%        78.12%
Institutional (Internal marks only
Midterm Eng-1, Cal-1, ITC subject of
1st semester)
Figure 4.3: GPA Prediction for the 1st semester
2- CGPA of second Semester
To predict the CGPA of the 2nd semester, we have defined the set of
experiments summarized in table 4.7. We have merged all the grades of the
previous semester with the adopted factors. These combinations and their
accuracies under Naïve Bayes and J48 have been presented in table 4.4.
Attributes                                   Class Label   Naïve Bayes   J48
Demographic, Pre-Qualification,              CGPA Grade    67.08%        74.44%
Institutional (Subject Grade Cal-1
subject of 1st semester)
Demographic, Pre-Qualification,              CGPA Grade    68.10%        74.44%
Institutional (Subject Grade ITC
subject of 1st semester)
These tabular results have been presented in figure 4.4. As for the
1st-semester GPA prediction, we have used the same classifiers for the CGPA
prediction of the 2nd semester; the only change lies in the combinations of
attributes. Comparing the accuracy of each combination, we found that the
accuracy of Demographic, Pre-Qualification, Institutional (Subject Grade
Cal-1, ITC subject of 1st semester) was high, whereas when the J48 classifier
was applied, the predictions for Demographic, Pre-Qualification, Institutional
(Subject Grade Cal-1 subject of 1st semester) and Demographic,
Pre-Qualification, Institutional (Subject Grade ITC subject of 1st semester)
were better, by a minor difference, than those for the other two
combinations, i.e., Demographic, Pre-Qualification, Institutional (Subject
Grade Cal-1, ITC subject of 1st semester) and Demographic, Pre-Qualification,
Institutional (Subject Grade Eng-1, Cal-1 subject of 1st semester).
It is then found that the first-semester subjects, i.e., Cal-1, Eng-1, and
ITC, have high information gain and play an effective role in predicting the
grades of students accurately.
The dataset shows that the attrition rate is high in the first semester. Since
the purpose of the research is to reduce the attrition rate, we have
considered these subjects to predict the attrition of students as well as
their performance.
The results state that demographic attributes do not play a significant role
in the grade prediction of students.
In addition, internal marks such as mid-term marks allow early grade
prediction with high accuracy, which also helps to minimize the attrition
rate.
The experiments have been conducted subject-wise for individual course grade
prediction. On the basis of course grades, the accumulated teacher performance
has been computed. Furthermore, the following calculations have been performed
for the Cal-I teachers:
The Maths courses and Pre-qualification have been compared in this phase of
the experiment. The teachers have been labeled alphabetically, and the overall
average of the students is given against every teacher in table 4.8.
E 58.96428571 73 72.75
The results show that the performance of teachers “B”, “D”, and “E” is better
on the Pre-qualification combination. With respect to Cal-I, the performance
of teachers “B”, “G”, and “H” remained better. On the Cal-II attribute, the
performance of teachers “B”, “C”, and “D” was found to be better than that of
the other teachers. These results have been shown graphically in figure 4.5.
Figure 4.5: Pre-Qualification vs Maths Courses
Average of Cal-II Teachers
In this case, we have predicted the performance of the teachers who teach
Cal-II and compared it with their actual performance in Cal-II. According to
the results, we have found significantly unexpected outcomes. There is a minor
variation between actual Cal-II and predicted Cal-II. The results have been
given in table 4.9 and presented in figure 4.6.
Figure 4.6: Average of Cal-II teachers
Average of Cal-I teachers
The average of Cal-I teachers has been compared across the actual and
predicted values. Another attribute, “Pre-qualification”, has also been used
in this case. The obtained results have been given in table 4.10 and presented
in figure 4.7.
Figure 4.7: Average of Cal-I teachers
Figure 4.8: Overall Average of Maths Teachers
After computing the results of Cal-I and Cal-II, we were able to identify
teacher “E”, whose performance was found to be better and who can be
recommended to teach Cal-I and Cal-II. The results of the other subjects
taught in the 1st semester will be computed similarly.
First, we computed the performance of every teacher of each subject taught in
the first and second semesters. The results have been interpreted for every
subject with respect to its teacher. They show the average of all students in
Cal-I and Cal-II against the name of the Cal-I teacher. According to these
results, teacher “E” performs best and can be recommended to teach Cal-I and
Cal-II, followed by teachers “B” and “H”, who perform better than the
remaining teachers.
significant. To verify these results, we applied the ANOVA (Analysis of
Variance) test to the teachers’ averages. In this way, the results indicate
the performance of all Cal-I teachers on a group basis.
SUMMARY
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        872.7731   7    124.6819   9.273853   3.14E-06   2.312741
Total                 1302.996   39
In this case F > F crit, since 9.273853 > 2.312741. We therefore reject the
null hypothesis and conclude that there are significant differences between
the Cal-I teachers, because their group means are not all the same.
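As a check on the reported F value, the sketch below recomputes it from the
sums of squares and degrees of freedom given in the table above. It rests on
the standard one-way ANOVA identities, in which the within-groups row is the
total minus the between-groups row and F is the ratio of the two mean
squares; the function name is our own.

```python
def anova_f(ss_between, df_between, ss_total, df_total):
    """Return the one-way ANOVA F statistic from between and total rows."""
    ss_within = ss_total - ss_between   # within-groups sum of squares
    df_within = df_total - df_between   # within-groups degrees of freedom
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


# Values of the Cal-I table: Between Groups and Total rows.
f_cal1 = anova_f(872.7731, 7, 1302.996, 39)
F_CRIT = 2.312741
reject_null = f_cal1 > F_CRIT
```

The recomputed value agrees with the reported F up to rounding of the table
entries, and it exceeds F crit, so the null hypothesis of equal group means is
rejected.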
Z-Score Data Analysis
Next, the Z-test is applied to evaluate the performance of every pair of
teachers, subject-wise. Generally, a Z-test is applied to two populations to
compare their means or proportions. In this case, we need to compare the
performance of teachers of the same course in order to recommend one to the
student. We have organized the Z-tests by subject and first consider the
results with respect to Cal-I. All the teachers who teach Cal-I have been
indicated with alphabetic characters, i.e., A, B, C, D, E, and F. Pairs, such
as teacher A with each of the other Cal-I teachers, are formed for the
experiment. Moreover, for the comparison, the teachers’ data has been divided
into two cases: the same number of students and different numbers of students.
Table 4.13: Two Samples for Mean with respect to teacher “E”
Z-score data analysis has been applied to the Cal-I teacher “E” with teacher
“B” on the basis of different numbers of students. The two-tailed Z-critical
value is greater than Z, from which we conclude that the difference in
performance of teacher “E” is not significant.
Then teacher “B” has been paired with teacher “A” on the basis of the same
number of students. The value of Z-critical is greater than Z, so the
difference in performance with teacher “A” is not significant. The results
have been shown in table 19. The Z-test has been applied in both cases, and
the results have been compared to see whether the performance of the teachers
remains the same or varies. In this way, the results indicate the performance
of every Cal-I teacher on the basis of the pairings. The result of the Z-test
is interpreted with this rule: if z < z-critical (two-tail), the performance
difference is not significant; if z > z-critical (two-tail), the performance
difference is significant.
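The decision rule can be sketched in Python with the standard library only.
The statistic below is the usual two-sample Z for means with known variances;
we assume that this matches the tool used to produce the tables, and the
variances in the example call are hypothetical because they are not reported
in this section.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, var1, n1, mean2, var2, n2, alpha=0.05):
    """Return (z, two-tailed critical value, whether |z| exceeds it)."""
    z = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_critical, abs(z) > z_critical


# Hypothetical class averages, variances, and class sizes of two teachers.
z, z_critical, significant = two_sample_z(77, 100, 50, 70, 100, 50)
```

With alpha = 0.05 the two-tailed critical value is about 1.96, which is the
Z-critical value that appears in the tables of this chapter.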
Table 4.14: Two samples for means with respect to teacher “B”
Here, we have applied Z-score data analysis to the Cal-I teacher “B” with
teacher “A” on the basis of the different numbers of students who enrolled in
different terms from term 141 to term 171. In this case, the value of Z
(3.687091 in Table 4.15) is greater than the two-tailed Z-critical value
(1.959964), which shows that the difference in performance between teacher “B”
and teacher “A” is significant.
Then teacher “B” has been paired with teacher “A” on the basis of the same
number of students, and the Z-test has been applied. The results show that the
value of Z-critical is greater than the value of Z, so the difference with
teacher “A” is not significant in this case.
Teacher B with different and same number of Students
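Z-score data analysis has been applied to the Cal-I teacher “B” with all other
teachers on the basis of different numbers of students. As Table 4.15 shows,
Z is below the two-tailed Z-critical value for teachers “C”, “H”, and “E”, so
the difference in performance between teacher “B” and these teachers is not
significant, whereas Z exceeds Z-critical for teachers “A”, “D”, “F”, and “G”,
so those differences are significant.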
Table 4.15: Teacher B with different number of Students
Teacher   Z          Z critical two-tail
A         3.687091   1.959964
C         1.235961   1.959964
D         3.480026   1.959964
F         2.829936   1.959964
G         3.218075   1.959964
H         1.61421    1.959964
E         1.271677   1.959964
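Applying the decision rule to the rows of Table 4.15 can be done mechanically.
The short sketch below uses the values of the table; the variable and function
names are our own.

```python
Z_CRITICAL_TWO_TAIL = 1.959964

# Z values of teacher "B" paired with each teacher (Table 4.15).
z_scores = {
    "A": 3.687091, "C": 1.235961, "D": 3.480026, "F": 2.829936,
    "G": 3.218075, "H": 1.61421, "E": 1.271677,
}


def split_by_significance(scores, z_critical):
    """Return (significant, not significant) teacher labels, sorted."""
    significant = sorted(t for t, z in scores.items() if abs(z) > z_critical)
    not_significant = sorted(t for t in scores if t not in significant)
    return significant, not_significant


significant, not_significant = split_by_significance(
    z_scores, Z_CRITICAL_TWO_TAIL
)
```

The pairs whose Z exceeds the critical value are those with teachers A, D, F,
and G; the pairs with teachers C, E, and H fall below it.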
Z-score data analysis has also been applied to the Cal-I teacher “B” with all
other teachers on the basis of the same number of students. The two-tailed
Z-critical value is greater than Z for teachers “A”, “C”, “D”, and “F”, so the
difference in performance is not significant for these teachers, whereas it is
significant for teachers “G” and “H”. The ANOVA test has been applied to Cal-I
only. The results for the rest of the courses have been computed similarly.
Pre-Qualification vs Programming Courses
In table 4.17, we present the results of all programming courses taught in the
1st and 2nd semesters. The course averages have been compared with the
Pre-Qualification.
computed. The performance of teachers changes with respect to the course
averages.
F 59.72643733 69.13777079 77
Figure 4.10: Comparison of Actual and Predicted Grades of ITC/ITP
Then the actual and predicted OOP/CP grades have been compared for every
teacher. The results show that there is a slight variation between them, as
stated in table 4.19 and presented in figure 4.11.
Figure 4.11: Comparison of Actual and Predicted Performance of
OOP/CP Teachers
The overall performance of OOP teachers has been presented in table 4.20 as
well as in figure 4.12.
The results show that the performance of teachers “E” and “F” is considerably
higher than that of the other teachers. Therefore, based on the acquired
results, we may recommend these two teachers to teach the courses in the
respective semesters.
Figure 4.12: Overall performance of OOP Teachers
These results show the average of all students in ITC/ITP and CP/OOP against
the name of the ITC/ITP teacher. During the computation of the results of
ITC/ITP and CP/OOP, we found the performance of teacher “F” to be better, so
this teacher can be recommended to teach ITC/ITP and CP/OOP.
ANOVA analysis is used for grouped data, since each subject has several
teachers. The average prediction shows that the teachers’ average performance
differs significantly. To verify this result, we apply ANOVA analysis to this
dataset.
If F > F crit, we reject the null hypothesis. This is the case here, since
8.542703 > 1.952212. Therefore, we reject the null hypothesis: the means of
the teacher groups are not all equal, and at least one of the means is
different. However, ANOVA does not tell us where the difference lies.
Therefore, a T-test or Z-test is applied to each pair of means.
Table 4.21: ANOVA Data Analysis on ITC/ITP
SUMMARY
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        1075.12    11   97.73818   8.542703   7.87E-09   1.952212
Within Groups         686.4678   60   11.44113
Total                 1761.588   71
have applied the Z-test to the combinations of teachers who teach ITC/ITP. We
have formed the combinations in the same way as for the teachers of Cal-I and
Cal-II. In table ..., the Z-test has been applied to each pair and tested in
two cases: the first compares the performance of teachers on the same number
of students, and the second compares it on different numbers of students.
Z-score data analysis for ITC/ITP has been applied to the pair of teacher “F”
and teacher “A” on the basis of different numbers of students. The results
show that there is a significant difference in performance between teacher “F”
and teacher “A”.
               Variable 1   Variable 2
Mean           77           61.48148
Observations   67           27
Z              4.116767
Z-score data analysis has been applied to the pair of ITC/ITP teacher “F” and
teacher “A” on the basis of the same number of students, and the results show
that there is a significant difference in performance between teacher “F” and
teacher “A”.
Table 4.23: Z-Score Data Analysis Teacher F with A same number of Students
               Variable 1   Variable 2
Observations   27           27
Z              3.359441
As before, Z-score data analysis has been computed for ITC/ITP teacher “F”
against the other teachers who teach the same subject on the basis of
different numbers of students. According to the summarized results, the
difference in performance between teacher “F” and each of the other teachers
is significant.
Likewise, Z-score data analysis has been computed for ITC/ITP teacher “F”
against the other teachers who teach the same subject on the basis of the same
number of students. According to the summarized results, the difference
between teacher “F” and each of the other teachers is again significant.
Table 4.24: Z-Score Data Analysis Teacher F with other different number
of Students
Table 4.25: Z-Score Data Analysis Teacher F with other same number of
Students
Summary
To show whether students’ performance or grades depend on the teacher’s
performance, we have worked through an example. We have selected the course
Cal-I to justify our analysis. For this task, we have taken the courses Cal-I
& Cal-II and ITC/ITP & CP/OOP. Using different approaches, and on the basis of
these results, we have arrived at the following conclusions.
Cal-I and ITC/ITP are taught by different teachers, and we obtained the
average marks of the students from session 141 to 163 in Cal-I and ITC/ITP
teacher-wise using five different experiments:
1. Average of the actual Cal-I marks based on the Cal-I teacher.
2. Prediction of the Cal-I marks excluding the teacher name, then taking the
average teacher-wise.
3. Prediction of the Cal-I marks including the teacher name, then taking the
average teacher-wise.
4. Average of the actual Cal-II marks based on the Cal-I teacher.
5. Prediction of the Cal-II marks, then computing the average of these marks
based on the Cal-I teacher.
Our predicted results were either higher than the actual results or slightly
lower than them. The results showed that all teachers have different averages
in these subjects and that the differences between the teachers’ performances
in these subjects are significant.
To verify these results, we used ANOVA analysis. The performance of the
teachers in CAL-I and ITC/ITP differs, and the differences between the
teachers’ averages are significant. A teacher whose performance appears better
might be lenient in giving grades to the students, and a teacher whose
performance is comparatively low might be strict in giving grades. The Z-test
confirmed these results and showed significant differences between teachers in
CAL-I and ITC/ITP. On the basis of these results, we recommend teacher “E” as
the most suitable teacher for CAL-I and teacher “F” for ITC/ITP to improve the
performance of students.
with their teachers. With the help of the recommender system, a university
department may be able to identify students’ performance and help them improve
it. To evaluate our prediction model and recommender system, we have answered
the research questions. We made the following key observations during the
course of our experiments.
Chapter 5
Our research has been organized into five chapters. The first chapter consists
of an introduction to the topic and the purpose of this research. The second
chapter contains the literature review, in which the related work of other
researchers has been presented in an accessible manner. The third chapter
contains the methodology, and chapter 4 contains a detailed discussion of the
experiments and results.
In the first chapter, we have briefly explained the background of our research
topic along with its scope, significance, and applications. Moreover, we have
discussed the research questions, which we constructed after critically
reviewing the literature. We have explored various data mining techniques that
were proposed and used by many researchers. The area of data mining is quite
versatile. Data mining means extracting intrinsic information from bulk
datasets. These datasets might belong to educational institutions, the
internet, or any organization. With the help of mining techniques such as
classification, we can analyze data for prediction and measure the accuracy.
Researchers have worked extensively on data mining in education, using
students’ academic records for specific purposes, most commonly grade
prediction and the evaluation of students’ and teachers’ performance.
In this regard, we obtained existing factors from the literature, such as
institutional factors and demographic factors. These factors are called
existing because they have already been applied in western universities. We
applied these factors in our local context. We collected data from the Capital
University of Science and Technology, Pakistan. We used the data of the BSCS
program from six terms of spring, which covers three years of data. In this
dataset, we used the data of first-semester students, and their final grade was
predicted based on their midterm marks, term marks, gender, age, Matric grade
and Inter grade. These attributes cover the demographic, institutional and
pre-university attributes of the students. In addition, other researchers have
used only two factors: demographic and pre-university. In our experiments, we
considered an additional factor, the institutional factor, in which midterm and
term marks are combined to predict the student's final GPA/grade. This is how
we proposed a prediction model on the basis of the attributes associated with
students.
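As an illustration of how the attributes listed above could be grouped by
factor, the following is a minimal Python sketch. The field names, the mapping
of attributes to factors and the sample values are assumptions made for this
example and are not taken from the actual dataset.

```python
from dataclasses import dataclass

# Assumed grouping of the attributes named in the text by factor.
FACTOR_GROUPS = {
    "demographic": ["gender", "age"],
    "pre_university": ["matric_grade", "inter_grade"],
    "institutional": ["midterm_marks", "term_marks"],
}


@dataclass
class StudentRecord:
    gender: str
    age: int
    matric_grade: str
    inter_grade: str
    midterm_marks: float
    term_marks: float
    final_grade: str  # prediction target


def select_features(record, factors):
    """Return the attribute values that belong to the requested factors."""
    return {
        name: getattr(record, name)
        for factor in factors
        for name in FACTOR_GROUPS[factor]
    }


# Illustrative record; the values are made up.
student = StudentRecord("F", 19, "A", "B", 21.0, 14.5, "B")

# Two-factor input versus the three-factor input described in the text.
two_factor = select_features(student, ["demographic", "pre_university"])
three_factor = select_features(
    student, ["demographic", "pre_university", "institutional"]
)
```

Selecting features by factor in this way makes it straightforward to run the
same classifier once with two factors and once with all three, and to compare
the resulting accuracy.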
RQ1. Are the existing identified factors for grade prediction valid in our
local context (Pakistan)?
The answer to this research question is yes, because we obtained considerable
accuracy on all data analysis measures. These analyses were applied to
categorical and nominal data. The results in both categories are the same,
which shows the validity of the existing factors in the local context. All
three factors contributed to increasing the accuracy when the data was given to
the WEKA tool in ARFF format and the classification algorithms SMO and Linear
Regression were applied.
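For readers unfamiliar with the input format expected by WEKA, the sketch below
shows one possible way to render student records as ARFF text in Python. The
keywords @relation, @attribute and @data belong to the ARFF format itself; the
relation name, attribute names, nominal values and the sample row are
hypothetical and only serve as an example.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute %s %s" % (name, kind))
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute declarations and one made-up row.
attributes = [
    ("gender", ["F", "M"]),
    ("age", "numeric"),
    ("midterm_marks", "numeric"),
    ("term_marks", "numeric"),
    ("final_grade", ["A", "B", "C", "D", "F"]),
]
rows = [("F", 19, 21.0, 14.5, "B")]

arff_text = to_arff("students", attributes, rows)
```

The resulting text can be saved to a file with the .arff extension and opened
in WEKA, where the last attribute is typically chosen as the class to predict.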
performance with the pairing of each teacher belonging to that particular
course. In this way, we were able to identify the teachers whose performance is
better than that of other teachers who teach the same course. We then computed
the overall average performance of each teacher based on the overall average
grade of the students they taught.
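The averaging described above can be sketched in Python as follows. This is
only an illustrative outline under assumed inputs: the record layout, the
course and teacher identifiers and the grade point values are hypothetical.

```python
from collections import defaultdict


def average_by_key(records, key):
    """Average the grade points of all records that share the same key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for record in records:
        k = key(record)
        totals[k][0] += record["grade_points"]
        totals[k][1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


def best_teacher(per_pair, course):
    """Return the teacher with the highest average grade for a course."""
    candidates = {t: avg for (c, t), avg in per_pair.items() if c == course}
    return max(candidates, key=candidates.get)


# Made-up records: one row per student result.
records = [
    {"course": "CS101", "teacher": "T1", "grade_points": 3.0},
    {"course": "CS101", "teacher": "T1", "grade_points": 4.0},
    {"course": "CS101", "teacher": "T2", "grade_points": 2.0},
    {"course": "CS102", "teacher": "T1", "grade_points": 2.0},
]

# Average grade for each (course, teacher) pairing.
per_pair = average_by_key(records, lambda r: (r["course"], r["teacher"]))

# Overall average grade of all students taught by each teacher.
per_teacher = average_by_key(records, lambda r: r["teacher"])
```

Here best_teacher(per_pair, "CS101") compares only the teachers of the same
course, while per_teacher gives the overall average across all courses a
teacher has taught.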
With the help of prediction experiments, we predicted grades both including and
excluding the name of the teacher who taught the course. The predicted and
actual results were then computed and compared. In this way, we obtained the
entire set of results, in which our analysis shows that the predicted and
actual results were close at some points and the predicted results were better
at others. At the same time, the predicted results were in some cases a bit
lower than the actual ones, as can be seen in Chapter 4.
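A comparison of this kind can be summarized with a short Python sketch such as
the one below. It is not the procedure used in the experiments; the tolerance
that defines "close" and the grade point values are assumptions chosen for
illustration.

```python
def compare_predictions(actual, predicted, tolerance=0.25):
    """Count predictions that are close to, higher than or lower than actual.

    tolerance is an assumed threshold for treating two values as close.
    """
    summary = {"close": 0, "higher": 0, "lower": 0}
    for a, p in zip(actual, predicted):
        diff = p - a
        if abs(diff) <= tolerance:
            summary["close"] += 1
        elif diff > 0:
            summary["higher"] += 1
        else:
            summary["lower"] += 1
    return summary


def mean_absolute_error(actual, predicted):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up grade points for four students.
actual = [3.0, 2.0, 3.5, 2.5]
predicted = [3.0, 2.5, 3.0, 2.5]

summary = compare_predictions(actual, predicted)
error = mean_absolute_error(actual, predicted)
```

Running the same comparison once for predictions made with the teacher's name
and once without it would show whether that attribute moves the predictions
closer to the actual grades.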
REFERENCES
Azarcon Jr, D. E., Gallardo, C. D., Anacin, C. G., & Velasco, E. (2014).
Attrition and retention in higher education institution: A conjoint analysis
of consumer behavior in higher education. Asia Pacific Journal of
Education, Arts and SCIENCE, 1(5), 107-118.
71
Cabrera, A. F., Nora, A., & Castaneda, M. B. (1993). College persistence:
Structural equations modeling test of an integrated model of student
retention. The journal of higher education, 64(2), 123-139.
Elbadrawy, A., Polyzou, A., Ren, Z., Sweeney, M., Karypis, G., &
Rangwala, H. (2016). Predicting Student Performance Using Personalized
Analytics. Computer, 49(4), 61-69.
Hung, J. L., & Zhang, K. (2008). Revealing online learning behaviors and
activity patterns and making predictions with data mining techniques in
online teaching. MERLOT Journal of Online Learning and Teaching.
Inoue, S., Rodgers, P. A., Tennant, A., & Spencer, N. (2017). Reducing
Information to Stimulate Design Imagination. In Design Computing and
Cognition'16 (pp. 3-21). Springer, Cham.
72
Kabakchieva, D. (2013). Predicting student performance by using data
mining methods for classification. Cybernetics and Information
Technologies, 13(1), 61-72.
Kokina, J., Pachamanova, D., & Corbett, A. (2017). The role of data
visualization and analytics in performance management: Guiding
entrepreneurial growth decisions. Journal of Accounting Education.
Kovacic, Z. (2010). Early prediction of student success: Mining students'
enrolment data.
Merceron, A., & Yacef, K. (2005, May). Educational Data Mining: a Case
Study. In AIED (pp. 467-474).
73
Different Case Studies). Computer Engineering and Applications
Journal, 3(2), 79-88.
Polyzou, A., & Karypis, G. (2016). Grade prediction with models specific
to students and courses. International Journal of Data Science and
Analytics, 2(3-4), 159-171.
Ramist, L. (1981). College student attrition and retention. Romero, C., &
Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005.
Expert systems with applications, 33(1), 135-146.
Romero, C., & Ventura, S. (2010). Educational data mining: a review of the
state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part
C (Applications and Reviews), 40(6), 601-618.
Sachin, R. B., & Vijay, M. S. (2012, January). A survey and future vision
of data mining in educational field. In Advanced Computing &
Communication Technologies (ACCT), 2012 Second International
Conference on (pp. 96-100). IEEE.
74
Tewari, A. S., Saroj, A., & Barman, A. G. (2015). e-Learning
Recommender System for Teachers using Opinion Mining. In Information
Science and Applications (pp. 1021-1029). Springer Berlin Heidelberg.
Yang, D., Sinha, T., Adamson, D., & Rosé, C. P. (2013, December). Turn
on, tune in, drop out: Anticipating student dropouts in massive open online
courses. In Proceedings of the 2013 NIPS Data-driven education
workshop(Vol. 11, p. 14).
Zhang, Y., Oussena, S., Clark, T., & Hyensook, K. (2010). Using data
mining to improve student retention in HE: a case study.
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., &
Cunningham, S. J. (1999). Weka: Practical machine learning tools and
techniques with Java implementations.
75