Shahnoor Et Al. Yamama Conf PDF


Using Logistic Regression to Predict Secondary School Student Performance


Shahnoor Ali, Hasan Sarwar, Hala Yasmin, Umar Hayyat, Zain-ul-Abidin, Muhammad Rehman Shahid
Department of Computer Science, National Textile University, Faisalabad, Pakistan

Abstract— Education is one of the principal foundations of a student's advancement and of a nation's human resource development. Failure at college and grade retention are major concerns for students and their parents. This study assesses student failure in the core subject Mathematics and the factors that influence it. Logistic Regression was applied to the dataset: the model was trained and tested, and its predictions were then evaluated for accuracy. The whole dataset was first transformed from nominal to numeric values and then into binary classes of 0 or 1 (whether the student failed or not). Absences, weekly study time, age (ranging from 15 to 19), and going out with friends are the root factors in student failure. The study was performed on a total of 395 students, among whom mostly male students were classified as having failed in Mathematics. The accuracy of the results was assessed through a classification report and a confusion matrix. The confusion matrix gives an accuracy of about 86%. This study highlights the prevalence of multi-factorial contributors to college failure, such as social and health-related factors; social factors were found to be more prevalent. The focus of this paper is the application of Logistic Regression and an assessment of how accurately the model predicts the results. An accuracy of about 86% is obtained, indicating a good fit.

Keywords— Machine Learning, Prediction, Performance, Students, Logistic Regression.

I. INTRODUCTION
Our education system today is in a state of obvious disrepair. The wave of student failures that erupted during the last quarter of 2018 is convulsing the country's social and economic development. Because students are the most critical asset of an educational institution, their performance is essential to producing graduates who will become the leaders, pioneers, and workers of a nation, responsible for its social and financial progress. Educating proficient and effective human resources is considered among the primary obligations of colleges. Every year, colleges graduate students and admit newcomers; in this constant cycle, the quality of education has a critical position. Hence, improving the quality of the educational system is viewed as the most influential factor in developing a nation, because through academic achievement students reach a position in which their greatest internal and external abilities are used to accomplish the objectives of higher education and to acquire the fundamental conditions for a successful public life. Conversely, a lack of achievement in education clears the ground for individual and social problems and for deviation from the objectives of the educational system. The reasons for academic failure include family, health, and financial issues that lead to a lack of enthusiasm for institutions, motivational and physiological issues, and psychological and neurological barriers to learning, all of which waste current expenditure and time. High rates of college failure have been followed by grade repetition, which has become a distinctive feature of colleges even in developing nations. This is not the first time that student failure has been evaluated; various studies have assessed failure in the presence of social, mental, health, and school-related factors. Smoking, drinking, and drug abuse can be among the factors of student failure, with the consequence that a student loses self-confidence, becomes discouraged, and puts less effort into their work. Other outcomes follow, such as truancy, dropping out, repeating the grade, or lower educational attainment. It has been observed that a significant number of students (about 20%) are academically weak and fail to achieve good marks. Measures to assist students who are encountering academic failure fall into diverse categories such as prevention, intervention, and remediation. Considering the above points, student failure is of crucial importance at this hour, as advancements have been made in developed regions such as Europe.
Moreover, to keep pace with the progress of these countries, new techniques, proficiency in a particular field, and craftsmanship must be introduced, and rote learning (cramming) must be halted in our education system. The factors that positively affect students' academic performance include the teacher-student relationship, teachers' welfare (salaries paid on time), home background (parents' ability to supervise their children regarding admission and knowledge), a friendly principal-teacher relationship, and effective teacher supervision [1]. If these factors are considered by all the people who are accountable for, and indispensable to, students' success and failure, then students' mental, psychological, social, and
family problems can be eliminated. In this paper, we apply Logistic Regression to a dataset on the failure of Portuguese students, collected by Portuguese researchers. The dataset is reduced to 13 variables, chosen according to the feature selection of five research papers by integrating their common features. We have assessed students' failures on the basis of the causes and rationale given by other researchers for their own datasets. This paper draws on at least five closely related research papers with different findings, variables, techniques, and algorithms, but the same question: what are the reasons and causes for the failure of students? Our goal was to apply a different algorithm to the dataset and check the results and their accuracy through a classification report. The Portuguese researchers applied Correlation, Random Forest, and Decision Tree to their dataset [2]. In contrast to these algorithms, we chose Logistic Regression.
The paper is organized as follows. Section 1 introduces the importance of education and students' academic performance. Section 2 surveys the literature on students' academic performance and the factors influencing it. Section 3 covers the materials and methods used in this paper, including the student variables, statistical techniques, data mining algorithm, and the computational environment in which the model is tested. Section 4 presents the results of the algorithm and its accuracy. Section 5 concludes the paper.

II. LITERATURE SURVEY
Many studies have been conducted on students' academic performance [3][4][5][6]. According to studies, about 20% of students fail at least once during their education. This failure not only causes psychological and mental issues for them but also puts them at risk of educational deprivation and hardship with regard to their academic progress, harms the optimum use of scientific principles for training human resources and of financial resources, and furthermore causes social disappointment [7]. Moreover, students' dropout and academic failure cause difficulties and challenges for the students themselves, along with a colossal loss for the nation. Research has shown that students with academic failure are more likely to use drugs at older ages; thus, dropout and academic failure might lead to alcohol and drug addiction [8].
A study of students at one of the Arabian colleges [9] who committed suicide or experienced coma/fainting, cardiovascular diseases, asthma, or visual problems showed academic failure as the most common reason for their condition [7]. Various studies have proposed that different factors can lead to academic failure; some have considered the use of illicit drugs, and others have shown that personality factors, incentives, interest, fulfilment, abandonment, achievement desire, and family circumstances can influence the level of academic success in colleges [2]. In a comprehensive scheme, the variables associated with college failure can be organized into three classes: internal organizational factors (professional characteristics of instructors, space, and appropriate facilities and equipment) [10]; external organizational factors (guardians' education level and how they deal with students' academic failure, the financial situation of families, and unclear and uncertain job prospects [1]); and individual factors (such as having an objective, motivation, planning, study strategy, intelligence, attention, anxiety, affective disorder and mental problems, and lack of attendance [7]). The quality of education may take a back seat to its quantity. There are too many endeavors that leave our education system hitting rock bottom in quality. According to studies, this issue is escalating each year, to the point that numerous students cannot manage the curriculum or finish it in due course [11].
In the last few years [12][13], several significant studies have been carried out to develop different models for assessing students' performance by considering factors such as family income, direction from parents, the teacher-student relationship, school distance, and the sex of the students, but these studies have not investigated learning structures, communication skills, or proper parental guidance. Table 1 lists several research papers that describe the causes and factors of students' academic failure, together with the variables used, the author names, the year of publication, the sample size, the statistical analysis, and the tools in which the algorithms were implemented [11][14].

III. MATERIAL AND METHODS

STUDENT DATA:
Table 3 comprises 13 variables that include sex, age, cohabitation status of parents, mother's education, father's education, weekly study time, number of past class failures, extra-curricular activities, desire to obtain higher education, access to the internet at home, being in a romantic relationship, quality of family relationships, meeting with friends, current state of health, and number of school absences. Some variables were numeric, and others were nominal. To apply Logistic Regression, data wrangling was performed on the dataset. Some of the nominal variables were converted to numeric and then to dichotomous values, i.e. 1 or 0. The above-mentioned variables were further screened according to their significance for failure. Only those variables that were closely correlated with failure were transformed, i.e. sex, weekly study time, internet access at home, extra-curricular activities, number of past class failures, cohabitation status of parents, desire to obtain higher education, and being in a romantic relationship (Table 2). We work only with data in numerical form, because string values would be difficult to convert to floats when evaluating the final results. Table 3 describes the variables used in the dataset, i.e. their category, their meaning in 0 or 1 form, and what they represent.
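The recoding of nominal attributes into 0/1 values can be sketched in Python, the language the study reports using. This is a minimal illustration under stated assumptions, not the authors' code: the 0/1 codes follow Table 3, while the raw labels ("F"/"M", "yes"/"no", "A"/"T"), the lowercase column names, and the `encode_student` helper are hypothetical. Table 3 gives codes for only two study-time levels, so the remaining levels are left as `None` rather than guessed.

```python
# 0/1 codes taken from Table 3; the raw string labels are assumed.
BINARY_CODES = {
    "sex": {"F": 1, "M": 0},
    "internet": {"no": 1, "yes": 0},
    "activities": {"no": 1, "yes": 0},
    "Pstatus": {"A": 1, "T": 0},      # A = apart, T = living together (assumed labels)
    "higher": {"yes": 1, "no": 0},
    "romantic": {"yes": 1, "no": 0},
}


def encode_student(record):
    """Return a copy of one student record with Table 3's 0/1 coding applied."""
    encoded = dict(record)
    for column, codes in BINARY_CODES.items():
        if column in encoded:
            encoded[column] = codes[encoded[column]]
    if "failures" in encoded:
        # Table 3: 1 = no past failure, 0 = one or more past failures.
        encoded["failures"] = 1 if encoded["failures"] == 0 else 0
    if "studytime" in encoded:
        # Table 3 codes only level 2 (2 to 5 hours) and level 3 (5 to 10 hours).
        encoded["studytime"] = {2: 1, 3: 0}.get(encoded["studytime"])
    return encoded


example = {"sex": "F", "internet": "yes", "Pstatus": "T",
           "failures": 2, "studytime": 2, "age": 17}
print(encode_student(example))
```

Numeric attributes such as age pass through unchanged, which mirrors the blank "For Logistic Regression" entries in Table 3.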
Table 1: Literature Survey

Author (Year) | Context | Variables | Sample Size | Statistical Analysis | Analysis Tools
P. Cortez (2008) | Failure of students | 32 | 650 | Classification, Regression | RMiner
Madeeha (2009) | Child's failure in school and grade retention | 64 | 699 | Simple random sampling method | Excel
C. Gbollie (2017) | Causes and reasons for the failure of students | 13 | 323 | Correlation, Mean, Standard Deviation | Statistical Package for the Social Sciences (SPSS 17.0)
Irfan Mushtaq (2012) | Factors contributing to failures of students | 5 | 155 | Mean, Standard Deviation, Correlation | Appropriate statistical package
L. Kalagbor (2012) | Factors positively influencing students' academic performance | 10 | 650 | Frequency, Percentage, Mean | Excel

DATA MINING MODEL:

This study is based on information gathered during the 2005-2006 school year from two public schools in the Alentejo region of Portugal. P. Cortez and A. Silva [2] estimated student failure using data mining techniques that include Classification and Regression, Decision Trees, and Random Forests, integrating Business Intelligence techniques in education. Regression analysis is a predictive modeling technique: it estimates the relationship between an independent variable (predictor) and a dependent variable (target). In this paper, the estimation is predictive rather than prescriptive, as Logistic Regression is used to solve classification problems, not regression problems [15].
In this paper, Logistic Regression is applied to the dataset of 395 students after selecting 13 of the 32 variables, and the dataset concerns only failure in Mathematics. Failure of students in the Portuguese language is not estimated. The variables extracted are those that influence failure the most. This algorithm produces results in a binary format and is used to predict the outcome of a categorical dependent variable. Logistic Regression is a statistical technique used in research projects that require analysis of the relationship between a dependent variable (outcome) and one or more independent variables (predictors) when the dependent variable is (a) dichotomous, with only two categories, for instance whether one has failed (yes or no); (b) unordered polytomous, a nominal-scale variable with three or more categories, for example the quality of family relationships (from 1 - very bad to 5 - excellent); or (c) ordered polytomous, an ordinal-scale variable with three or more categories, for example the completed level of education (e.g., less than primary school, primary school, secondary school, an undergraduate degree, or a post-graduate degree). Logistic regression was employed to study the relationship between the failure status of the student and their age, gender, study time (intensity of course), and extra-curricular activities other than studying Mathematics. The failure status of a student was coded as never failing in any subject (1) or failing in at least one subject (0). The histogram in Figure 1 shows that students with one or more failures in Mathematics are fewer than students with no failures, and that the failure rate is higher among males than among females (Figure 3). Mathematics is arduous for these male students because activities such as going out with friends leave them little study time. As this subject requires the most attention and hard work, which they do not give, they fail the subject.
Figure 2 shows how strongly these two factors (going out with friends and failure) influence each other. The colors in Figure 2 indicate the number of class failures: red (3 class failures), green (2 class failures), yellow (1 class failure), and blue (no failure). Logistic Regression is derived from the straight-line equation (1), whose output is then restricted to the range 0 to 1, resulting in equation (2). In this way, the predictions of Logistic Regression take the form of probabilities of an event occurring, i.e., the likelihood of y = 1 given specific values of the input variables x. Hence, the results of LogR range between 0 and 1. LogR models the data using the standard logistic function, an S-shaped curve given by equations (3) and (4).
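As a small numerical illustration of the logistic function referred to in equations (3) and (4), the sketch below checks that the two forms in equation (3) agree and that the log-odds in equation (4) invert the S-shaped curve. The function names and the coefficient values are made up for illustration; they are not estimates from the study.

```python
import math


def sigmoid(x):
    """Left-hand form of equation (3): 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_alt(x):
    """Right-hand form of equation (3): e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def logit(p):
    """Log-odds from equation (4): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def predict_probability(intercept, coefs, features):
    """Map the linear predictor C + B1*X1 + B2*X2 + ... into the range (0, 1)."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    assert math.isclose(sigmoid(x), sigmoid_alt(x))           # both forms of (3) agree
    assert math.isclose(logit(sigmoid(x)), x, abs_tol=1e-9)   # (4) undoes the curve

# Hypothetical coefficients and one student with X1 = 1, X2 = 0.
p = predict_probability(-0.5, [1.2, -0.8], [1, 0])
print(round(p, 3))
```

Whatever the size of the linear predictor, the output stays strictly between 0 and 1, which is why it can be read as a probability and thresholded into a failed / not failed class.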

Figure 1: Histogram of Respective Failure


Table 2: Student Data Demographics

Table 3: Description of Student Data Variables

Attribute | Description | For Logistic Regression
Age | Age of student (numeric: from 15 to 22) | ____________
Famrel | Quality of family relationships (numeric: from 1 – very bad to 5 – excellent) | ____________
Gout | Going out with friends (numeric: from 1 – very low to 5 – very high) | ____________
Health | Current health status (numeric: from 1 – very bad to 5 – very good) | ____________
Absences | Number of school absences (numeric: from 0 to 93) | ____________
Sex | Gender of student (binary: female or male) | 1 = female; 0 = male
StudyTime | Weekly study time (numeric: 1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours, or 4 – > 10 hours) | 1 = 2 to 5 hours; 0 = 5 to 10 hours
Internet | Internet access at home (binary: yes or no) | 1 = no; 0 = yes
Activities | Extra-curricular activities (binary: yes or no) | 1 = no extra-curricular activities; 0 = yes to extra-curricular activities
Failures | Number of past class failures (numeric: n if 1 ≤ n < 3, else 4) | 1 = no failure; 0 = one or more failures
PStatus | Cohabitation status of parents (binary: apart or living together) | 1 = parents are apart; 0 = parents are living together
Higher | Wants to take higher education (binary: yes or no) | 1 = yes to higher education; 0 = no to higher education
Romantic | In a romantic relationship (binary: yes or no) | 1 = in a romantic relationship; 0 = not in a romantic relationship
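With the attributes coded as in Table 3, the procedure this paper describes (a 70:30 split, a Logistic Regression fit, predictions on the test data, and accuracy from a confusion matrix) can be sketched as follows. This is a hedged, standard-library stand-in, not the study's code: the study reports using Python in Anaconda, whereas this sketch swaps in a plain gradient-descent fit so that it runs without extra packages. The data are synthetic, and all function names, the learning rate, and the epoch count are assumptions. The last lines apply the same accuracy arithmetic to the counts reported in Table 4.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def split_rows(rows, labels, test_size=0.3, seed=0):
    """Shuffle, then hold out ceil(test_size * n) rows for testing."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = math.ceil(test_size * len(rows))
    test_ids, train_ids = order[:n_test], order[n_test:]
    return ([rows[i] for i in train_ids], [rows[i] for i in test_ids],
            [labels[i] for i in train_ids], [labels[i] for i in test_ids])


def fit_logistic(rows, labels, lr=0.1, epochs=200):
    """Batch gradient descent on the log-loss (no regularization)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, rows):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return [1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0
            for x in rows]


def confusion(actual, predicted):
    """Counts keyed by (actual, predicted); (0, 1) is actual no, predicted yes."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def accuracy(counts):
    """Correct predictions (the diagonal) divided by all predictions."""
    return (counts[(0, 0)] + counts[(1, 1)]) / sum(counts.values())


# Synthetic stand-in: 395 rows of three binary attributes (NOT the study's data).
rng = random.Random(42)
X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(395)]
y = [1 if rng.random() < sigmoid(2 * x[0] - x[1] + 0.5) else 0 for x in X]

X_train, X_test, y_train, y_test = split_rows(X, y)
w, b = fit_logistic(X_train, y_train)
cm = confusion(y_test, predict(w, b, X_test))
print(len(y_test), cm, round(accuracy(cm), 3))

# Counts as reported in Table 4: AN/PN = 5, AN/PY = 16, AY/PN = 1, AY/PY = 97.
table4 = {(0, 0): 5, (0, 1): 16, (1, 0): 1, (1, 1): 97}
print(round(accuracy(table4), 3))  # 102 / 119 -> 0.857
```

A 0.3 test share of 395 rows gives 119 held-out rows, which matches the total of the four counts in Table 4.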

Figure 3: Histogram of Failures against gender



Figure 2: Box-Whisker Plot of Failures against Going Out with Friends


\[ Y = C + B_1 X_1 + B_2 X_2 + \cdots, \qquad -\infty < Y < \infty \tag{1} \]

By reducing:

\[ Y = C + B_1 X_1 + B_2 X_2 + \cdots \;\longrightarrow\; 0 \le Y \le 1 \ \text{in the logistic equation} \tag{2} \]

\[ \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}} \tag{3} \]

\[ \log\left[\frac{Y}{1 - Y}\right] = C + B_1 X_1 + B_2 X_2 \tag{4} \]

Equation (4) is used to predict values of y. Predominantly, it classifies students and reports whether they failed or not, in view of the factors that influence failure the most. Results are predicted in the form of a sigmoid curve, which is used to predict the values of y. Training the data in order to apply logistic regression to the dataset requires several lines of code. The model is built on the training data, and the output is predicted on the test data. The output (predicted) value is "Failure," and all the other values are taken as input. Failure is predicted against each input. The dependent and independent variables are passed to the split function, with the split size set to 0.3 (70:30 ratio). First, the model is created and fitted; then predictions are made, and the model is evaluated to answer the question of how well it is performing. Performance can be checked through the classification report and accuracy. The classification report is given in Figure 4. The f1-score is the harmonic mean of recall and precision. The scores for each class indicate the accuracy of the classifier in classifying the data points of that particular class compared to all other classes. The support is the number of samples of the true responses that lie in that class.

Accuracy is estimated with the help of a confusion matrix. The confusion matrix has four labels, i.e. predicted no (PN), predicted yes (PY), actual no (AN), and actual yes (AY). In Table 4, the accuracy of the model is obtained by adding the values at the intersections of PN with AN and of PY with AY, then dividing by the total number of values: (5 + 97) / 119 ≈ 0.86. The accuracy score is 86%, which tells us that our results are 86% accurate.

Figure 4: Classification Report of y-test and predictions

Table 4: Confusion Matrix for the Predictions

     PN   PY
AN    5   16
AY    1   97

COMPUTATIONAL ENVIRONMENT:
All experiments reported in this study were conducted using a Jupyter Notebook in Python Anaconda, a free and open-source distribution of the Python and R programming languages for scientific computing (machine learning applications, data science, predictive analysis, large-scale data processing, etc.), which aims to simplify deployment and package management. Anaconda Navigator is a desktop GUI (Graphical User Interface) included in the Anaconda distribution that enables users to launch applications and manage Anaconda packages, environments, and channels without using command-line commands. Navigator can search for packages in Anaconda Cloud or a local Anaconda Repository, install them in an environment, run packages, and update them. It is available for Windows, macOS, and Linux.

IV. RESULTS
In this research, a total of 395 students were examined for academic failure. The study group consists of 208 females and 187 males. Before fitting the model, some preprocessing was required by the Logistic Regression model. The nominal variables (e.g., sex, Pstatus, activities, higher, internet, romantic) were transformed into numeric values, and all attributes were in 0 or 1 form. Next, the dataset was split in a 70:30 ratio and the model was fitted (LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)). The accuracy of the predictions was estimated through a classification report comprising precision, recall, f1-score, and support (Figure 4). As an example of the quality of the predictions, Table 4 shows the confusion matrix for the Logistic Regression algorithm. 86% of the values are predicted accurately.

V. CONCLUSION
Education is in an even more crucial position today. Today's students can face future challenges if their school activities and informal learning prepare them for adult roles such as citizens, employees, managers, parents, volunteers, and entrepreneurs. In this paper, we have addressed the prediction of the grades of secondary school students in a core class, i.e., Mathematics, by using previous
school grades and demographic, social, and other school-related data. First, the data was analyzed, then trained and tested. The algorithm, i.e. Logistic Regression, was applied to the dataset. We conclude that past academic performance, extra-curricular activities, and going out with friends are social factors that contribute to academic failure. Social factors do count when it comes to academic failure. The applied algorithm predicted the results with an accuracy of 86%.

ACKNOWLEDGMENT
We are thankful to our instructor, Dr. Sohail Jabbar, for his insight, guidance, and suggestions during this research work, which is part of our undergraduate coursework. He motivated and encouraged us and helped us gather the dataset to complete this work.

REFERENCES
[1] Ayimah J and Agbotse G 2012 An analysis of factors influencing students’ academic performance in Ho Polytechnic Journal of Polytechnics in Ghana (Jopog) vol 5 pp 113–32
[2] Cortez P and Silva A 2008 Using Data Mining To Predict Secondary School Student Performance Proc. 5th Annu. Futur. Bus. Technol. Conf. 2003 5–12
[3] Gbollie C and Keamu H P 2017 Student Academic Performance: The Role of Motivation, Strategies, and Perceived Factors Hindering Liberian Junior and Senior High School Students Learning Educ. Res. Int. 2017 1–11
[4] Wibawa A P, Mushtaq I and Khan S N 2014 The Relationship Between Background Education, Socio-Demographic And Lifestyle Factors And Academic Performance J. Holist. Nurs. Midwifery 27 65–73
[5] Farhan M, Jabbar S, Aslam M, Hammoudeh M, Ahmad M, Khalid S, Khan M and Han K 2018 IoT-based students interaction framework using attention-scoring assessment in eLearning Futur. Gener. Comput. Syst.
[6] Gómez-Aguilar D A, Hernández-García Á, García-Peñalvo F J and Therón R 2015 Tap into the visual analysis of customization of a grouping of activities in eLearning Comput. Human Behav. 47 60–7
[7] Alsaimary I, Al-Sadoon M, Jassim A and Hamadi S 2009 Clinical findings and prevalence of helicobacter pylori in patients with gastritis B in Al-basrah governorate Oman Med. J. 24 208–11
[8] Farhan M, Jabbar S, Aslam M, Ahmad A, Iqbal M M, Khan M and Maria M-E A 2017 A Real-Time Data Mining Approach for Interaction Analytics Assessment: IoT Based Student Interaction Framework Int. J. Parallel Program.
[9] Semerci A and Aydın M K 2018 Examining High School Teachers’ Attitudes towards ICT Use in Education Int. J. Progress. Educ. 14 93–105
[10] Farhan M 2011 An Interactive Assessment Methodology to Enhance Teaching Performance and Student Experience in e-Learning Environment
[11] Farhan M, Aslam M, Jabbar S and Khalid S 2016 Multimedia based qualitative assessment methodology in eLearning: student teacher engagement analysis Multimed. Tools Appl. 1–15
[12] Turkish T, Journal O and Technology E 2010 TOJET: The Turkish Online Journal of Educational Technology – January 2010, volume 9 Issue 1 9 176–84
[13] Dickinson S J 2013 Shape Perception in Human and Computer Vision
[14] Iqbal M M, Farhan M, Saleem Y and Aslam M 2014 Automated Web-Bot Implementation using Machine Learning Techniques in eLearning Paradigm 4 90–8
[15] Paul A, Ahmad A, Rathore M M and Jabbar S 2016 Smartbuddy: Defining human behaviors using big data analytics in social internet of things IEEE Wirel. Commun. 23 68–74
