Major Project Report Sem 7
A PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
IN
September 2022
C.V RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054
CERTIFICATE OF APPROVAL
This is to certify that we have examined the project entitled "Student Academic
Prediction App" submitted by Saikat Chowdhury, Registration No.-1901227319,
Rudra Shankar, Registration No.- 1901227316, Sandesh Kumar, Registration No.-
1910227321, CGU-Odisha, Bhubaneswar. We hereby accord our approval of it as a
major project work carried out and presented in a manner required for its acceptance
towards completion of major project stage-I (7th Semester) of the Bachelor's Degree in
Computer Science & Engineering for which it has been submitted. This approval
does not necessarily endorse or accept every statement made, opinion expressed or
conclusion drawn as recorded in this major project; it only signifies the acceptance of
the major project for the purpose for which it has been submitted.
Every educational organization aims to provide sound and useful knowledge to its
students. Many educational institutions are investing in educational data mining
to predict student academic performance from previous marks. However, with the
rapid growth of recent technologies, students are increasingly distracted by
social media, and during the COVID-19 pandemic many students faced mental health
issues that affected their academic performance. Our project therefore considers
students' academic data together with their personal and psychological data to
predict their performance more accurately using the linear regression algorithm.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
3. METHODOLOGY
4. SOURCE CODE
7. REFERENCES
LIST OF FIGURES
Quality education is essential for the development of a country. The amount of data
in the education domain is increasing every day with emerging e-learning technologies.
Because the volume of student academic data is so large, most of it remains unused. Data
mining is an efficient way to extract useful information from huge data sets through
knowledge discovery in databases. It is used in multiple domains, including medicine,
banking and education, where it is called educational data mining. This app focuses
on this unused data to predict student academic outcomes. Stakeholders in this
domain want an early warning system that predicts learning outcomes at an early
stage.
To analyze the data, well-known algorithms such as Artificial Neural
Network (ANN), Decision Tree, Regression Analysis and K-Nearest Neighbor (KNN) are
used for prediction. The uniqueness of the model lies in its ability to combine
psychological and personal data with academic data. The objective of the model is
to predict the percentage marks of the student with the highest possible accuracy.
2. Previous Work
Dorina et al. [1] proposed a predictive model for student performance that classifies
students into two classes (successful/unsuccessful). The proposed model was
constructed under the CRISP-DM (Cross Industry Standard Process for Data Mining) research
approach. The classification algorithms OneR, J48, MLP and IBK were applied to the given
dataset. The results show that the highest accuracy (73.59%) was achieved by the MLP model
for the identification of successful students, while the other three models performed
better at identifying unsuccessful students. The model was unable to handle high data
dimensionality and class balancing problems. Edin Osmanbegovic et al. [2] built a model
to predict student academic success in a course while reducing the data dimensionality
problem. Various machine learning classifiers such as NB, MLP and J48 were
evaluated in this study. The results show that Naïve Bayes achieved the highest accuracy,
69.65%. This model cannot handle the class imbalance problem.
Carlos et al. [3] developed a student failure prediction model based on machine
learning techniques to address the class imbalance and data dimensionality issues. Ten
classifiers were applied to the dataset. The ICRM classifier achieved the highest
accuracy, 92.7%. Because student characteristics differ at each institutional
level, the performance of the model was not tested at every level of education.
Another educational data mining challenge is to predict student dropouts
from their respective courses [4]. The results show that the support vector machine model
combined with the predictor variables was the most accurate at classifying the
data. The inclusion of one attribute, the grades earned in prerequisite courses, was a
constraint of this study, because a student's understanding of the prerequisites may
have increased during the course of study.
Ajay et al. [5] studied student performance prediction. The study's key contribution
was a new social element termed "CAT", which details how Indians were classified into
four groups in the past based on their social standing and other factors, all of which
have a direct impact on student education. The results indicated that the IBI model
achieved the highest accuracy (82%).
In [6], an improved version of the ID3 model was created to predict student
academic success. The flaw of the ID3 model was its tendency to choose the attribute with
the most values as a node, so the resulting tree was inefficient. The proposed
model overcomes this problem. It generates two output classes (Pass and
Fail). The classifiers J48, wID3 and Naïve Bayes were applied and their results
compared. wID3 achieved the highest accuracy, 93%.
Alaa Khalaf et al. [7] proposed a model that predicts student success in different
courses. This research used three decision tree classifiers: J48, Hoeffding tree and
REPTree. REPTree achieved the maximum accuracy of 91 percent. The model was unable to
handle high data dimensionality and class balancing.
Dech Thammasiri et al. [8] suggested a methodology for early detection of freshmen's
poor academic performance. To overcome the problem of class imbalance, four
classification methods and three balancing methods were used. The combination of support
vector machine and SMOTE achieved the highest overall accuracy, 90.24%.
An early warning system based on learning portfolio data was presented to anticipate
student learning performance during an online course [9]. The results revealed that
techniques that included time-dependent factors were more accurate than those that did
not. This model was not tested in offline mode, where performance could be hampered by
time-dependent properties.
Previously, it was considered that data mining algorithms performed effectively only
with huge data sets, but one study showed that data mining is also appropriate for small
datasets [10]. This study offered a model for predicting student achievement. Three
decision tree approaches (REPTree, J48, M5P) were applied to a small dataset of student
academic data. The results indicate that REPTree obtained the highest accuracy, above
90%. This model does not handle high data dimensionality and class balancing problems.
3. Methodology
The literature review above raises common issues such as class imbalance, data
complexity and classification error. This app proposes a model with the following
phases. Fig. 1 shows the main steps of the proposed model.
The student dataset used in this model was collected from Kaggle [10]. It originates
from the University of California, Irvine (UCI) dataset repository. The dataset
comprises students' final results at the end of a math curriculum, together with many
features that may or may not influence the students' future outcomes.
In data mining, pre-processing is crucial. Its goal is to convert raw data into a
format that mining algorithms can understand. During this phase, the following tasks are
completed.
Data Integration
Data integration is the process of combining data from several sources into a single
repository. When integrating data, redundancy is a common issue. The dataset
contains attributes with redundant values, such as school and age.
Data Cleaning
Missing and noisy data are dealt with in this step in order to ensure data
consistency. There are no missing data or outliers in the dataset used in this investigation.
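A check of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes pandas, uses a small made-up table in place of the real dataset, and flags outliers with the common 1.5 x IQR rule, which the report does not name. The column names age, absences and G3 are assumed for illustration.

```python
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isna().sum()


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy records standing in for the student data (values are made up).
students = pd.DataFrame({
    "age": [15, 16, 17, 16, 18, 17],
    "absences": [0, 2, 4, 6, 2, 4],
    "G3": [10, 12, 14, 11, 13, 15],
})

print(missing_counts(students))
print(iqr_outlier_counts(students))
```

If both series contain only zeros, the table has no missing values and no IQR outliers, which is the situation the report describes for its dataset.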
Discretization
Feature Selection
The goal of feature selection is to choose a subset of features that can accurately
describe the input data while reducing the complexity of the feature space and eliminating
extraneous data. Wrapper-based and filter-based approaches are the two most common
types of feature selection methods. The filter method looks for the smallest number of
relevant features while ignoring the rest. It ranks the features using variable ranking
algorithms, with the highest rated features being picked and applied to the learning
process.
To evaluate the feature ranks, this study used a filter approach with an information
gain-based selection algorithm. It determines which features are most relevant when
building a student performance model. During feature selection, a rank value is assigned
to each feature according to its influence on data classification.
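The ranking step can be illustrated with a small sketch, assuming categorical features and a pass/fail label. The feature names and values below are invented for illustration and are not taken from the project's dataset. Information gain is the label entropy minus the weighted entropy of the label within each feature value.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(table, labels):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(column, labels)
             for name, column in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two categorical features and a pass/fail outcome.
outcome = ["pass", "pass", "fail", "fail", "pass", "fail"]
features = {
    "studytime": ["high", "high", "low", "low", "high", "low"],
    "internet": ["yes", "no", "yes", "no", "yes", "no"],
}

ranked = rank_features(features, outcome)
print(ranked)
```

In this toy table studytime separates the two classes perfectly, so it receives the top rank, while internet carries little information about the outcome.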
• In this model, the student first registers himself/herself, or the college
registers the students.
• The student accepts the terms and conditions, which state that students
must enter real data and that their information will be shared with a third-party app
for more accurate prediction.
• On accepting the terms and conditions, the student takes the self-assessment
test, which includes questions from three domains.
• Personal Questions
Here the student provides health-related information, relationship status,
time spent socialising, addictions and hobbies.
• Psychological Questions
The data is stored in our database and forwarded to our machine
learning model for processing.
Linear regression is applied with the grade as the dependent variable (Y-axis),
modelled as a linear function of all the remaining attributes. The predicted grade
is then used to classify the data.
• Confusion Matrix
The confusion matrix is used to compute the accuracy and recall of the
model's predictions.
The student receives the result along with tips for improving their
academic results.
For our experiments, we used scikit-learn (sklearn). Its train_test_split function splits
our dataset in a 3:1 ratio, so that 75% of the data is used to train our model and 25% is
used to test its accuracy. The process is repeated five times to obtain the final
result, which gives us 78% accuracy.
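The evaluation procedure can be sketched as below. This is an illustration under stated assumptions rather than the project's code: the data is synthetic, and the report does not say how an accuracy figure was derived from regression output, so the sketch assumes that predicted and actual grades are both thresholded at a hypothetical pass mark (PASS_MARK) before being compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

PASS_MARK = 10  # assumed pass threshold on a 0-20 grade scale


def repeated_holdout_accuracy(X, y, n_runs=5, test_size=0.25):
    """Average pass/fail accuracy over several random 75/25 splits."""
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_train, y_train)
        predicted = model.predict(X_test)
        # Turn grades into pass/fail labels before scoring (an assumption).
        scores.append(accuracy_score(y_test >= PASS_MARK,
                                     predicted >= PASS_MARK))
    return float(np.mean(scores))


# Synthetic stand-in for the student attributes and final grade.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 3))
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2]
     + rng.normal(0, 1.0, size=200))

mean_accuracy = repeated_holdout_accuracy(X, y)
print(f"mean accuracy over 5 splits: {mean_accuracy:.2f}")
```

Using a different random_state for each run gives five different 75/25 partitions, so the averaged score depends less on any single split.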
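• CCI (Correctly Classified Instances): the number of instances that have been correctly
classified divided by the total number of instances. It is also known as accuracy.
Formula: (TP+TN)/(TP+FP+TN+FN)
• ICI (Incorrectly Classified Instances): the number of instances that have been
incorrectly classified divided by the total number of instances.
Formula: (FP+FN)/(TP+FP+TN+FN)
• TP Rate (Recall): the number of positive instances correctly classified divided by the
total number of actual positive instances.
Formula: TP rate = TP/(TP+FN)
• F-Measure: calculated from the recall and precision values as twice the product of
precision and recall divided by their sum.
Formula: 2 × Precision × Recall/(Precision + Recall)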
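The formulas above can be checked with a short sketch. The confusion-matrix counts used here are made-up numbers for illustration, not results from the project. Precision, which the F-Measure needs, is taken in its standard form TP/(TP+FP).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the TP rate
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 test instances.
metrics = classification_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, which is a quick sanity check on the counts.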
• We also used the random forest algorithm and studied grey wolf
optimization.
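A random forest run of the kind mentioned here might look like the following sketch. It assumes scikit-learn's RandomForestClassifier with 100 trees, a pass/fail target and synthetic data; none of these settings are stated in the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three numeric attributes, the third is irrelevant.
rng = np.random.default_rng(42)
X = rng.uniform(0, 20, size=(300, 3))
grade = 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 1.0, size=300)
y = (grade >= 10).astype(int)  # 1 = pass, 0 = fail (assumed threshold)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_accuracy = accuracy_score(y_test, forest.predict(X_test))
importances = forest.feature_importances_

print(f"accuracy: {forest_accuracy:.2f}")
print("feature importances:", importances)
```

The feature_importances_ attribute gives a second view of attribute relevance that can be compared with the information gain ranking.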
4.3 Grey Wolf Optimization
Fig4. Relation between the predicted results and the test results.
Fig7. Comparison of different algorithms.
Fig8. User Interface
Machine learning algorithms can help teachers forecast student achievement sooner and
support decisions that improve student performance. They can also extract student
performance criteria across many educational domains to develop a cohesive
taxonomy of student learning outcomes. Furthermore, they can identify students who are
likely to succeed academically as well as students who are at risk of failing, so that
at-risk students receive additional support early. Machine learning algorithms are also
used to investigate student learning behaviour, solve student academic problems, optimise
the educational environment and enable data-driven decision making. Finally, they help
identify the key factors that influence student academic success and the relationships
between them, and find the best methods for resampling and classifying student learning
outcome datasets.
In future, the proposed model will be tested on larger datasets with more
attributes. A logical continuation of this research would be to build a meta-analysis
system on a larger dataset, which could serve as a decision support approach based on the
model that achieves the best efficiency and effectiveness.
Furthermore, hybrid feature selection methods could improve the study by making each
selected feature more meaningful for student performance prediction. Extreme gradient
boosting, an advanced ensemble-based machine learning approach, could also be employed
in this domain.
7. References
[1] Dorina Kabakchieva. (2012). Student Performance Prediction using Data Mining
Classification Algorithms. International Journal of Computer Science and Management
Research, vol. 1.
[2] Edin Osmanbegovic and Mirza Suljic. (2012). Data mining approach for predicting
student performance. Journal of Economics and Business, vol. X, Issue 1.
[3] Carlos Márquez-Vera and Alberto Cano. (2013). Predicting student failure at school
using genetic programming. Appl Intell, vol. 38, pp. 315–330.
[4] Shaobo Huang and Ning Fang. (2013). Predicting student academic performance in an
engineering dynamics course: A comparison of four types of predictive mathematical
models. Computers & Education, vol. 61, pp. 133–145.
https://doi.org/10.1016/j.compedu.2012.08.015
[5] Ajay Kumar Pal and Saurabh Pal. (2013). Data Mining Techniques in EDM for
Predicting the Performance of Students. International Journal of Computer and
Information Technology, vol. 02, Issue 06.
[6] Ramanathan, Saksham Dhanda and Suresh Kumar D. (2013). Predicting Student
Performance using Modified ID3 Algorithm. International Journal of Engineering and
Technology, vol. 5 No 3.
[7] Alaa Khalaf Hamoud. (2016). Selection of Best Decision Tree algorithm for prediction
and classification of student Action. American International Journal of Research in
Science, Technology, Engineering & Mathematics, vol 1, pp. 26-32.
[9] Ya-Han Hu, Chia-Ling L and Sheng-Pao Shih. (2014). Developing early warning
systems to predict students’ online learning performance. Computers in Human
Behavior, 36, pp. 469–478. https://doi.org/10.1016/j.chb.2014.04.002
[10] Srecko Natek and Moti Zwilling. (2014). Student data mining solution–knowledge
management system related to higher education institutions. Expert Systems with
Applications, 41, pp. 6400–6407. https://doi.org/10.1016/j.eswa.2014.04.024
[12] Ansar Siddique, Asiya Jan and Fiaz Majeed. Predicting Academic Performance Using
an Efficient Model Based on Fusion of Classifiers.
Source: https://www.mdpi.com/2076-3417/11/24/11845