Summer Internship-II
Report on
“Predicting Student Grades Using Multinomial Logistic Regression”
Bachelor of Technology
in
Electronics & Computer Engineering (ECM)
by
RATNALA SAI GANESH
21311A1976
B. Tech IV-Year I - Sem
Under the Guidance / Supervision of
Mrs. LATHA MADURI
Assistant Professor
Dept. of ECM
DEPARTMENT OF ELECTRONICS & COMPUTER ENGINEERING
SREENIDHI INSTITUTE OF SCIENCE AND TECHNOLOGY
(AUTONOMOUS)
CERTIFICATE
This is to certify that the Summer Industry Internship entitled “DATASCIENCE MASTER
VIRTUAL INTERNSHIP” being submitted by RATNALA SAI GANESH 21311A1976 in
partial fulfilment for the award of Bachelor of Technology degree in Electronics & Computer
Engineering to Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar,
Telangana, is a report of review work carried out by him/her during the academic year 2024 - 2025
under our guidance and supervision.
DECLARATION
This is to certify that the work reported in the present summer internship project titled
“Predicting Student Grades Using Multinomial Logistic Regression” is a record of work
done by me in the Department of Electronics and Computer Engineering, Sreenidhi
Institute of Science and Technology, Yamnampet, Ghatkesar.
The report is based on the internship done entirely by me and not copied from any other source.
ACKNOWLEDGMENT
I convey my sincere thanks to Dr T. Ch. Siva Reddy, Principal, Sreenidhi Institute of Science and
Technology, Ghatkesar for providing resources to complete this internship.
I am very thankful to Dr D. Mohan, Head of the ECM Department, Sreenidhi Institute of Science
and Technology, Ghatkesar, for initiating this project, for giving valuable and timely
suggestions on my project work, and for his kind cooperation in the completion of the internship.
I convey my sincere thanks to Mrs. Latha Maduri, Assistant Professor, ECM Department, and all
the faculty members of the ECM department, Sreenidhi Institute of Science and Technology, for their
continuous help, cooperation, and support in completing this internship.
Finally, I extend my gratitude to the Almighty, my parents, all my friends, and the teaching and
non-teaching staff, who directly or indirectly helped me in this endeavour.
ABSTRACT
Predicting student academic performance is crucial for identifying at-risk students and designing targeted
interventions to enhance educational outcomes. This research explores the application of Multinomial
Logistic Regression (MLR) to predict student grades across multiple categories. Leveraging historical
academic data, MLR provides a probabilistic framework to model the relationship between categorical
outcomes (grades) and predictor variables, such as demographic information, attendance records,
socio-economic background, and prior academic achievements. The study evaluates the effectiveness of MLR in
handling multi-class categorical grade outcomes while emphasizing its interpretability in
educational contexts.
The analysis highlights key predictors of academic performance, such as parental education level, study
habits, and teacher-student interactions. The model achieves robust classification accuracy and demonstrates
scalability to larger datasets, making it suitable for integration into institutional learning management
systems. Comparative performance metrics against alternative machine learning models underscore the
simplicity and efficiency of MLR for grade prediction.
The findings suggest that MLR can serve as a valuable decision-support tool for educators and
administrators, enabling data-driven strategies to improve learning experiences. Future work will explore
incorporating more dynamic variables, such as emotional well-being and co-curricular participation, to
enhance predictive accuracy.
INDEX
1 INTRODUCTION
2 EXISTING SYSTEM
3 PROPOSED SYSTEM
5 EXPLANATION
6 ADVANTAGES
7 DISADVANTAGES
8 RESULTS
9 CONCLUSION AND FUTURE SCOPE
10 BIBLIOGRAPHY
LIST OF FIGURES
1. INTRODUCTION
The prediction of student academic performance has become a pivotal area of research in the domain of
educational data mining and analytics. Educational institutions increasingly rely on data-driven insights to
identify at-risk students, optimize learning strategies, and enhance overall academic outcomes. Among
various statistical and machine learning methods, Multinomial Logistic Regression (MLR) stands out as an
effective and interpretable approach for predicting categorical outcomes, such as grades.
MLR is particularly well-suited for predicting academic performance since it handles multi-class
classification problems where the dependent variable has more than two categories. Unlike binary logistic
regression, MLR models the probabilities of multiple outcomes simultaneously, offering a comprehensive
understanding of how predictor variables influence different grade categories. Moreover, its probabilistic
nature allows stakeholders to interpret the model outcomes with clarity, making it a preferred choice in
educational settings.
This study focuses on applying MLR to predict student grades using historical academic and demographic
data. By analyzing these predictors, the research aims to highlight the effectiveness of MLR in educational
applications and demonstrate how it can facilitate proactive decision-making. Additionally, the study
compares MLR with other predictive models to underscore its simplicity, efficiency, and accuracy in multi-
class classification problems [1].
2. EXISTING SYSTEM
The prediction of student grades has traditionally relied on rule-based approaches and heuristic methods,
which often lack scalability and accuracy. These systems typically utilize fixed thresholds for assessing
performance indicators, such as attendance rates or test scores, to categorize students into predefined groups
(e.g., pass/fail). While easy to implement, such systems fail to account for the complexity and non-linearity
inherent in real-world academic data.
Recent advancements in educational data mining and machine learning have introduced sophisticated
methods for predicting student performance. These methods include decision trees, support vector machines
(SVM), and ensemble learning algorithms like random forests and gradient boosting. Although these
approaches provide higher predictive accuracy, they often lack interpretability, making it challenging for
educators to extract actionable insights.
Multinomial Logistic Regression (MLR) offers a balanced alternative, combining predictive accuracy with
ease of interpretation. Unlike traditional linear regression models, MLR handles categorical dependent
variables with multiple classes (e.g., grades such as A, B, and C). It assigns probabilities to each class based
on predictor variables such as demographics, socio-economic status, prior academic performance, and
behavioral traits. Additionally, MLR models are computationally efficient and can be easily integrated into
existing educational frameworks.
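For reference, the standard multinomial logit formulation behind this probability assignment can be written as below. The notation is introduced here for illustration and does not appear in the original report: K is the number of grade classes, x is the vector of p predictors, β_k0 and β_k are the intercept and coefficient vector for class k, and class K is taken as the reference category.

```latex
P(y = k \mid \mathbf{x})
  = \frac{\exp\!\left(\beta_{k0} + \boldsymbol{\beta}_k^{\top}\mathbf{x}\right)}
         {1 + \sum_{j=1}^{K-1} \exp\!\left(\beta_{j0} + \boldsymbol{\beta}_j^{\top}\mathbf{x}\right)},
  \quad k = 1, \dots, K-1,
\qquad
P(y = K \mid \mathbf{x})
  = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\left(\beta_{j0} + \boldsymbol{\beta}_j^{\top}\mathbf{x}\right)}

\log \frac{P(y = k \mid \mathbf{x})}{P(y = K \mid \mathbf{x})}
  = \beta_{k0} + \boldsymbol{\beta}_k^{\top}\mathbf{x}
```

The second line shows that the log-odds of each class against the reference class are linear in the predictors, which is what makes the coefficients directly interpretable. The model has (K - 1)(p + 1) coefficients in total.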
Existing systems leveraging MLR and similar approaches often incorporate data visualization tools to
provide educators with actionable dashboards. However, limitations persist, such as insufficient
incorporation of real-time data (e.g., emotional well-being or extracurricular activities) and difficulties in
handling missing or inconsistent data. Despite these challenges, MLR remains a valuable tool for academic
performance prediction due to its balance of simplicity and effectiveness. [2]
3. PROPOSED SYSTEM
The proposed system addresses the limitations of existing systems by implementing a robust framework
for predicting student grades using Multinomial Logistic Regression (MLR), enhanced with efficient data
processing and real-time prediction capabilities. This system focuses on improving accuracy, scalability,
and interpretability, ensuring it meets the needs of educational institutions.
1. Data Collection: Raw data, including academic records, attendance, socio-economic details, and
behavioral patterns, is stored in a centralized database.
2. Data Preprocessing: A Data Preprocessing Module cleanses and standardizes the data to handle
missing values, inconsistencies, and outliers. Feature scaling and encoding techniques are applied to
make the data suitable for MLR.
3. Model Training: The Model Training Module employs MLR as the primary algorithm for
multi-class classification, categorizing grades into predefined levels (e.g., A, B, C). The model is trained
iteratively, optimizing for accuracy and computational efficiency. Comparative models (e.g., SVM,
Random Forest) are also tested to benchmark MLR for this application.
4. Classification and Prediction: Upon receiving a Student Request via the Live Server, the system
uses the trained MLR model to predict the student’s grade. The prediction probabilities are also
provided, allowing educators to assess the likelihood of each grade category.
5. User Interface: An intuitive User Interface enables students, teachers, and administrators to access
the system. Predicted grades are presented alongside visual insights, such as factor importance and
recommendations for improvement.
This proposed system not only improves prediction accuracy but also empowers stakeholders to take
proactive steps for student performance enhancement. [3]
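The preprocessing, training, and prediction steps listed above can be sketched in R roughly as follows. This is a minimal illustration, not the system's actual implementation: the `records` data frame, its column names and values, the `impute_median` helper, and the choice of median imputation are all assumptions made for the example.

```r
library(nnet)  # provides multinom()

# Hypothetical raw records; column names and values are illustrative only.
records <- data.frame(
  attendance  = c(92, 75, NA, 60, 88, 55, 97, 70, 81, 64, 90, 58),
  prior_score = c(88, 71, 65, 48, 90, 52, 93, 68, 74, 57, 59, 80),
  grade       = c("A", "B", "C", "D", "A", "D", "A", "C", "B", "C", "B", "B")
)

# Preprocessing: impute missing numeric values with the column median
# (an assumed strategy), then standardize each predictor.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
features <- c("attendance", "prior_score")
records[features] <- lapply(records[features], impute_median)
records[features] <- lapply(records[features], function(x) as.numeric(scale(x)))

# Encoding: store the grade as a factor with an explicit level order.
records$grade <- factor(records$grade, levels = c("D", "C", "B", "A"))

# Training: fit the multinomial model (trace = FALSE silences iteration logs).
fit <- multinom(grade ~ attendance + prior_score, data = records, trace = FALSE)

# Prediction: one row of class probabilities per student.
probs <- predict(fit, newdata = records, type = "probs")
print(round(probs, 3))
```

Each row of `probs` sums to 1, so an educator can read it as the estimated likelihood of each grade category for that student.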
# 2. Scatter plot of Marks vs Predicted Grade
ggplot(students_data, aes(x = Marks, y = Grade, color = Grade)) +
  geom_point(size = 4) +
  labs(title = "Marks vs Predicted Grade", x = "Marks", y = "Predicted Grade") +
  theme_minimal()
5. EXPLANATION
o Step 1: Load Libraries
- The script begins by loading the required libraries. `ggplot2` is imported for creating
visualizations, while the `nnet` package is used for training a multinomial logistic regression
model. These libraries provide the tools to handle data analysis and visualization seamlessly.
The `set.seed(42)` function is used to ensure reproducibility of results during data simulation.
o Step 2: Simulate Data
- A synthetic dataset is created to mimic student academic performance. The dataset includes
four key features: Marks, Attendance, Study_Time, and Syllabus_Covered, representing
different factors influencing grades. A new column, Grade, is generated by categorizing
the marks:
'A' for marks ≥ 85,
'B' for 70 ≤ marks < 85,
'C' for 50 ≤ marks < 70,
'D' for marks < 50.
- The `factor()` function stores the grades as a categorical variable with a fixed level order
(D, C, B, A); `multinom()` uses the first level as its reference category.
o Step 3: Train a Multinomial Logistic Regression Model
- The multinomial logistic regression model is built using the `multinom()` function, where the
Grade column is the dependent variable, and Marks, Attendance, Study_Time, and
Syllabus_Covered are the independent variables. This step enables the model to learn the
relationship between student performance metrics and their corresponding grades.
o Step 4: Make Predictions
- Using the trained model, predictions are made on the same dataset. The `predict()` function is
utilized to generate grade predictions for each student, which are then printed for review.
These predictions represent the system's assessment of each student's likely grade based on
the input features.
o Step 5: Visualizations
- Three visualizations are created to analyze and interpret the results:
Bar Plot: Displays the distribution of predicted grades, highlighting the number of
students in each grade category.
Scatter Plot: Shows the relationship between Marks and Predicted Grade,
providing insights into how marks influence the assigned grade.
Line Plot: Examines the relationship between Syllabus_Covered and Study_Time,
emphasizing trends in how study efforts align with syllabus completion. [4]
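Steps 1 to 4 above can be sketched as the following R script. It is a reconstruction from the description, not the original script: the sampling ranges for the simulated features and the `Predicted_Grade` column name are assumptions, and the plots from Step 5 are omitted so that only the `nnet` package is needed.

```r
library(nnet)   # multinom() for multinomial logistic regression

set.seed(42)    # reproducible simulation, as in Step 1

# Step 2: simulate 10 students. The ranges below are assumptions;
# the report does not state how the values were generated.
n <- 10
students_data <- data.frame(
  Marks            = round(runif(n, 35, 100)),
  Attendance       = round(runif(n, 50, 100)),
  Study_Time       = round(runif(n, 1, 6), 1),
  Syllabus_Covered = round(runif(n, 40, 100))
)

# Assign grades with the thresholds from Step 2:
# D: < 50, C: 50 to < 70, B: 70 to < 85, A: >= 85.
# right = FALSE makes each interval closed on the left, i.e. [a, b).
students_data$Grade <- cut(
  students_data$Marks,
  breaks = c(-Inf, 50, 70, 85, Inf),
  labels = c("D", "C", "B", "A"),
  right  = FALSE
)

# Step 3: fit the model. Grade is derived directly from Marks, so the classes
# are perfectly separable and the optimizer may stop at its iteration limit.
model <- multinom(
  Grade ~ Marks + Attendance + Study_Time + Syllabus_Covered,
  data = students_data, trace = FALSE
)

# Step 4: predict on the same data and print the result.
students_data$Predicted_Grade <- predict(model, newdata = students_data)
print(students_data$Predicted_Grade)
```

Because the grade is a deterministic function of Marks and the model is evaluated on its own training data, agreement between Grade and Predicted_Grade in this sketch says little about performance on real student records.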
6. ADVANTAGES OF PROPOSED SYSTEM
This system is a step forward in modernizing educational analytics, offering a practical, interpretable, and
scalable solution for academic performance prediction.
7. DISADVANTAGES OF PROPOSED SYSTEM
1. Dependence on Data Quality: The system's accuracy is heavily reliant on the quality of input data.
Missing, inconsistent, or biased data can lead to inaccurate predictions, limiting its effectiveness.
2. Limited Handling of Real-Time Behavioral Data: While the system uses historical data
effectively, it may not fully incorporate real-time behavioral factors such as emotional well-being or
sudden changes in academic performance, which can significantly influence grades.
3. Complexity in Feature Selection: Identifying and selecting the most relevant features for
Multinomial Logistic Regression can be challenging. Irrelevant or redundant features may affect the
model’s performance and interpretation.
4. Difficulty with High-Dimensional Data: Although MLR is efficient for small to medium-sized
datasets, its performance may degrade on high-dimensional data, because the number of coefficients
to estimate grows with both the number of predictors and the number of grade classes.
5. Limited Generalizability: The system is trained on specific datasets, which may limit its
applicability to institutions with vastly different academic structures, grading systems, or student
demographics.
6. Overfitting Risk: The model may overfit the training data, especially if it is not regularized
appropriately. This can reduce its ability to generalize to new, unseen data.
7. Static Nature of Models: Once trained, the model requires periodic updates with new data to
maintain accuracy. This manual retraining process can be time-consuming.
8. Inadequate for Complex Relationships: MLR assumes a linear relationship between predictor
variables and the log-odds of outcomes, which may not capture complex, non-linear dependencies in
student performance data.
9. Lack of Comprehensive Stakeholder Inputs: While predictions are useful, the system may lack
features to incorporate qualitative insights from educators, such as classroom observations or teacher
evaluations.
10. Privacy and Ethical Concerns: Collecting and processing sensitive student data, such as socio-
economic background or behavioral records, raises privacy and ethical issues. Robust security
measures are required to ensure data confidentiality.
Despite these limitations, the system’s design and implementation can be refined to overcome most
challenges, making it a reliable tool for academic performance prediction.
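Two of the limitations above, overfitting and evaluation on unseen data, can be partly addressed with a held-out test set and a weight-decay penalty. The sketch below is illustrative only: the simulated data, the 70/30 split, and the decay value of 0.01 are assumptions, and `decay` is passed through `multinom()` to the underlying `nnet()` fit.

```r
library(nnet)

set.seed(1)  # reproducible split; the seed value is arbitrary

# Hypothetical data: 200 students, two predictors, noisy grades.
n <- 200
df <- data.frame(
  Attendance = runif(n, 50, 100),
  Study_Time = runif(n, 1, 6)
)
score <- 0.5 * df$Attendance + 8 * df$Study_Time + rnorm(n, sd = 8)
df$Grade <- cut(score, breaks = quantile(score, c(0, 0.25, 0.5, 0.75, 1)),
                labels = c("D", "C", "B", "A"), include.lowest = TRUE)

# Hold out 30% of the rows to estimate performance on unseen students.
test_idx <- sample(n, size = round(0.3 * n))
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# decay adds an L2 (weight decay) penalty; 0.01 is an illustrative value.
fit <- multinom(Grade ~ Attendance + Study_Time, data = train,
                decay = 0.01, trace = FALSE)

# Compare as character vectors so the check does not depend on factor levels.
accuracy <- function(model, data) {
  mean(as.character(predict(model, newdata = data)) == as.character(data$Grade))
}
train_acc <- accuracy(fit, train)
test_acc  <- accuracy(fit, test)
cat("train accuracy:", train_acc, " test accuracy:", test_acc, "\n")
```

A training accuracy that is much higher than the test accuracy would suggest overfitting; in that case a larger decay value or fewer predictors could be tried.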
8. RESULTS & OUTPUTS
The dataset for the 10 students, after assigning grades based on the marks, looks like this:
Predicted Grades:
[1] B A C D A B A C B C
Levels: D C B A
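The counts behind the predicted grade distribution can be recomputed from the printed output with base R. The `predicted` vector below is copied from the output above. The `evaluate` helper is a hypothetical addition showing how a confusion matrix and accuracy could be obtained once the actual grades are available; it is not part of the original script.

```r
# Predicted grades copied from the output above.
predicted <- factor(
  c("B", "A", "C", "D", "A", "B", "A", "C", "B", "C"),
  levels = c("D", "C", "B", "A")
)

# Count of students per predicted grade (the quantity a bar plot would show).
grade_counts <- table(predicted)
print(grade_counts)

# Share of students per predicted grade.
print(prop.table(grade_counts))

# Hypothetical helper: both arguments must be factors with the same levels,
# so that the table is square and its diagonal holds the correct predictions.
evaluate <- function(actual, predicted) {
  cm <- table(Actual = actual, Predicted = predicted)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}
```

For the ten predictions shown, this gives one student in grade D and three each in grades C, B, and A.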
Fig: 8.2. Predicted Grade Distribution
9. CONCLUSION AND FUTURE SCOPE
9.1 CONCLUSION
In conclusion, data science is a powerful tool that is revolutionizing the way businesses operate
and make decisions. By uncovering transformative patterns, driving innovation, and enabling
real-time optimization, data science provides businesses with the insights needed to stay
competitive and achieve growth. It enhances decision-making through data-driven insights and
allows for personalized customer experiences, fostering greater satisfaction and loyalty. Whether
it’s improving operational efficiency, predicting trends, or creating new products, data science
has become an essential strategy for businesses across industries. As data continues to grow in
importance, companies that leverage data science effectively will be better equipped to adapt to
change, solve complex problems, and unlock new opportunities. Embracing data science not only
helps businesses thrive in today’s fast-paced environment but also ensures they are prepared for
future challenges.
9.2 FUTURE SCOPE
The future scope of data science is vast and continues to evolve as technology advances. Some
key areas where data science is expected to have a significant impact in the coming years include:
1. Artificial Intelligence (AI) and Machine Learning (ML) Integration: Data science will
continue to integrate with AI and ML, enabling smarter algorithms, automated decision-
making, and predictive analytics. Businesses will increasingly rely on these technologies to
enhance automation, optimize operations, and improve customer experiences.
2. Big Data and Real-Time Analytics: With the growing amount of data generated daily, the
ability to process and analyze large datasets in real time will become even more critical. Data
science will play a pivotal role in making sense of big data and providing actionable insights
instantaneously.
3. Natural Language Processing (NLP): As NLP technology advances, businesses will be able
to analyze and interpret vast amounts of text data more effectively. This includes applications
like sentiment analysis, chatbots, and automated content creation, improving customer
engagement and operational efficiency.
10. BIBLIOGRAPHY
[1] Provost, F., & Fawcett, T. (2013). Data Science and its Relationship to Big Data and Data-
https://doi.org/10.1089/big.2013.1508
[2] Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
https://doi.org/10.1145/2500499
[3] Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: a
revolution that will transform supply chain design and management. Journal of Business
https://doi.org/10.1111/jbl.12010
[4] Agarwal, R., & Dhar, V. (2014). Editorial—Big Data, Data Science, and Analytics: The
Opportunity and Challenge for IS Research. Information Systems Research, 25(3), 443–448.
https://doi.org/10.1287/isre.2014.0546
https://books.google.co.in/books?hl=en&lr=&id=TiLEEAAAQBAJ&oi=fnd&pg=PT9&dq=data+science&ots=ZJr_gewVoO&sig=NT0XhXWI570DzWXu-LXRGZVfLYQ&redir_esc=y#v=onepage&q=data%20science&f=false