Final Internship Report GAP
Abstract
With the growing number of graduates who wish to pursue higher education, it has become increasingly difficult for students to secure admission to their dream university. Newly graduated students are often unfamiliar with the requirements and procedures of postgraduate admission and may spend a considerable amount of money on consultancy organisations to help them identify their admission chances. Given the limited number of universities that a human consultant can consider, however, this approach can be biased and inaccurate. Higher education abroad offers many options, such as Canada, the USA, the UK, Germany, Italy, and Australia, but we focus only on students who want to pursue their Masters in America. Students who want to do a Masters in America have to take the GRE (Graduate Record Examinations) and TOEFL (Test of English as a Foreign Language). Once they have taken the exams, they have to prepare their SOP (statement of purpose) and LOR (letters of recommendation), which are among the crucial factors they have to consider; the LOR and SOP play a vital role if the student is seeking a scholarship. Prospective graduate students always face a dilemma deciding which universities to apply to for a master's program. While there are a good number of predictors and consultancies that guide a student, they are not always reliable, since their decisions are based on a select set of past admissions. With the increasing demand for further education, students must still choose the universities they want to apply to, since applying to every university would incur a large amount of application fees. Here arises the problem: a student does not know which university might admit them. There are some online blogs that help in these matters, but they are not very accurate and do not consider all the factors, and there are consultancy offices that take a lot of money and time and sometimes provide false information. Our goal is therefore to develop a model that tells students their chance of admission to a given university. This model should consider all the crucial factors that play a vital role in the student admission process and should have high accuracy.
Table of Contents
Abstract ................................................................................................................................... i
CHAPTER 1 .............................................................................................................................. 1
INTRODUCTION ...................................................................................................................... 1
CHAPTER 2 .............................................................................................................................. 6
LITERATURE SURVEY............................................................................................................ 6
CHAPTER 3 ............................................................................................................................ 12
SOFTWARE REQUIREMENTS SPECIFICATION ................................................................. 12
CHAPTER 4 ............................................................................................................................ 18
DESIGN ................................................................................................................................... 18
4.1 Data Flow ....................................................................................................................... 18
CHAPTER 5 ............................................................................................................................ 22
IMPLEMENTATION ............................................................................................................... 22
5.2 Snapshots........................................................................................................................ 25
CHAPTER 6 ............................................................................................................................ 32
CONCLUSION AND FUTURE WORK .................................................................................. 32
BIBLIOGRAPHY .................................................................................................................... 34
List of Figures
Fig 3.1: Predicting the chance of admission .............................................................................. 15
Fig 4.1: Data Flow Diagram ..................................................................................................... 18
Fig 4.2: Sequence Diagram....................................................................................................... 19
Fig 5.1: Initial Exploration of Dataset ....................................................................................... 26
Fig 5.2: Data Preparation and Model Training .......................................................................... 27
Fig 5.3: Visualization of Feature Correlations ........................................................................... 28
Fig 5.4: Distribution of Admission Factors ............................................................................... 29
Fig 5.5: Frequency Distribution of Admission Probabilities ...................................................... 30
Fig 5.6: Scatter Plot of GRE Score and Chance of Admit .......................................................... 30
List of Tables
Graduate Admission Prediction using ML Techniques 1
CHAPTER 1
INTRODUCTION
For anyone pursuing postgraduate studies, it is difficult to find out which colleges they may join based on their GPA, Quants, Verbal, TOEFL, and AWA scores. People may apply to many universities that look for candidates with a higher score set, instead of applying to universities they have a chance of getting into. This can be detrimental to their future. It is very important that candidates apply to colleges they have a good chance of getting into, instead of colleges they may never get into. Yet there are not many efficient ways to find out, relatively quickly, which colleges one can get into.
The Education Based Prediction System helps a person decide which colleges they can apply to with their scores. The dataset used for processing consists of the following parameters: university name, Quants and Verbal scores (GRE), and TOEFL and AWA scores. The
GRE Test (Graduate Record Examinations) is a standardized test used by many universities
and graduate schools around the world as part of the graduate admissions process. Other
factors are also taken into consideration while applying to colleges, such as Letter of
Recommendation (a formal document that validates someone's work, skills or academic
performance), Statement of Purpose (a critical piece of a graduate school application that tells
admissions committees who you are, what your academic and professional interests are, and
how you'll add value to the graduate program you're applying to), Co-curricular activities and
Research papers as well (research papers from journals that are not well known or have a high
percentage of plagiarism are not taken into consideration for this case). When a person has
completed their undergraduate degree and wants to pursue a Postgraduate degree in a field of
their choice, more often than not, it is very confusing for the person to figure out what colleges
they should apply to with the scores that they have obtained in GRE and TOEFL, along with
their GPA at the time of their graduation. Many candidates apply to colleges whose score requirements they do not meet and hence waste a lot of time; applying to many colleges also increases the cost. Since there are not many efficient methods available to help address this issue, an Education Predictor System has been developed.
In the system proposed, a person can enter their scores in the respective fields provided. The
system then processes the data entered and produces an output of the list of colleges that a
person could get into, with their scores. This is relatively quick and helps conserve time and
money. In order to achieve this, we propose a novel method utilising machine learning algorithms. To maximize the accuracy of our model, we take into consideration not one but several machine learning algorithms: Linear Regression, Gradient Boosting, and Random Forest. More about these algorithms is covered in the Algorithms section of this report. The algorithms are then compared, and the one with the best key performance indicators is used to develop the Prediction System. We also plan to incorporate clustering of universities based on a profile and then classifying them by likelihood of acceptance (e.g., less likely, highly likely).
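As a sketch of this comparison step, the three algorithms can be trained and scored side by side with scikit-learn. The admission dataset is replaced here by synthetic stand-in data, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))                                 # stand-in features (e.g. GRE, TOEFL, CGPA)
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.02, 500)   # stand-in chance of admit

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))     # the KPI being compared

best = max(scores, key=scores.get)   # the best-scoring model would be kept
print(best, round(scores[best], 3))
```

On the real dataset, the same loop selects whichever of the three algorithms performs best on the held-out data.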
It has always been a troublesome process for students to find the perfect university and course for their further studies. At times they do know which stream they want to get into, but it is not easy for them to find colleges based on their academic marks and other achievements. We aim to develop a system that gives a probabilistic output of how likely a student is to get into a university, given their details.
1.3 Objectives
The objective of graduate admission prediction is to bring students closer to their university of choice through a robust evaluation of their profiles. This report considers parameters that are all relevant for graduate admissions. Barring a few exceptional cases in which a student may unexpectedly fetch an admit from a top school, most of the results are as expected and give a fair idea of the selection criteria.
i. To contribute a valuable tool for both prospective students and educational institutions,
offering insights into the intricate dynamics of the admission process.
ii. To demonstrate how data-driven methodologies can improve decision-making for
applicants and institutions in competitive admission scenarios.
iii. To aid universities in efficiently filtering and evaluating applications, thereby optimizing
the admissions process.
iv. To compare various machine learning algorithms, evaluate their accuracy, and select the
most effective model for graduate admission prediction.
1.4 Scope
i. Applicability to Prospective Students: The project is designed to assist graduate school
applicants in assessing their chances of admission to specific programs based on their
academic and non-academic profiles. This insight can help them make informed decisions
regarding applications and improve their preparation strategies.
ii. Utility for Educational Institutions: The predictive system can be used by universities
and colleges to streamline the application evaluation process. It can act as a supportive
tool to pre-screen applications, thereby saving time and resources.
iii. Data-Driven Decision Making: The project emphasizes the use of historical data and
machine learning algorithms to derive meaningful patterns and predictions, ensuring
objective and consistent evaluation.
iv. Versatility Across Fields: The model can be adapted to various graduate programs,
including technical, management, and research-oriented courses, provided the relevant data
is available.
Organization of report
This chapter introduces the project, its basic functionality and implementation, and the basic terminology used in its working, followed by the company profile, the company's operations, and its collaborations. Chapter 1 introduces the purpose of the project. Chapter 2 discusses the literature survey of the existing system and how the proposed system overcomes its limitations. Chapter 3 details the system requirements, i.e., the functional, user, and non-functional requirements of the system. Chapter 4 describes the design of the architecture and its Data Flow Diagram. Chapter 5 presents the code implemented to build the interface, along with screenshots. The Conclusion and Future Work chapter discusses how the proposed system can be further extended.
CHAPTER 2
LITERATURE SURVEY
Many aspiring graduate students, after completing their undergraduate studies, prepare for the next stage: a master's degree. Many of them wonder about the basic requirements for admission to universities, and about the universities where they could be admitted based on their credentials [1].
The literature contains several studies that perform statistical analyses on admissions decisions.
For example, the authors in [2] present an expert system, called PASS, in which Logistic Regression is used to predict the potential of high school students in Greece to pass the national exam for entering higher education institutes. The authors in [3] used predictive modeling to assess admission policies and standards based on features like GPA, ACT score, residency, race, etc. Limitations of this research include not taking into consideration other important factors such as past work experience, technical papers of the students, etc.
The authors in [4] used data mining and ML techniques to analyze the current admission scenario by predicting the enrolment behavior of students. They used the Apriori technique to analyze the behavior of students seeking admission to a particular college, and the Naïve Bayes algorithm to help students choose a course and guide them through the admission procedure. In their project, they conducted a test for students seeking admission and then, based on performance, suggested a course branch using the Naïve Bayes algorithm. However, human intervention was required to make the final decision on the status.
Acharya et al. [1] proposed a comparative approach by developing four machine learning regression models: linear regression, support vector machine, decision tree, and random forest for predictive analytics of graduate admission chances. They computed error functions for the developed models and compared their performance to select the best one; among these, linear regression performed best, with an R2 score of 0.72. Janani P et al. [2] developed a project that uses a machine learning technique, specifically a decision tree algorithm, based on test attributes like GRE, TOEFL, CGPA, research papers, etc. According to these scores, the chance of admission is calculated; the developed model has 93% accuracy. Navoneel Chakrabarty et al. [3] proposed a comparison of different regression models. The developed models are a gradient boosting regressor and a linear regression model. The gradient boosting regressor achieved a score of 0.84, surpassing the performance of the linear regression model. They also computed other performance error metrics, such as mean absolute error, mean squared error, and root mean squared error. Chithra Apoorva et al. [4] proposed different machine learning algorithms for predicting the chances of admission. The models are Linear Regression, Ridge Regression, and Random Forest, trained on features that have a high impact on the probability of admission. Among the generated models, the linear regression model has 79% accuracy.
One study found that applicants with three or more strong letters of recommendation had slightly higher admission scores [9].
One large-scale study of LORs performed a meta-analysis of previously published research
spanning undergraduate and graduate education [10]. Unlike our study, the goal was not to use
LORs as predictors of past admission decisions, but rather as predictors of future performance.
The study found that LORs have low, but positive, correlations with standardized test scores
and moderate (i.e., 0.26) correlations with prior grades. LORs also had low but positive
correlations with future performance, including a correlation of 0.10 with research
productivity, 0.28 with undergraduate GPA, 0.13 with graduate GPA, and 0.19 with completion
of a doctoral degree. An analysis of the incremental validity of LORs over the other predictors
like test scores and past GPA showed that the LORs do not substantially help with predicting
future graduate GPA but do help with predicting graduate degree completion. As degree
completion is one of our key goals when making admissions decisions, this is quite notable.
None of the prior studies described thus far utilized NLP techniques to characterize the LORs,
indicating a significant gap in research in this field. Waters and Miikkulainen presented an
admission-decision study, where they applied NLP techniques to LORs using a statistical
machine learning approach to facilitate large-scale Ph.D. admissions [6]. Their system
incorporated numerical, categorical, and textual data, with the LOR text transformed into a 50-
dimensional feature vector using a bag-of-words representation (i.e., word order is not
considered) and Latent Semantic Analysis [11] techniques. The study found that LORs
containing words such as “best,” “award,” “research,” “PhD,” etc., were predictive of
admission, while letters containing words like “good,” “class,” “programming,” “technology,”
etc., were indicative of rejection. According to the authors, this pattern reflects the faculty’s
preference for candidates with strong research potential. The use of NLP in this study was
relatively straightforward as the focus was on specific words.
Several studies utilized more advanced NLP techniques on LORs, but these studies were
specifically in the context of investigating gender and racial bias in LORs. These bias-related
studies used NLP software to assess the linguistic characteristics of the LORs, including those
related to emotional content (e.g., sadness, excitement). The majority of these studies focused
on graduate medical programs [12–15], while a few studies considered graduate STEM
disciplines [16, 17] and one focused on undergraduate admissions [18]. The study on
undergraduate admissions [18] for the University of California at Berkeley showed that LORs
written for students in underrepresented racial groups were weaker than those for other
students. However, this study also assessed the impact of LORs on admission decisions and
showed that even though the LORs were weaker for these underrepresented groups, the
inclusion of LORs nonetheless improved the admission outcomes for these students.
Users can learn to operate the system within one hour, so the system can easily be accepted by any kind of end-user. Hence the proposed system is technically feasible.
2.3.2 Python
Python is a multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of its features support functional programming and aspect-oriented programming (including metaprogramming and metaobjects). Many other paradigms are supported via extensions, including design by contract and logic programming. Python is also known as a glue language, able to interoperate with many other languages with ease.
2.3.3 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data," an econometrics term for multidimensional data.
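A minimal illustration of the Pandas data structures described above; the column names mirror the admission dataset, but the values are made up:

```python
import pandas as pd

# a DataFrame holds tabular data with labeled columns
df = pd.DataFrame({
    "GRE Score": [320, 310, 330],
    "TOEFL Score": [110, 103, 115],
    "CGPA": [8.7, 8.0, 9.2],
})
print(df.describe())        # quick summary statistics per column
print(df["CGPA"].mean())    # a single column is a Series with vectorised methods
```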
2.3.4 Numpy
Numpy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. Besides its obvious scientific uses, Numpy can
also be used as an efficient multi-dimensional container of generic data.
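A short sketch of NumPy's array operations, here performing the kind of min-max normalisation applied to the dataset before training; the scores are invented:

```python
import numpy as np

# rows are students, columns are (GRE, TOEFL); values are made up
scores = np.array([[320, 110], [310, 103], [330, 115]], dtype=float)

# min-max normalisation per column, done in one vectorised expression
normalised = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))
print(normalised)   # every column now lies in [0, 1]
```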
2.3.5 Matplotlib
Matplotlib is a Python library for creating static, interactive, and animated visualizations. It
provides a flexible interface for plotting a wide range of charts, such as line plots, bar charts,
histograms, and scatter plots. With extensive customization options, it is widely used for data
visualization in scientific computing and analytics.
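A small Matplotlib sketch of the kind of scatter plot used later (GRE score vs. chance of admit); the values are illustrative, and the headless Agg backend is used so no display is required:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend; no display needed
import matplotlib.pyplot as plt

gre = [300, 310, 320, 330, 340]
chance = [0.55, 0.65, 0.75, 0.85, 0.92]   # illustrative values only

fig, ax = plt.subplots()
ax.scatter(gre, chance)
ax.set_xlabel("GRE Score")
ax.set_ylabel("Chance of Admit")
fig.savefig("gre_vs_admit.png")  # write the figure to an image file
```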
2.3.6 Seaborn
Seaborn is a Python library for creating visually appealing statistical graphics, built on
Matplotlib. It simplifies complex plots like scatter plots, heatmaps, and box plots with concise
code. Seamlessly integrating with Pandas, it supports direct data visualization and offers
customizable themes and color palettes, making it ideal for data analysis and presentation.
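A Seaborn sketch of the correlation heatmap used in the analysis, drawn from a tiny made-up DataFrame:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# invented values standing in for the admission dataset
df = pd.DataFrame({"GRE Score": [300, 320, 340],
                   "CGPA": [7.5, 8.5, 9.5],
                   "Chance of Admit": [0.5, 0.75, 0.9]})

# one call renders an annotated correlation heatmap from the DataFrame
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.tight_layout()
```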
2.3.7 Scikit-learn
Scikit-learn is an open-source Python machine-learning library that implements a range of machine learning, pre-processing, cross-validation, and visualization algorithms through a unified interface. It provides a plethora of tools for various machine-learning tasks such as classification, regression, and clustering.
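The unified interface mentioned above amounts to fit/predict on every estimator; this toy regression shows it with invented (GRE, CGPA) pairs and chances:

```python
from sklearn.linear_model import LinearRegression

X = [[300, 7.0], [320, 8.0], [340, 9.0]]   # made-up (GRE, CGPA) pairs
y = [0.5, 0.7, 0.9]                        # made-up admission chances

# every scikit-learn estimator exposes the same fit/predict pattern
model = LinearRegression().fit(X, y)
pred = model.predict([[330, 8.5]])[0]
print(round(pred, 2))
```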
2.3.8 Flask
Flask is a lightweight and flexible Python web framework used to build web applications. It
is known for its simplicity, scalability, and ease of use, making it ideal for beginners and small
to medium-sized projects. Flask allows developers to create routes, handle requests, and
integrate templates with minimal overhead.
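A minimal Flask sketch of the kind of prediction route such a backend might expose. The route name and the placeholder scoring formula are assumptions, not the project's actual code; a real app would load the trained model instead:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    # placeholder score in place of a real model prediction
    chance = min(1.0, data["cgpa"] / 10.0)
    return jsonify({"chance_of_admit": chance})

client = app.test_client()            # exercise the route without running a server
resp = client.post("/predict", json={"cgpa": 8.5})
print(resp.get_json())
```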
Summary
This chapter includes the literature survey and the references used by the different authors, the existing system, and the tools and technologies used for running the project.
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION
The Software Requirement Specification (SRS) for "Predicting Graduate Admissions Using
Machine Learning" outlines the functional and non-functional requirements necessary for its
development. The system must allow users to input data such as GRE scores, TOEFL scores,
CGPA, and other related parameters to predict the probability of admission. It requires seamless
integration with pre-trained machine learning models, a user-friendly web interface for input
and visualization, and the capability to display prediction results and relevant graphs. The
software must ensure compatibility with browsers, responsiveness across devices, and efficient
data handling. It should also meet technical requirements like support for Flask for backend
operations, joblib for model integration, and libraries like Matplotlib and Seaborn for graph
rendering.
The Linear Regression model predicts the target as a linear function of the feature:

    y = b·x + c

where y is the target variable value, c is the y-intercept, b is the slope, and x is the value of the feature variable [5]. To train a Linear Regression model, one must find the value of θ that minimizes the RMSE, which is equivalent to minimizing the MSE cost function for a Linear Regression model:

    MSE(θ) = (1/m) · Σ_{i=1}^{m} (θT·x(i) − y(i))²

where hθ(x) = θT·x is the hypothesis function using the model parameters θ, m is the number of samples in the dataset, θT is the transpose of θ, x(i) is the instance's feature vector, θT·x(i) is the dot product of θT and x(i), and y(i) is the expected value [6].
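The cost function translates directly into code; this is a sketch with NumPy and made-up sample values:

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error of a linear hypothesis h_theta(x) = theta^T . x over m samples."""
    m = len(y)
    predictions = X @ theta              # theta^T . x(i) for every instance i
    return (1.0 / m) * np.sum((predictions - y) ** 2)

# made-up sample values; the first column of X is the bias term
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0])
print(mse_cost(np.array([1.0, 2.0]), X, y))   # this theta fits exactly, so the cost is 0.0
```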
Random Forest is a powerful and versatile supervised machine learning algorithm that grows and combines multiple decision trees to create a “forest.” It can be used for both classification and regression problems in R and Python. A decision tree is another type of algorithm used to classify data. In very simple terms, you can think of it like a flowchart that draws a clear pathway to a decision or outcome; it starts at a single point and then branches off into two or more directions, with each branch of the decision tree offering different possible outcomes.
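As a sketch, a Random Forest regressor can be grown on made-up admission-style data to show the ensemble of decision trees it combines:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(290, 340, size=(200, 1))          # stand-in GRE scores
y = (X[:, 0] - 290) / 50 * 0.5 + 0.4              # stand-in chance of admit

# 100 decision trees are grown; their predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_))                    # the individual trees in the "forest"
```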
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into
strong learners, in which each new model is trained to minimize the loss function such as
mean squared error or cross-entropy of the previous model using gradient descent. In each
iteration, the algorithm computes the gradient of the loss function with respect to the
predictions of the current ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the ensemble, and the process is
repeated until a stopping criterion is met.
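This stage-by-stage behaviour can be observed with scikit-learn's GradientBoostingRegressor, whose staged_predict method exposes the ensemble after each boosting iteration; the data below is a synthetic stand-in:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 2))
y = 0.6 * X[:, 0] + 0.4 * X[:, 1]     # synthetic stand-in target

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# training error after each of the 50 boosting stages
errors = [mean_squared_error(y, pred) for pred in gbr.staged_predict(X)]
print(errors[0] > errors[-1])         # later stages reduce the training loss
```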
3.2.4 Dataset
The dataset is available at [1]. At the time of writing, the dataset has over 400 downloads and more than 2000 views. It contains parameters that are considered carefully by the admissions committee. The first section contains scores including GRE, TOEFL, and Undergraduate GPA. Statement of Purpose and Letter of Recommendation strengths are two other important entities, and Research Experience is recorded in binary form. All parameters are normalized before training to ensure that values lie within the specified range. A few profiles in the dataset contain values previously obtained by students. A unique feature of this dataset is that it contains an equal number of categorical and numerical features. The data has been collected and prepared primarily from an Indian student's perspective; however, it can also be used with other grading systems with minor modifications. A second version of the dataset will be released with an additional two hundred entries.
Below is a screenshot of the user interface after submitting the student profile. Based on the information the student provides, the machine learning model calculates the likelihood of admission to a master's program. This prediction is intended to give an idea of how the student's academic and personal metrics align with the requirements of competitive universities.
Input: The dataset interface accepts structured data in CSV format. This file contains the
features such as GRE Score, TOEFL Score, University Rating, SOP (Statement of Purpose
Strength), LOR (Letter of Recommendation Strength), CGPA, Research Experience (Binary:
0 for No, 1 for Yes). The data is loaded using a data processing module, which ensures proper
handling of missing values, outliers, and normalization where required.
Output: The processed data is passed to the machine learning module for predicting the
chances of graduate admission. The output format includes:
A feature matrix (X) containing input attributes such as GRE Score, TOEFL Score, University
Rating, SOP Strength, LOR Strength, CGPA, and Research Experience. A target vector (y)
representing the actual chance of admission. Predicted probabilities indicating the likelihood
of admission.
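The input/output handling described above can be sketched as follows; the column names follow the text, while the file contents are invented sample rows:

```python
import pandas as pd
from io import StringIO

# stand-in for the CSV file described above; rows are invented examples
csv_data = StringIO(
    "GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit\n"
    "337,118,4,4.5,4.5,9.65,1,0.92\n"
    "324,107,4,4.0,4.5,8.87,1,0.76\n"
)
df = pd.read_csv(csv_data)

X = df.drop(columns=["Chance of Admit"])   # feature matrix of input attributes
y = df["Chance of Admit"]                  # target vector of actual chances
print(X.shape, y.shape)
```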
The system should present prediction results along with visual graphs. The interface should work seamlessly in both local and web environments and must be compatible with common browsers, ensuring a smooth user experience.
Summary
This chapter covers the software requirements specification: the functional and non-functional requirements, the software and hardware tools required for the project, the feasibility study, and other constraints on the functional and non-functional requirements.
CHAPTER 4
DESIGN
This chapter provides a comprehensive overview of the system's structural and functional
architecture, emphasizing the flow of data and interactions between various components. This
chapter plays a crucial role in illustrating how the system is organized and how it operates to
meet the project requirements.
The above diagram illustrates the workflow of a typical machine learning project as applied to the graduate admissions prediction task. It outlines the key stages involved, from data acquisition to model evaluation and prediction.
The process begins with Finding Data, where relevant information about applicants is gathered, such as their academic records, test scores, and other relevant factors. This data is then subjected to Data Cleaning to handle missing values, inconsistencies, and outliers. Subsequently, Data
Analysis is performed to understand the characteristics of the data, identify patterns, and gain
insights into the relationships between different variables. Data Visualization techniques are
employed to visually represent the data and its distributions, aiding in further understanding
and identifying potential trends.
Next, the model selection and training phase begins. One can experiment with various machine learning algorithms, such as Linear Regression, Random Forest, and the Gradient Boosting Regressor. Each algorithm has its own strengths and weaknesses, and the choice depends on the specific characteristics of the data and the desired level of accuracy.
Once a model is trained, it is used to make Admission Predictions. The model takes the input
data of a new applicant and predicts their likelihood of admission based on the patterns learned
from the training data. Finally, the Model Decision using KPIs step involves evaluating the
model's performance using relevant metrics (Key Performance Indicators) such as accuracy,
precision, recall, and F1-score. This evaluation helps assess the model's effectiveness and
identify areas for improvement.
In essence, this diagram provides a roadmap for the graduate admissions prediction project, covering the essential steps from data collection to model deployment and evaluation. By following this workflow and iteratively refining the approach, a robust and accurate prediction system can be developed.
The applicant enters their details through the Frontend System, which may be a user-friendly web interface or a mobile application. The Frontend System then acts as a gateway, transmitting this raw input data to the Backend System.
Within the Backend System, the data undergoes a rigorous validation and preprocessing phase.
This crucial step ensures data integrity and prepares it for consumption by the Machine
Learning Model. Validation involves checking for errors, inconsistencies, and missing values,
ensuring the data adheres to defined standards and constraints. Preprocessing encompasses a
range of techniques, such as data cleaning to handle missing values or outliers, and data
transformation to normalize or scale the data into a format suitable for the model's analysis.
The preprocessed data is then seamlessly transferred to the Machine Learning Model, the core
of this system. This model, trained on a vast historical dataset of admitted and rejected
applicants, employs sophisticated algorithms to analyze the input data and identify patterns
and relationships between various factors, such as GPA, GRE scores, research experience, and
other relevant criteria. Based on this analysis, the Machine Learning Model generates a
predicted likelihood of admission for the current applicant.
The predicted likelihood of admission is then communicated back to the Backend System.
The Backend System fulfills two vital functions at this stage. Firstly, it stores the applicant's
data, including their input details and the predicted outcome, in a secure and organized
database. This data repository serves as a valuable resource for future analysis, trend
identification, and system improvements. Secondly, the Backend System transmits the
predicted likelihood percentage back to the Frontend System. Finally, the Frontend System
receives the predicted likelihood percentage and presents it to the applicant in a clear and
concise manner. This provides the applicant with an immediate assessment of their admission
prospects, empowering them to make informed decisions regarding their application strategy.
By automating this critical aspect of the admission process, the system offers several key
advantages. It streamlines the evaluation process, enabling admissions committees to process
applications more efficiently. It provides applicants with a data-driven and objective
assessment of their chances, fostering transparency and fairness. Furthermore, the system can
be continuously improved by refining the Machine Learning Model with new data and
incorporating feedback from stakeholders.
Summary
The design chapter provides a detailed representation of the system’s architecture, showcasing
how data flows through various processes and how different components interact to achieve
the desired functionality. The inclusion of the Data Flow Diagram and Sequence Diagram
ensures a clear understanding of the system's operation and helps visualize the implementation
details effectively.
CHAPTER 5
IMPLEMENTATION
The implementation involves a structured approach that leverages machine learning techniques
to predict the probability of a student being admitted to a graduate program. This begins with
data collection, where a dataset containing various attributes related to graduate admissions is
obtained. Common datasets include features like GRE scores, TOEFL scores, undergraduate
GPA, ratings of the statement of purpose and letters of recommendation, work or research
experience, and other related attributes.
Next, the data undergoes preprocessing to ensure it is clean and ready for analysis. This
involves handling missing values, encoding categorical data, and normalizing numerical
features to ensure they are scaled appropriately. Feature engineering may also be conducted to
create new meaningful variables or remove redundant ones. An exploratory data analysis
(EDA) phase is performed to understand the relationships between variables and identify
patterns or trends in the dataset. Visualizations such as scatter plots, correlation matrices, and
histograms are used to gain insights into the data distribution.
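The preprocessing steps described above can be sketched as follows. The snippet is illustrative: the DataFrame is a small synthetic stand-in for the admissions data, and the column names are assumptions based on the features listed earlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the admissions dataset (column names are assumptions).
data = pd.DataFrame({
    "GRE Score": [337, 324, np.nan, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.00, 8.67, 8.21],
    "Research": [1, 1, 1, 1, 0],
    "Chance of Admit": [0.92, 0.76, 0.72, 0.80, 0.65],
})

# Handle missing values by imputing the column mean.
data["GRE Score"] = data["GRE Score"].fillna(data["GRE Score"].mean())

# Normalize numerical features into the [0, 1] range.
num_cols = ["GRE Score", "TOEFL Score", "CGPA"]
data[num_cols] = MinMaxScaler().fit_transform(data[num_cols])

print(data.isnull().sum().sum())  # no missing values remain
```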
After preprocessing, the project transitions to the model development phase, where various
machine learning algorithms are applied. Initial models may include Linear Regression for a
baseline, followed by more advanced techniques such as Decision Trees, Random Forest,
Gradient Boosting (e.g., XGBoost), or Neural Networks, depending on the complexity and size
of the dataset. The dataset is split into training and testing subsets to evaluate the models'
performance. Cross-validation is employed to ensure the model generalizes well to unseen
data.
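A minimal sketch of the train/test split and cross-validation step described above, using synthetic data in place of the admissions dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic features and target standing in for the admissions data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, 0.3, 0.1, 0.05]) + rng.normal(scale=0.1, size=200)

# Hold out a test set, then cross-validate on the training portion
# to check that the model generalizes to unseen data.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)
scores = cross_val_score(LinearRegression(), train_x, train_y,
                         cv=5, scoring="r2")
print(scores.mean())  # mean R² across the 5 folds
```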
The final phase involves building a user-friendly interface to make predictions accessible to
end-users. This could be implemented as a web-based application using frameworks like Flask
or Django, where users can input their details (e.g., GRE scores, GPA, etc.) and receive an
admission probability. Additionally, the model can be integrated with visualization dashboards
to provide detailed insights into the predictions and factors influencing the results. This project
not only demonstrates practical machine learning implementation but also serves as a valuable
tool for students to make informed decisions about their graduate applications.
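A minimal Flask sketch of such an interface. The route name, the input field names, and the stub prediction formula are illustrative assumptions, not the project's actual code; in the real application the trained model would be loaded from disk with joblib.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub in place of the trained model; coefficients are illustrative only.
def predict_chance(gre, toefl, cgpa):
    score = 0.002 * gre + 0.003 * toefl + 0.05 * cgpa - 0.5
    return max(0.0, min(1.0, score))  # clamp to a valid probability

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    chance = predict_chance(payload["gre"], payload["toefl"], payload["cgpa"])
    return jsonify({"chance_of_admit": round(chance, 3)})

# To serve locally: app.run(debug=True)
```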
The provided code outlines the process of training multiple machine learning models to predict
the "Chance of Admit" for graduate admissions based on various features. It begins by
importing essential Python libraries, such as pandas and numpy for data manipulation,
matplotlib and seaborn for visualization, and machine learning tools from sklearn. The dataset
is prepared by splitting it into the input variables (x) and the target variable (y), where x contains
all features except the "Chance of Admit" column, which is set as the target (y). Three different
regression models are trained: Linear Regression, Random Forest Regressor, and Gradient
Boosting Regressor. Finally, evaluation metrics such as Mean Absolute Error (MAE), Mean
Squared Error (MSE), and R² Score are imported to assess the accuracy and performance of the
trained models.
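A sketch consistent with this description, using a synthetic DataFrame in place of the real dataset (the column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)),
                  columns=["GRE Score", "TOEFL Score", "CGPA"])
df["Chance of Admit"] = (0.4 * df["CGPA"] + 0.3 * df["GRE Score"]
                         + 0.1 * df["TOEFL Score"]
                         + rng.normal(scale=0.05, size=300))

# Split into input variables (x) and the target variable (y).
x = df.drop(columns=["Chance of Admit"])
y = df["Chance of Admit"]
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Train the three regression models named in the text.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(train_x, train_y)
```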
The above code is used to evaluate the performance of different regression models. First, it
calculates the mean squared error (MSE) between the actual test labels (test_y) and the
predicted values from the Linear Regression model (y_pred_lr). Then, it computes the root
mean squared error (RMSE) by taking the square root of the MSE. RMSE is a commonly
used metric to assess the accuracy of a regression model, with lower values indicating better
performance. The R² score is also calculated for the Linear Regression model using the
r2_score() function, which measures how well the model fits the data (with 1 being a perfect
fit and values closer to 0 indicating a worse fit). After that, a Pandas DataFrame is created to
compile the performance results of all three models.
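The evaluation described above can be sketched as follows, with illustrative stand-in values for test_y and y_pred_lr:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual vs. predicted values (stand-ins for test_y, y_pred_lr).
test_y = np.array([0.92, 0.76, 0.72, 0.80, 0.65])
y_pred_lr = np.array([0.90, 0.78, 0.70, 0.79, 0.68])

mse = mean_squared_error(test_y, y_pred_lr)
rmse = np.sqrt(mse)               # root of the MSE; lower is better
r2 = r2_score(test_y, y_pred_lr)  # 1.0 is a perfect fit

# Compile the results in a DataFrame (only Linear Regression shown here).
results = pd.DataFrame({"Model": ["Linear Regression"],
                        "RMSE": [rmse], "R2": [r2]})
print(results)
```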
This code demonstrates the process of building and saving a Linear Regression model using
scikit-learn. First, it imports necessary libraries. The code then creates an instance of the
StandardScaler class and uses it to standardize the training and test features (train_x and
test_x). Standardization scales the data so that each feature has a mean of 0 and a standard
deviation of 1. Next, the code initializes the Linear Regression model (lr = LR()) and fits it to
the standardized training data (train_x and train_y). Finally, the code uses joblib.dump() to
save the trained Linear Regression model to a file (linear_regression_model.pkl).
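A sketch of this standardize-fit-save sequence, with synthetic stand-ins for train_x, test_x, and train_y:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training/test splits used in the text.
rng = np.random.default_rng(2)
train_x, test_x = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
train_y = train_x @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.05, size=80)

# Standardize each feature to mean 0 and standard deviation 1.
scaler = StandardScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)  # reuse the fitted scaler on the test set

# Fit the Linear Regression model and persist it to disk.
lr = LR()
lr.fit(train_x, train_y)
joblib.dump(lr, "linear_regression_model.pkl")
```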
5.2 Snapshots
A snapshot here is a captured image of an application screen at a particular point in its workflow. In this section, sample screenshots of the project pages are included, each with a description explaining when the screen appears, what actions it performs, and what the outcome of that screen is.
The above code snippet demonstrates the initial steps in a "Graduate Admission Prediction"
project using the Python programming language and the Pandas library.
Firstly, data = pd.read_csv('GAdata.csv') reads data from a CSV file named "GAdata.csv" and
stores it in a Pandas DataFrame called data. This DataFrame is a 2-dimensional, labeled data
structure that efficiently holds and manipulates tabular data.
Next, data.columns displays the names of the columns in the DataFrame. These column names
likely represent features relevant to graduate admissions, such as GRE Score, TOEFL Score,
University Rating, Statement of Purpose (SOP), Letter of Recommendation (LOR), CGPA,
Research experience, and the target variable "Chance of Admit."
Finally, data.head() displays the first five rows of the DataFrame. This provides a quick
overview of the data, allowing for an initial inspection of the values and data types.
This code snippet represents the initial data loading and exploration phase of the project. The
subsequent steps would involve data cleaning, feature engineering, model selection, training,
evaluation, and deployment.
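A sketch of this loading-and-inspection step. Since GAdata.csv is not reproduced here, a tiny in-memory sample with the assumed columns stands in for the pd.read_csv call:

```python
import pandas as pd

# In the actual project: data = pd.read_csv('GAdata.csv')
# Here a small in-memory sample with the assumed columns stands in.
data = pd.DataFrame({
    "GRE Score": [337, 324, 316],
    "TOEFL Score": [118, 107, 104],
    "University Rating": [4, 4, 3],
    "SOP": [4.5, 4.0, 3.0],
    "LOR": [4.5, 4.5, 3.5],
    "CGPA": [9.65, 8.87, 8.00],
    "Research": [1, 1, 1],
    "Chance of Admit": [0.92, 0.76, 0.72],
})

print(data.columns.tolist())  # feature names plus the target column
print(data.head())            # first rows for a quick inspection
```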
The above code outlines the process of building and evaluating a linear regression model to
predict the "Chance of Admit" based on various features. Initially, the data is prepared by
separating the features (independent variables) from the target variable ("Chance of Admit").
Then, the data is divided into training and testing sets to evaluate the model's performance on
unseen data. Next, a linear regression model is created and trained on the training data. The
trained model is then used to make predictions on the testing set. Finally, the model's
performance is evaluated using metrics such as Mean Absolute Error (MAE), Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), and R-squared score. These metrics provide
insights into the accuracy and reliability of the model's predictions.
The heatmap visualizes the correlation matrix among the various features and the target
variable ("Chance of Admit") in the graduate admissions dataset. It reveals that CGPA, GRE
Score, and TOEFL Score exhibit strong positive correlations with the "Chance of Admit,"
suggesting these are highly influential factors. University Rating, SOP, and LOR also show
moderate positive correlations, indicating their significance in the admission process.
Research experience appears to have a weaker correlation, implying it might not be as
influential as the other factors. This heatmap provides valuable insights into the relative
importance of each feature, guiding feature selection and model development for more
accurate predictions in the graduate admission prediction project.
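A sketch of how such a heatmap can be produced with seaborn. The sample below is synthetic, with only a few of the assumed columns, so the correlation values are illustrative rather than the report's actual figures:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic sample with correlated features (column names are assumptions).
rng = np.random.default_rng(3)
cgpa = rng.uniform(7, 10, 100)
data = pd.DataFrame({
    "CGPA": cgpa,
    "GRE Score": 290 + 5 * cgpa + rng.normal(scale=3, size=100),
    "Chance of Admit": 0.1 * cgpa - 0.1 + rng.normal(scale=0.03, size=100),
})

# Compute and plot the correlation matrix.
corr = data.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.savefig("heatmap.png")
```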
The boxplot provides a visual summary of the distribution of five key features: University
Rating, Statement of Purpose (SOP), Letter of Recommendation (LOR), Cumulative Grade
Point Average (CGPA), and Research Experience. The boxplots reveal that CGPA is skewed
towards higher values, suggesting a majority of applicants have a high CGPA. University
Rating, SOP, and LOR distributions are relatively similar with medians around 3.5. The
Research Experience feature is binary, with a majority of applicants lacking research
experience. This boxplot information can be used to guide data preprocessing steps such as scaling and outlier handling, as well as feature engineering for the graduate admission prediction project.
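A sketch of how such a boxplot can be drawn for the five features named above, on a synthetic sample (the value ranges are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic sample of the five plotted features.
rng = np.random.default_rng(4)
features = pd.DataFrame({
    "University Rating": rng.integers(1, 6, 100),
    "SOP": rng.choice(np.arange(1.0, 5.5, 0.5), 100),
    "LOR": rng.choice(np.arange(1.0, 5.5, 0.5), 100),
    "CGPA": rng.uniform(7, 10, 100),
    "Research": rng.integers(0, 2, 100),  # binary feature
})

# One box per feature, side by side.
features.boxplot()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("boxplot.png")
```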
The histogram visualizes the distribution of the "Chance of Admit" variable in the graduate
admissions dataset. The distribution appears to be roughly bell-shaped, indicating a normal or
near-normal distribution. The "Chance of Admit" values range from approximately 0.4 to 1.0,
with the majority of applicants having a "Chance of Admit" between 0.6 and 0.8. This
histogram provides valuable insights into the distribution of the target variable, which can
guide the selection of appropriate machine learning models and evaluation metrics. It can also
help identify potential outliers and aid in interpreting the model's predictions in the context of
the graduate admission prediction project.
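A sketch of such a histogram, with synthetic "Chance of Admit" values generated to match the roughly bell-shaped distribution described above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, roughly bell-shaped values clipped to the [0.4, 1.0] range.
rng = np.random.default_rng(5)
chance = np.clip(rng.normal(loc=0.72, scale=0.12, size=400), 0.4, 1.0)

counts, bins, _ = plt.hist(chance, bins=20, edgecolor="black")
plt.xlabel("Chance of Admit")
plt.ylabel("Number of applicants")
plt.savefig("histogram.png")
```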
The scatter plot visualizes the relationship between GRE Score and Chance of Admit in the
graduate admissions dataset. The plot reveals a positive correlation between the two variables,
indicating that higher GRE scores are generally associated with higher chances of admission.
However, the points are not perfectly aligned, suggesting that GRE Score is not the sole
determinant of admission and that other factors also significantly influence the admission
decision. The scatter plot confirms the importance of GRE Score as a predictor and provides
valuable insights for model selection and feature engineering in the graduate admission
prediction project.
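A sketch of such a scatter plot, on synthetic GRE scores and admission chances built with a positive but noisy relationship, mirroring the pattern described above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic GRE scores and admission chances (the formula is illustrative).
rng = np.random.default_rng(6)
gre = rng.uniform(290, 340, 200)
chance = np.clip(0.012 * (gre - 290) + 0.35
                 + rng.normal(scale=0.05, size=200), 0, 1)

plt.scatter(gre, chance, alpha=0.6)
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admit")
plt.savefig("scatter.png")

# Quantify the positive correlation visible in the plot.
corr = np.corrcoef(gre, chance)[0, 1]
print(round(corr, 2))
```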
Summary
This chapter presented the implementation of the project, including the packages used to build the model and code snippets of the model, along with screenshots of the project pages and explanations of their functionality.
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENTS
6.1 Conclusion
After evaluating the trained models on the dataset, we compare their performance to find out which model predicts best. MSE and R² scores are tabulated for all the models.
It is clear that Linear Regression performs the best on our dataset, with a low MAE and a high R² score, closely followed by the Random Forest Regressor. This can be attributed to the largely linear dependencies among the features in the dataset: higher test scores, GPA, and other factors generally result in greater chances of admission. The inclusion of a few outliers has influenced the Linear model to some extent. The overall objective of the project was achieved successfully, as the system allows students to save the time and money they would otherwise spend on education consultants and on application fees for universities where they have little chance of securing admission. It will also help students make better and faster decisions when applying to universities.
BIBLIOGRAPHY