Mini Project
STUDENT PERFORMANCE ANALYSIS BASED ON MACHINE LEARNING
A Project report Submitted to
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, KAKINADA
in partial fulfillment of the requirement
For the award of Degree of
BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING
BY
K.SAI KALPANA—(19KQ5A0501)
B.MAHESH BABU—(18KQ1A0537)
G.TULASI RAM—(18KQ1A0549)
2018-2022
PACE INSTITUTE OF TECHNOLOGY & SCIENCES
(Approved by A.I.C.T.E., New Delhi & Govt of Andhra Pradesh, Affiliated to JNTU Kakinada)
ACCREDITED BY NAAC WITH ‘A’ GRADE & NBA (An ISO 9001:2015 Certified Institution)
NH-5, Near Valluramma Temple, ONGOLE-523272, Contact No: 08592-201167, www.pace.ac.in
CERTIFICATE
This is to certify that the project entitled “STUDENT PERFORMANCE ANALYSIS BASED
ON MACHINE LEARNING” is a bonafide work done by KONKALA SAI KALPANA
(Regd No 19KQ5A0501), BADUKURI MAHESH BABU (Regd No 18KQ1A0537), GORANTLA
TULASI RAM (Regd No 18KQ1A0549), submitted in partial fulfillment of the requirements for the award
of the Degree of Bachelor of Technology in Computer Science and Engineering during the academic years 2018-2022.
The results embodied in this project report have not been submitted to any other University or Institute for the award
of any other degree or diploma.
DECLARATION
ACKNOWLEDGEMENT
K.SAI KALPANA(19KQ5A0501)
B.MAHESHBABU(18KQ1A0537)
G.TULASI RAM(18KQ1A0549)
INDEX
CONTENTS
1. ABSTRACT
7. ALGORITHM
9. IMPLEMENTATION
9.2 SCREENSHOTS
12. CONCLUSION
1. ABSTRACT
Performance analysis of learning outcomes is a system which strives for excellence at different levels and across diverse dimensions of students' interests. This paper proposes a complete EDM (Educational Data Mining) framework in the form of a rule-based recommender system that is developed not only to analyze and predict the student's performance, but also to exhibit the reasons behind it. The proposed framework analyzes the students' demographic data, study-related attributes and psychological characteristics to extract all possible knowledge from students, teachers and parents, seeking the highest possible accuracy in academic performance prediction using a set of powerful data mining techniques. The framework succeeds in highlighting the student's weak points and providing appropriate recommendations. A realistic case study conducted on 200 students demonstrates the outstanding performance of the proposed framework.
2.1 EXISTING SYSTEM:-
The previous predictive models focused only on the student's demographic data, such as gender, age, family status, family income and qualifications, in addition to study-related attributes including homework and study hours as well as previous achievements and grades. These previous works were limited to predicting academic success or failure, without illustrating the reasons behind the prediction. Most of the previous studies gathered more than 40 attributes in their data sets to predict the student's academic performance, and these attributes came from a single data category, whether demographic or study-related.
Disadvantages:
● As a result, the generated rules did not fully extract the knowledge about the reasons behind student dropout.
● Apart from the previously mentioned work, there were earlier statistical analysis models from the perspective of educational psychology that conducted a couple of studies to examine the correlation between mental health and academic performance.
● The recommendations were too brief and did not illustrate the methodologies needed to apply them.
2.2 PROPOSED SYSTEM:-
The proposed framework first focuses on merging the demographic and study-related attributes with the educational psychology field by adding the student's psychological characteristics to the previously used data set (i.e., the students' demographic data and study-related attributes). After surveying the factors previously used for predicting a student's academic performance, we picked the most relevant attributes based on their rationale and their correlation with academic performance.
Advantages:
● The proposal analyzes a student's demographic data, study-related details and psychological characteristics in terms of the final state, to determine whether the student is on the right track, struggling, or even failing. It also includes an extensive comparison of our proposed model with other previous related models.
System Architecture:
3. LITERATURE SURVEY
Title: Learning patterns of university student retention
Author: Nandeshwar, T. Menzies and A. Nelson
Learning predictors for student retention is very difficult. After reviewing the literature, it is evident that there is
considerable room for improvement in the current state of the art. As shown in this paper, improvements are
possible if we explore a wide range of learning methods; take care when selecting attributes; assess the efficacy of
the learned theory not just by its median performance, but also by the variance in that performance; study the delta
of student factors between those who stay and those who leave. Using these techniques, for the goal of
predicting if students will remain for the first three years of an undergraduate degree, the following factors were
found to be informative: family background and family's social-economic status, high school GPA and test scores.
Classification/clustering analyzes a set of data and generates a set of grouping rules which can be used to classify future data. Data mining is the process of extracting information from a data set and transforming it into an understandable structure. It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns. Data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression and summarization. Classification is a major technique in data mining and is widely used in various fields. It is a data mining (machine learning) technique used to predict group membership for data instances. In this paper, we present the basic classification techniques. Several major kinds of classification methods, including decision tree induction, Bayesian networks and the k-nearest neighbor classifier, are covered; the goal of this study is to provide a comprehensive review of the different classification techniques in data mining.
Author: Salam Ismaeel, Ali Miri et al
Title: Using the Extreme Learning Machine (ELM) technique for heart disease diagnosis
One of the most important applications of machine learning systems is the diagnosis of heart disease, which affects the lives of millions of people. Patients suffering from heart disease have many independent factors in common, such as age, sex, serum cholesterol and blood sugar, which can be used very effectively for diagnosis. In this paper an Extreme Learning Machine (ELM) algorithm is used to model these factors. The proposed system can replace a costly medical checkup with a warning system that alerts patients to the probable presence of heart disease. The system is implemented on real data collected by the Cleveland Clinic Foundation, where around 300 patients' records were gathered. Simulation results show this architecture has about 80% accuracy in determining heart disease.
Another related system applies a data mining classification technique, namely Naive Bayes. It is implemented as a web-based application in which the user answers predefined questions. It retrieves hidden data from the stored database and compares the user's values with the trained data set. It can answer complex queries for diagnosing heart disease and thus assist healthcare practitioners in making intelligent clinical decisions which traditional decision support systems cannot. By providing effective treatments, it also helps to reduce treatment costs.
We present HealthGear, a real-time wearable system for monitoring, visualizing and analyzing physiological signals. HealthGear consists of a set of non-invasive physiological sensors wirelessly connected via Bluetooth to a cell phone which stores, transmits and analyzes the physiological data and presents it to the user in an intelligible way. In this paper, we focus on an implementation of HealthGear using a blood oximeter to monitor the user's blood oxygen level and pulse while sleeping. We also describe two different algorithms for automatically detecting sleep apnea events, and illustrate the performance of the overall system in a sleep study with 20 volunteers.
4. SYSTEM LOW LEVEL DESIGN
4.1 MODULES
• DATA COLLECTION
• DATA PRE-PROCESSING
• FEATURE EXTRACTION
• EVALUATION MODEL
DATA COLLECTION
Data used in this paper is a set of student details taken from school records. This step is concerned with selecting the subset of all available data that you will be working with. ML problems start with data, preferably lots of data (examples or observations) for which you already know the target answer. Data for which you already know the target answer is called labelled data.
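A minimal sketch of this step for the present project, assuming the xAPI-Edu-Data.csv file used later in the implementation section is available in the working directory:

import pandas as pd

# Load the labelled student records into a DataFrame
df = pd.read_csv('xAPI-Edu-Data.csv')
print(df.shape)      # number of records and attributes
print(df.columns)    # attribute names (gender, NationalITy, Class, ...)
print(df.head())     # first few labelled examples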
DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it. Three common data pre-processing
steps are:
1. Formatting
2. Cleaning
3. Sampling
Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem; these instances may need to be removed. Additionally, there may be sensitive information in some of the attributes, and these attributes may need to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.
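The three steps can be sketched with pandas as follows; the 10% sample size is illustrative only and is not part of the project's actual pipeline:

import pandas as pd

df = pd.read_csv('xAPI-Edu-Data.csv')          # formatting: data already exported to a flat CSV file
df = df.drop_duplicates()                      # cleaning: remove duplicated records
df = df.dropna()                               # cleaning: remove incomplete instances
sample = df.sample(frac=0.1, random_state=0)   # sampling: a smaller representative subset for prototyping
print(len(df), len(sample))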
FEATURE EXTRACTION
The next step is feature extraction, which is an attribute-reduction process. Unlike feature selection, which ranks the existing attributes according to their predictive significance, feature extraction actually transforms the attributes. The transformed attributes, or features, are linear combinations of the original attributes. Finally, our models are trained using a classifier algorithm. We use the classify module of the Natural Language Toolkit library in Python together with the labelled dataset gathered; the rest of our labelled data is used to evaluate the models. Some machine learning algorithms were used to classify the pre-processed data; the chosen classifier was Random Forest.
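As an illustration of the attribute-reduction idea described above, the sketch below uses PCA, which builds new features as linear combinations of the original attributes, and then trains a Random Forest classifier on them. Synthetic data keeps the sketch self-contained; it is an example of the technique, not the exact pipeline of this project (which relies on label-encoded attributes and chi-squared selection):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 200 students, 12 numeric attributes (synthetic)
x, y = make_classification(n_samples=200, n_features=12, random_state=0)

pca = PCA(n_components=5)              # keep 5 linear combinations (illustrative)
x_reduced = pca.fit_transform(x)

clf = RandomForestClassifier(random_state=0)
clf.fit(x_reduced, y)
print(pca.explained_variance_ratio_)   # how much variance each new feature retains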
EVALUATION MODEL
Evaluation is an integral part of the model development process. It helps to find the best model that represents our data and shows how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data science, hold-out and cross-validation; to avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance. The performance of each classification model is estimated based on its average score, and the result is presented in visualized form as graphs of the classified data. Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily by dividing the number of correct predictions by the total number of predictions.
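A short sketch of the two evaluation methods mentioned above, hold-out and cross-validation, together with the accuracy calculation; synthetic data is used so the sketch is self-contained:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

x, y = make_classification(n_samples=200, random_state=0)

# Hold-out: keep a test set the model never sees during training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(x_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(x_test)))

# Cross-validation: average accuracy over several train/test folds
print("cross-validation accuracy:", cross_val_score(clf, x, y, cv=5).mean())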
5. DATA FLOW DIAGRAMS
[Data flow diagrams: Dataset → Feature Extraction → Algorithm; EDA → Dataset → Train Algorithm → Predict → Result]
6. SYSTEM DESIGN
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there would be only two possible classes. In simple words, the dependent variable is binary in nature, having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no). Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems.
Generally, logistic regression means binary logistic regression having binary target variables, but there can be more than two categories of target variables that can be predicted by it. Based on the number of categories, logistic regression can be divided into the following types:
Binary or Binomial: In such a kind of classification, the dependent variable will have only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.
Multinomial: In such a kind of classification, the dependent variable can have 3 or more possible unordered types, or types having no quantitative significance. For example, these variables may represent "Type A", "Type B" or "Type C".
Ordinal: In such a kind of classification, the dependent variable can have 3 or more possible ordered types, or types having a quantitative significance. For example, these variables may represent "poor", "good", "very good" or "excellent", and each category can have a score like 0, 1, 2, 3.
HOW LOGISTIC REGRESSION WORKS
Logistic regression uses an equation as its representation, very much like linear regression. Input values (x) are combined linearly using weights or coefficient values (referred to by the Greek capital letter beta) to predict an output value (y).
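Written out as code, the representation described above looks like the sketch below; the coefficient values b0 and b1 are illustrative, not values learned from the project's data:

import numpy as np

def predict_probability(x, b0=-1.0, b1=0.5):
    # P(Y=1 | x) = 1 / (1 + e^-(b0 + b1*x))
    z = b0 + b1 * x                 # input combined linearly with the coefficients
    return 1.0 / (1.0 + np.exp(-z))

print(predict_probability(4.0))     # probability of the positive class for x = 4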
● It can easily be extended to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
● It not only provides a measure of how appropriate a predictor is (coefficient size), but also its direction of association (positive or negative).
● It gives good accuracy for many simple datasets and performs well when the dataset is linearly separable.
● Logistic regression is less inclined to over-fitting, but it can overfit in high-dimensional datasets. One may consider regularization (L1 and L2) techniques to avoid over-fitting in such scenarios, as sketched below.
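A hedged sketch of the regularization mentioned in the last point, using scikit-learn's penalty options; the C values and the synthetic high-dimensional data are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# High-dimensional synthetic data where plain logistic regression may overfit
x, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Smaller C means stronger regularization
l2_model = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')

print(cross_val_score(l2_model, x, y, cv=5).mean())
print(cross_val_score(l1_model, x, y, cv=5).mean())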
Domain Specification
MACHINE LEARNING
Machine learning is a system that can learn from examples through self-improvement and without being explicitly coded by a programmer. The breakthrough comes with the idea that a machine can singularly learn from the data (i.e., examples) to produce accurate results.
Machine learning combines data with statistical tools to predict an output. This output is then used by corporations to derive actionable insights. Machine learning is closely related to data mining and Bayesian predictive modeling. The machine receives data as input and uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a Netflix account, all recommendations of movies or series are based on the user's historical data. Tech companies are using unsupervised learning to improve the user experience with personalized recommendations. Machine learning is also used for a variety of tasks like fraud detection, predictive maintenance, portfolio optimization and task automation.
For instance, the machine is trying to understand the relationship between the wage of an individual and the likelihood of going to a fancy restaurant. It turns out the machine finds a positive relationship between wage and going to a high-end restaurant: this is the model. Inference: when the model is built, it is possible to test how powerful it is on never-seen-before data. The new data are transformed into a feature vector, go through the model and give a prediction. This is the beautiful part of machine learning: there is no need to update the rules or train the model again. You can use the previously trained model to make inferences on new data.
The life of Machine Learning programs is straightforward and can be summarized in the following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to new sets of data.
Machine learning Algorithms and where they are used?
Machine learning can be grouped into two broad learning tasks: Supervised and Unsupervised. There are many other
algorithms
Supervised learning
An algorithm uses training data and feedback from humans to learn the relationship of given inputs to a given output.
For instance, a practitioner can use marketing expense and weather forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will predict new data.
There are two categories of supervised learning:
● Classification task
● Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial. You will start gathering data on the height, weight, job, salary, purchasing basket, etc. from your customer database. You know the gender of each of your customers; it can only be male or female. The objective of the classifier will be to assign a probability of being a male or a female (i.e., the label) based on the information (i.e., the features you have collected). Once the model has learned how to recognize male or female, you can use new data to make a prediction. For instance, you just got new information from an unknown customer and you want to know whether it is a male or a female. If the classifier predicts male = 70%, it means the algorithm is 70% sure that this customer is a male and 30% sure it is a female.
The label can have two or more classes. The above example has only two classes, but if a classifier needs to predict objects, there can be dozens of classes (e.g., glass, table, shoes, etc., where each object represents a class).
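The idea can be sketched as follows; the four synthetic features simply stand in for height, weight, salary and so on, so the numbers are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(x, y)            # learn from labelled customers

new_customer = [[0.5, -1.2, 0.3, 0.8]]          # an unseen example
print(clf.predict_proba(new_customer))          # probability of each class, e.g. 30% / 70%
print(clf.predict(new_customer))                # the most likely class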
Regression
When the output is a continuous value, the task is regression. For instance, a financial analyst may need to forecast the value of a stock based on a range of features like equity, previous stock performance and macroeconomic indices. The system will be trained to estimate the price of the stocks with the lowest possible error.
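A minimal regression sketch; the three synthetic features stand in for equity, past performance and macroeconomic indices:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = LinearRegression().fit(x_train, y_train)          # estimate a continuous value
print(mean_squared_error(y_test, model.predict(x_test)))  # lower error is better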
Unsupervised learning
In unsupervised learning, an algorithm explores input data without being given an explicit output variable (e.g., it explores customer demographic data to identify patterns). You can use it when you do not know how to classify the data and you want the algorithm to find patterns and classify the data for you.
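A short clustering sketch: no output variable is given and k-means finds the groups on its own (three clusters and blob-shaped synthetic data are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

x, _ = make_blobs(n_samples=200, centers=3, random_state=0)   # labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(kmeans.labels_[:10])        # cluster assigned to the first 10 points
print(kmeans.cluster_centers_)    # centres of the discovered groups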
Application of Machine learning
Augmentation:
● Machine learning assists humans with their day-to-day tasks, personally or commercially, without having complete control of the output. Such machine learning is used in different ways, such as virtual assistants, data analysis and software solutions. The primary use is to reduce errors due to human bias.
Automation:
● Machine learning that works entirely autonomously in any field without the need for any human intervention. For example, robots perform the essential process steps in manufacturing plants.
Finance industry
● Machine learning is growing in popularity in the finance industry. Banks mainly use ML to find patterns inside the data and also to prevent fraud.
Government organizations
● Governments make use of ML to manage public safety and utilities. Take the example of China and its massive use of face recognition; the government uses artificial intelligence to prevent jaywalking.
Healthcare industry
● Healthcare was one of the first industries to use machine learning, with image detection.
Marketing
● AI is used broadly in marketing thanks to abundant access to data. Before the age of mass data, researchers developed advanced mathematical tools like Bayesian analysis to estimate the value of a customer. With the boom of data, marketing departments rely on AI to optimize customer relationships and marketing campaigns.
Machine learning gives terrific results for visual pattern recognition, opening up many potential applications in physical inspection and maintenance across the entire supply chain network. Unsupervised learning can quickly search for comparable patterns in a diverse dataset; in turn, the machine can perform quality inspection throughout the supply chain. For instance, IBM's Watson platform can determine shipping container damage; Watson combines visual and systems-based data to track, report and make recommendations in real time.
In past years, stock managers relied extensively on primary methods to evaluate and forecast inventory. When big data and machine learning are combined, better forecasting techniques can be implemented (an improvement of 20 to 30% over traditional forecasting tools). In terms of sales, this means an increase of 2 to 3%.
For example, everybody knows the Google car. The car is full of lasers on the roof which tell it where it is with respect to the surrounding area. It has radar in the front, which informs the car of the speed and motion of all the cars around it. It uses all of that data to figure out not only how to drive the car but also to predict what potential drivers around the car are going to do. What's impressive is that the car is processing almost a gigabyte of data per second.
Deep Learning
Deep learning is computer software that mimics the network of neurons in a brain. It is a subset of machine learning and is called deep learning because it makes use of deep neural networks. The machine uses different layers to learn from the data; the depth of the model is represented by the number of layers in the model. Deep learning is the new state of the art in terms of AI. In deep learning, the learning phase is done through a neural network.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning in which systems are trained by receiving virtual "rewards" or "punishments", essentially learning by trial and error. Google's DeepMind has used reinforcement learning to beat a human champion at the game of Go. Reinforcement learning is also used in video games to improve the gaming experience by providing smarter bots. Some common reinforcement learning algorithms are:
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
AI in Finance: The financial technology sector has already started using AI to save time, reduce costs,
and add value. Deep learning is changing the lending industry by using more robust credit scoring. Credit
decision-makers can use AI for robust credit lending applications to achieve faster, more accurate risk
assessment, using machine intelligence to factor in the character and capacity of applicants.
Underwrite is a fintech company providing an AI solution for credit decision-makers. underwrite.ai uses AI to detect which applicants are more likely to pay back a loan. Their approach radically outperforms traditional methods.
AI in HR: Under Armour, a sportswear company, revolutionized hiring and modernized the candidate experience with the help of AI. In fact, Under Armour reduced hiring time for its retail stores by 35%. Under Armour faced growing popularity back in 2012; they had, on average, 30,000 resumes a month. Reading all of those applications and starting the screening and interview process was taking too long. The lengthy process to get people hired and on-boarded impacted Under Armour's ability to have their retail stores fully staffed, ramped up and ready to operate.
At that time, Under Armour had all of the 'must have' HR technology in place, such as transactional solutions for sourcing, applying, tracking and onboarding, but those tools weren't useful enough. Under Armour chose HireVue, an AI provider for HR solutions, for both on-demand and live interviews. The results were impressive; they managed to decrease time-to-fill by 35% and, in return, hired higher quality staff.
For example, deep-learning analysis of audio allows systems to assess a customer's emotional tone. If the customer is responding poorly to the AI chatbot, the system can reroute the conversation to real, human operators who take over the issue.
Apart from the three examples above, AI is widely used in other sectors/industries.
In the table below, we summarize the difference between machine learning and deep learning.

                 | Machine learning            | Deep learning
Execution time   | From a few minutes to hours | Up to weeks, since a neural network needs to compute a significant number of weights
With machine learning, you need less data to train the algorithm than with deep learning. Deep learning requires an extensive and diverse data set to identify the underlying structure. Besides, machine learning provides a faster-trained model; the most advanced deep learning architectures can take days to a week to train. The advantage of deep learning over machine learning is that it is highly accurate: you do not need to understand which features best represent the data, because the neural network learns how to select the critical features. In machine learning, you need to choose for yourself which features to include in the model.
TensorFlow
The most famous deep learning library in the world is Google's TensorFlow. Google uses machine learning in all of its products to improve the search engine, translation, image captioning and recommendations.
To give a concrete example, Google users can experience a faster and more refined search with AI. If the user types a keyword in the search bar, Google provides a recommendation about what the next word could be.
Google wants to use machine learning to take advantage of their massive datasets to give users the best
experience. Three different groups use machine learning:
● Researchers
● Data scientists
● Programmers.
They can all use the same toolset to collaborate with each other and improve their efficiency.
Google does not just have any data; they have the world's most massive computer, so TensorFlow was
built to scale. TensorFlow is a library developed by the Google Brain Team to accelerate machine
learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has several
wrappers in several languages like Python, C++ or Java.
TensorFlow Architecture
It is called TensorFlow because it takes input as a multi-dimensional array, also known as a tensor. You can construct a sort of flowchart of operations (called a graph) that you want to perform on that input. The input goes in at one end, flows through this system of multiple operations and comes out the other end as output. This is why it is called TensorFlow: the tensor goes in, flows through a list of operations, and then comes out the other side.
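A tiny sketch of that idea, assuming TensorFlow 2.x is installed: a tensor goes in, flows through a couple of operations, and a result comes out.

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # input: a multi-dimensional array (tensor)

y = tf.square(x)                            # first operation in the graph
z = tf.reduce_sum(y)                        # second operation: sum all elements

print(z.numpy())                            # 30.0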
Development Phase: This is when you train the model. Training is usually done on your desktop or laptop.
Run Phase or Inference Phase: Once training is done, TensorFlow can be run on many different platforms: on a desktop running Windows, macOS or Linux, in the cloud as a web service, or on mobile devices such as iOS and Android. You can train it on multiple machines and then run it on a different machine once you have the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were initially designed for video games. In late 2010, Stanford researchers found that GPUs were also very good at matrix operations and algebra, which makes them very fast for these kinds of calculations. Deep learning relies on a lot of matrix multiplication. TensorFlow is very fast at computing matrix multiplication because it is written in C++. Although it is implemented in C++, TensorFlow can be accessed and controlled by other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard, which enables you to monitor, graphically and visually, what TensorFlow is doing.
PYTHON OVERVIEW
Python is Interpreted: Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.
Python is Interactive: You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python is Object-Oriented: Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the National
Research Institute for Mathematics and Computer Science in the Netherlands
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, Unix shell, and other scripting languages.
Python is copyrighted. Like Perl, Python source code is now available under the GNU General Public
License (GPL).
Python is now maintained by a core development team at the institute, although Guido van Rossum
still holds a vital role in directing its progress.
Python Features
Python's features include:
Easy-to-learn: Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.
Easy-to-read: Python code is more clearly defined and visible to the eyes.
A broad standard library: Python's bulk of the library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
Databases: Python provides interfaces to all major commercial databases.
GUI Programming: Python supports GUI applications that can be created and ported to
many system calls, libraries, and windows systems, such as Windows MFC, Macintosh,
and the X Window system of Unix.
Scalable: Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has a big list of good features, few are listed below:
It can be used as a scripting language or can be compiled to byte-code for building large
applications.
It provides very high-level dynamic data types and supports dynamic type checking.
It supports automatic garbage collection.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
8. SOFTWARE REQUIREMENTS
● Python
● Anaconda Navigator
● Python built-in modules
• Numpy
• Pandas
• Matplotlib
• Sklearn
• Seaborn
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows you to launch applications and easily manage conda packages, environments and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows, macOS and Linux.
In order to run, many scientific packages depend on specific versions of other packages. Data scientists
often use multiple versions of many packages, and use multiple environments to separate these different
versions.
The command line program conda is both a package manager and an environment manager, to help
data scientists ensure that each version of each package has all the dependencies it requires and works
correctly.
Navigator is an easy, point-and-click way to work with packages and environments without needing to
type conda commands in a terminal window. You can use it to find the packages you want, install them
in an environment, run the packages and update them, all inside Navigator.
The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and write and execute
your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly popular
system that combine your code, descriptive text, output, images and interactive interfaces into a single
notebook file that is edited, viewed and used in a web browser.
What’s new in 1.9?
● Add support for Offline Mode for all environment related actions.
● Add support for custom configuration of main windows links.
● Numerous bug fixes and performance enhancements.
8.2 PYTHON
Python
Python is a general-purpose, versatile and popular programming language. It's great as a first language because it is concise and easy to read, and it is also a good language to have in any programmer's stack, as it can be used for everything from web development to software development and scientific applications. It has a simple, easy-to-use syntax, making it the perfect language for someone trying to learn computer programming for the first time.
Features of Python
A simple language which is easy to learn: Python has a very simple and elegant syntax. It's much easier to read and write Python programs compared to other languages like C++, Java or C#. Python makes programming fun and allows you to focus on the solution rather than the syntax. If you are a newbie, it's a great choice to start your journey with Python.
Portability
You can move Python programs from one platform to another and run them without any changes. Python runs seamlessly on almost all platforms, including Windows, Mac OS X and Linux.
Extensible and Embeddable: Suppose an application requires high performance. You can easily combine pieces of C/C++ or other languages with Python code. This will give your application high performance as well as scripting capabilities which other languages may not provide out of the box.
Large standard libraries to solve common tasks: Python has a number of standard libraries which make the life of a programmer much easier, since you don't have to write all the code yourself. For example, need to connect to a MySQL database on a web server? You can use the MySQLdb library after import MySQLdb. Standard libraries in Python are well tested and used by hundreds of people, so you can be sure that they won't break your application.
● Object-oriented
Everything in Python is an object. Object-oriented programming (OOP) helps you solve a complex problem intuitively. With OOP, you are able to divide these complex problems into smaller sets by creating objects.
Python
History and Versions:
Python is predominantly a dynamically typed programming language which was initiated by Guido van Rossum in the year 1989. The major design philosophy that was given most importance was the readability of the code and expressing an idea in fewer lines of code, rather than the verbose way of expressing things as in C++ and Java [K-8][K-9]. The other design philosophy worth mentioning is that there should always be a single, obvious way to express a given task, which is contrary to other languages such as C++ and Perl [K-10]. Python compiles to an intermediary code, and this in turn is interpreted by the Python Runtime Environment into native machine code. The initial versions of Python were heavily inspired by Lisp (for functional programming constructs). Python also heavily borrowed the module system, the exception model and keyword arguments from the Modula-3 language [K-10]. Python's developers strive not to entertain premature optimization, even though it might increase performance by a few basis points [K-9]. During its design, the creators conceptualized the language as being very extensible, and hence they designed it to have a small core library extended by a huge standard library [K-7]. As a result, Python is used as a scripting language, as it can easily be embedded into any application, although it can also be used to develop a full-fledged application. The reference implementation of Python is CPython. There are also other implementations, like Jython and IronPython, which use Python syntax and can additionally use any Java class (Jython) or .NET class (IronPython).
Versions: Python has two versions, the 2.x version and the 3.x version. The 3.x version is a backward-incompatible release that was brought out to fix many design issues which plagued the 2.x series. The latest release in the 2.x series is 2.7.6 and the latest in the 3.x series is 3.4.0.
Paradigms: Python supports multiple paradigms: object-oriented, imperative, functional, procedural and reflective. In the object-oriented paradigm, Python supports most OOP concepts, such as inheritance (including multiple inheritance) and polymorphism, but its lack of support for encapsulation is a blatant omission, as Python doesn't have private or protected members: all class members are public [K-11]. Earlier Python 2.6 versions didn't support some OOP concepts such as abstraction through interfaces and abstract classes [K-19]. Python also supports the concurrent paradigm, but we will not be able to make truly multitasking applications with it, as the inbuilt threading API is limited by the GIL (Global Interpreter Lock), and hence applications that use the threading API cannot run on multiple cores in parallel [K-12]. The only remedy is that the user has to either use the multiprocessing module, which forks processes, or use interpreters that haven't implemented the GIL, such as Jython or IronPython [K-12].
Compilation, Execution and Memory Management: Just like the other managed languages, Python compiles to an intermediary code, and this in turn is interpreted by the Python Runtime Environment into native machine code. The reference implementation (i.e., CPython) doesn't come with a JIT compiler, because of which the execution speed is slow compared to native programming languages [K-17]. If speed of execution is one of the important factors, we can use the PyPy interpreter, which includes a JIT compiler, rather than the Python interpreter that comes by default with the language [K-18]. The Python Runtime Environment also takes care of all allocation and deallocation of memory through the garbage collector. When a new object is created, the GC allocates the necessary memory; once the object goes out of scope, the GC doesn't release the memory immediately, but the object instead becomes eligible for garbage collection, which will eventually release the memory.
Typing Strategies: Python is a strongly, dynamically typed language; Python 3 also supports optional static typing [K-20]. There are a few advantages in using a dynamically typed language, the most prominent one being that the code is more readable, as there is less code (in other words, less boiler-plate code). But the main disadvantage of Python being a dynamic language is that there is no way to guarantee that a particular piece of code will run successfully for all the different data-type scenarios simply because it has run successfully with one type. Basically, we have no means of finding an error in the code until the code has started running.
Strengths, Weaknesses and Application Areas: Python is predominantly used as a scripting language alongside standalone applications that are developed with statically typed languages, because of the flexibility its dynamic typing provides. Python favours rapid application development, which qualifies it to be used for prototyping. To a certain extent, Python is also used in developing websites. Due to its dynamic typing and the presence of a virtual machine, there is a considerable overhead which translates to much lower performance when compared with native programming languages, and hence it is not suited to performance-critical applications.
8.3 NUMPY
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides
a multidimensional array object, various derived objects (such as masked arrays and matrices), and an
assortment of routines for fast operations on arrays, including mathematical, logical, shape
manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical
operations, random simulation and much more. At the core of the NumPy package is the ndarray object.
This encapsulates n-dimensional arrays of homogeneous data types, with many operations being
performed in compiled code for performance. There are several important differences between NumPy
arrays and the standard Python sequences:
• NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.
• The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.
• NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python's built-in sequences.
• A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today's scientific/mathematical Python-based software, just knowing how to use Python's built-in sequence types is insufficient; one also needs to know how to use NumPy arrays.
The points about sequence size and speed are particularly important in scientific computing. As a simple example, consider the case of multiplying each element in a 1-D sequence with the corresponding element in another sequence of the same length. If the data are stored in two Python lists, a and b, we could iterate over each element:
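The loop the sentence above refers to, followed by the equivalent NumPy one-liner (a and b are illustrative lists):

import numpy as np

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Pure-Python approach: iterate over each element
c = []
for i in range(len(a)):
    c.append(a[i] * b[i])

# NumPy approach: the loop runs in compiled code
c_np = np.array(a) * np.array(b)

print(c)       # [10, 40, 90, 160]
print(c_np)    # [ 10  40  90 160]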
The Numeric Python extensions (NumPy henceforth) is a set of extensions to the Python programming
language which allows Python programmers to efficiently manipulate large sets of objects organized in
grid-like fashion. These sets of objects are called arrays, and they can have any number of dimensions:
one dimensional arrays are similar to standard Python sequences, two-dimensional arrays are similar to
matrices from linear algebra. Note that one-dimensional arrays are also different from any other Python
sequence, and that two-dimensional matrices are also different from the matrices of linear algebra, in
ways which we will mention later in this text. Why are these extensions needed? The core reason is a
very prosaic one, and that is that manipulating a set of a million numbers in Python with the standard
data structures such as lists, tuples or classes is much too slow and uses too much space. Anything
which we can do in NumPy we can do in standard Python – we just may not be alive to see the program
finish. A more subtle reason for these extensions however is that the kinds of operations that
programmers typically want to do on arrays, while sometimes very complex, can often be decomposed
into a set of fairly standard operations. This decomposition has been developed similarly in many array
languages. In some ways, NumPy is simply the application of this experience to the Python language –
thus many of the operations described in NumPy work the way they do because experience has shown
that way to be a good one, in a variety of contexts. The languages which were used to guide the
development of NumPy include the infamous APL family of languages, Basis, MATLAB, FORTRAN,
S and S+, and others. This heritage will be obvious to users of NumPy who already have experience
with these other languages. This tutorial, however, does not assume any such background, and all that
is expected of the reader is a reasonable working knowledge of the standard Python language. This
document is the “official” documentation for NumPy. It is both a tutorial and the most authoritative
source of information about NumPy with the exception of the source code. The tutorial material will
walk you through a set of manipulations of simple, small, arrays of numbers, as well as image files. This
choice was made because:
• A concrete data set makes explaining the behavior of some functions much easier to motivate than simply talking about abstract operations on abstract data sets;
• Every reader will have at least an intuition as to the meaning of the data and the organization of image files; and
• The result of various manipulations can be displayed simply, since the data set has a natural graphical representation.
are encouraged to follow the tutorial with a working NumPy installation at their side, testing the
examples, and, more importantly, transferring the understanding gained by working on images to their
specific domain. The best way to learn is by doing – the aim of this tutorial is to guide you along this
“doing.”
9. PYTHON ENVIRONMENT
Python is available on a wide variety of platforms including Linux and Mac OS X. Let's understand
how to set up our Python environment.
8. IMPLEMENTATION
8.1 SAMPLE CODE
Coding and Test Cases:
Code
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the xAPI-Edu-Data student dataset
df = pd.read_csv('xAPI-Edu-Data.csv')
df.head()
df.shape
df.info()

# Drop rows with missing values and check for nulls
df = df.dropna()
df.isnull().sum()
# EDA: countplots of each attribute, overall and split by performance class
sns.countplot(x="gender", order=['F','M'], data=df, palette="Set1")
plt.show()
sns.countplot(x="gender", order=['F','M'], hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()
df['NationalITy'].value_counts(normalize=True).plot(kind='bar')
plt.show()
df['PlaceofBirth'].value_counts(normalize=True).plot(kind='bar')
plt.show()
sns.countplot(y="NationalITy", data=df, palette="muted")
plt.show()
sns.countplot(y="NationalITy", hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()
sns.countplot(x="Relation", order=['Mum','Father'], data=df, palette="Set1")
plt.show()
sns.countplot(x="Relation", order=['Mum','Father'], hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()
sns.countplot(x="StageID", data=df, palette="muted")
plt.show()
sns.countplot(x="StageID", hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()
sns.countplot(x="GradeID", data=df, palette="muted")
plt.show()
sns.countplot(x="GradeID", hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()

plt.subplot(1, 2, 1)
sns.countplot(x="SectionID", order=['A','B','C'], data=df, palette="muted")
plt.subplot(1, 2, 2)
sns.countplot(x="SectionID", order=['A','B','C'], hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()

plt.subplot(1, 2, 1)
sns.countplot(y="Topic", data=df, palette="muted")
plt.subplot(1, 2, 2)
sns.countplot(y="Topic", hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()

sns.countplot(x="ParentschoolSatisfaction", data=df, palette="muted")
plt.show()
sns.countplot(x="ParentschoolSatisfaction", hue="Class", hue_order=['L','M','H'], data=df, palette="muted")
plt.show()

plt.figure(figsize=(8, 8))
sns.countplot('Class', data=df)
plt.title('Balanced Classes')
plt.show()
# Pre-processing: label-encode all categorical attributes
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df['LGender'] = le.fit_transform(df['gender'])
df['LNationalITy'] = le.fit_transform(df['NationalITy'])
df['LPlaceofBirth'] = le.fit_transform(df['PlaceofBirth'])
df['LStageID'] = le.fit_transform(df['StageID'])
df['LGradeID'] = le.fit_transform(df['GradeID'])
df['LSectionID'] = le.fit_transform(df['SectionID'])
df['LTopic'] = le.fit_transform(df['Topic'])
df['LSemester'] = le.fit_transform(df['Semester'])
df['LRelation'] = le.fit_transform(df['Relation'])
df['LParentschoolSatisfaction'] = le.fit_transform(df['ParentschoolSatisfaction'])
df['LParentAnsweringSurvey'] = le.fit_transform(df['ParentAnsweringSurvey'])
df['LStudentAbsenceDays'] = le.fit_transform(df['StudentAbsenceDays'])
df['LClass'] = le.fit_transform(df['Class'])
df.head(1)

# Drop the original (unencoded) categorical columns
df = df.drop(["gender", "NationalITy", "PlaceofBirth", "StageID", "GradeID",
              "SectionID", "Topic", "Semester", "Relation",
              "ParentAnsweringSurvey", "StudentAbsenceDays",
              "ParentschoolSatisfaction", "Class"], axis=1)
df.head()
df.to_csv('data.csv')
# Univariate feature selection with the chi-squared test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

x = df.iloc[:, df.columns != 'LClass']
y = df.iloc[:, df.columns == 'LClass']

bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(x, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
featureScores.nlargest(10, 'Score')
corr = df.corr()
# Split the encoded features and target into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
predict1 = lr.predict(x_test)
model1 = accuracy_score(y_test, predict1)
print(model1)
Test Case
1. Test case related to Dataset
10. RESULTS AND SCREENSHOTS
Data Visualization with Matplotlib and Seaborn
EDA has been carried out as a module to represent the dataset graphically for each attribute of the dataset.
[Screenshots: countplots of gender, NationalITy, PlaceofBirth, Relation, StageID, GradeID, SectionID, Topic, ParentschoolSatisfaction and the balanced Class distribution, shown overall and split by performance class (L/M/H).]
Accuracy with logistic regression:-
The user interaction model could be extended so that student records are supplied dynamically, and it could give staff an alert message about those students who are showing low performance. The prediction could also be built using a neural network, from which improved results can be expected, and non-academic attributes could be added alongside the academic attributes.
12. CONCLUSION
Finally, performance analysis for students is a major problem, and it is important that poor performance is identified and countered. The work reported in this thesis applies machine learning techniques to student records, analyzing the performance of each student and categorizing it into three levels: low, medium and high.