Assignment 3 - LP1
Aim:
Assignment on Classification technique
Every year many students take the GRE exam to get admission to foreign universities. The data set
contains GRE Scores (out of 340), TOEFL Scores (out of 120), University Rating (out of 5), Statement of
Purpose strength (out of 5), Letter of Recommendation strength (out of 5), Undergraduate GPA (out of
10), Research Experience (0=no, 1=yes), Admitted (0=no, 1=yes). Admitted is the target variable. The
data set is available on Kaggle (the last column of the dataset needs to be changed to 0 or 1).
Data Set: https://www.kaggle.com/mohansacharya/graduate-admissions
The counselor of the firm is supposed to check whether the student will get admission or not based on
his/her GRE score and academic score. To help the counselor take appropriate decisions, build a
machine learning classifier using a Decision tree to predict whether a student will get admission or not.
A. Apply data pre-processing (Label Encoding, Data Transformation, …) techniques if necessary.
B. Perform data preparation (Train-Test Split).
C. Apply Machine Learning Algorithm.
D. Evaluate Model.
Theory:
Classification: Classification may be defined as the process of predicting class or category from
observed values or given data points. The categorized output can have the form such as “Black” or
“White” or “spam” or “no spam”. Mathematically, classification is the task of approximating a mapping
function (f) from input variables (X) to output variables (Y).
Common classification algorithms include Naive Bayes, Logistic regression, K-nearest neighbours,
(Kernel) SVM and Decision tree.
1. Logistic Regression Algorithm: It is a Machine Learning classification algorithm that is used to
predict the probability of a categorical dependent variable. In logistic regression, the dependent
variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
Logistic regression model predicts P(Y=1) as a function of X.
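The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get the Logistic Regression equation are given below:
o The equation of a straight line can be written as:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
o In Logistic Regression y can be between 0 and 1 only, so we consider the ratio of y to (1-y), which
is 0 for y = 0 and infinity for y = 1:
$$\frac{y}{1-y}$$
o But we need a range between -infinity and +infinity, so we take the logarithm of this ratio and set
it equal to the linear expression:
$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$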
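The log-odds step above can be checked numerically. The following is a minimal sketch, not a library implementation; the coefficients b0 and b1 and the input x1 are hypothetical values chosen only for illustration. It shows that the sigmoid function turns the linear expression into a probability between 0 and 1, and that taking the log of y/(1-y) recovers the linear expression.

```python
import math

def sigmoid(z):
    """Map a log-odds value z in (-inf, +inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Map a probability y in (0, 1) to its log-odds log(y / (1 - y))."""
    return math.log(y / (1.0 - y))

# Hypothetical coefficients and input, used only to illustrate the algebra.
b0, b1, x1 = -4.0, 0.5, 10.0

z = b0 + b1 * x1   # linear part of the model: -4.0 + 0.5 * 10.0 = 1.0
y = sigmoid(z)     # predicted P(Y=1), always strictly between 0 and 1

print("linear part z :", z)
print("probability y :", round(y, 4))
print("log-odds of y :", round(logit(y), 4))  # equals z up to rounding
```

Because logit and sigmoid are inverses of each other, fitting a straight line to the log-odds is the same as fitting a sigmoid curve to the probability.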
Steps in Logistic Regression: To implement Logistic Regression using Python, we follow the same
steps as for a regression model. Below are the steps:
1. Data pre-processing step
2. Fitting Logistic Regression to the training set
3. Predicting the test result
4. Testing the accuracy of the result (creation of the confusion matrix)
5. Visualizing the test set result
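Steps 1 to 4 above can be sketched without any library, assuming a single synthetic feature and plain batch gradient descent. The data, the learning rate and the iteration count are assumptions made for illustration and are not taken from the admissions data set; in practice a library model such as scikit-learn's LogisticRegression would normally be used. Step 5 (visualization) is left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 1. Data pre-processing: synthetic raw scores 0..99 are centred and scaled;
#    the label is 1 when the raw score is at least 50 (hypothetical rule).
random.seed(0)
data = [((raw - 50) / 10.0, 1 if raw >= 50 else 0) for raw in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. Fit logistic regression to the training set with batch gradient descent.
b0, b1 = 0.0, 0.0
learning_rate = 0.1
for _ in range(1000):
    grad0 = grad1 = 0.0
    for x, y in train:
        error = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. z
        grad0 += error
        grad1 += error * x
    b0 -= learning_rate * grad0 / len(train)
    b1 -= learning_rate * grad1 / len(train)

# 3. Predict the test result: class 1 when P(Y=1) is at least 0.5.
predictions = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x, _ in test]

# 4. Confusion matrix: rows are actual classes, columns are predicted classes.
confusion = [[0, 0], [0, 0]]
for (_, actual), predicted in zip(test, predictions):
    confusion[actual][predicted] += 1
accuracy = (confusion[0][0] + confusion[1][1]) / len(test)

print("confusion matrix:", confusion)
print("accuracy:", accuracy)
```

The diagonal of the confusion matrix counts correct predictions, so accuracy is the diagonal sum divided by the number of test rows.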
2. Decision Tree Algorithm: Decision trees can be constructed by an algorithmic approach that
splits the dataset in different ways based on different conditions. Decision trees are among the most
powerful algorithms in the category of supervised learning algorithms. A tree is built as follows:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; such a
final node is called a leaf node.
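The five steps above can be sketched as a short recursive function. This is an illustrative sketch only: it assumes categorical attributes, uses weighted Gini impurity as the attribute selection measure, and runs on four hypothetical records that are not part of the admissions data set. The names build_tree, weighted_gini and predict are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, attribute, target):
    """Impurity that remains after splitting rows on attribute (lower is better)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(group) / len(rows) * gini(group) for group in groups.values())

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Leaf node: the subset is pure or no attribute is left, so return the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step-2: find the best attribute under the selection measure.
    best = min(attributes, key=lambda a: weighted_gini(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Step-3 to Step-5: split on each value of the best attribute and recurse.
    node = {"attribute": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node["children"][value] = build_tree(subset, remaining, target)
    return node

def predict(tree, row):
    """Follow the branches that match the row until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Hypothetical toy records, used only to exercise the functions.
rows = [
    {"gre": "high", "research": "yes", "admitted": 1},
    {"gre": "high", "research": "no", "admitted": 1},
    {"gre": "low", "research": "yes", "admitted": 1},
    {"gre": "low", "research": "no", "admitted": 0},
]
tree = build_tree(rows, ["gre", "research"], "admitted")  # Step-1: root holds all rows
print(tree)
print(predict(tree, {"gre": "low", "research": "no"}))
```

Library implementations such as CART work on numeric thresholds rather than one branch per category, but the recursion has the same shape.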
To choose the best attribute for the root node and for the sub-nodes, a technique called Attribute
Selection Measure (ASM) is used. Two popular ASMs are Information Gain, which is computed from
Entropy, and the Gini Index:
1. Information Gain: Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute. It calculates how much information a feature
provides us about a class. According to the value of information gain, we split the node and build
the decision tree.
2. Entropy: Entropy is a metric that measures the impurity in a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:
$$\mathrm{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$$
where S is the set of samples, P(yes) is the probability of "yes" and P(no) is the probability of "no".
3. Gini Index: Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should
be preferred over one with a high Gini index.
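The three measures above can be compared on a small worked example. The sketch below assumes a two-class node with 5 "yes" and 5 "no" samples that a hypothetical attribute splits into two subsets of five samples each; the counts are invented for illustration. Gini impurity is computed with the usual definition, 1 minus the sum of squared class proportions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity = 1 - sum over classes of p squared."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    remainder = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# Hypothetical node and split, chosen only for illustration.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("entropy of parent   :", entropy(parent))                       # 1.0 for a 50/50 node
print("information gain    :", round(information_gain(parent, [left, right]), 4))
print("gini of parent      :", gini(parent))                          # 0.5 for a 50/50 node
print("gini of left subset :", round(gini(left), 4))
```

A split is useful when it raises information gain or, equivalently in spirit, lowers the Gini impurity of the subsets relative to the parent.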
3. SVM Algorithm: Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
Typical applications of classification algorithms include:
1. Sentiment Analysis
2. Email Spam Classification
3. Document Classification
4. Image Classification
Code:
# Load the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # required by the sns.* plotting calls below
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('Admission_Predict.csv')
print(df.head(5))
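##########################################################################
# Drop the irrelevant column and check whether there are any null values in the dataset.
# Assumes the irrelevant column is the row index 'Serial No.' of the Kaggle CSV.
df = df.drop(columns=['Serial No.'], errors='ignore')
print(df.isnull().sum())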
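# Distributions of the first three score columns.
# Assumes the CSV columns are named 'GRE Score', 'TOEFL Score' and 'University Rating'.
for column in ['GRE Score', 'TOEFL Score', 'University Rating']:
    fig = sns.distplot(df[column], kde=False)
    plt.title("Distribution of " + column)
    plt.show()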
fig = sns.distplot(df['SOP'], kde=False)
plt.title("Distribution of SOP")
plt.show()
# Show CGPA
fig = sns.distplot(df['CGPA'], kde=False)
plt.title("Distribution of CGPA")
plt.show()
# It is clear from the distributions that students with varied merit apply to the university.
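# Understanding the relation between the different factors responsible for graduate admissions
# GRE Score vs TOEFL Score
# regplot(): plot data and a linear regression model fit.
# Assumes the CSV columns are named "GRE Score" and "TOEFL Score".
fig = sns.regplot(x="GRE Score", y="TOEFL Score", data=df)
plt.title("GRE Score vs TOEFL Score")
plt.show()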
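# People with higher GRE scores also tend to have higher TOEFL scores, which is plausible because
# both TOEFL and GRE have a verbal section; the sections are not identical but are related.
# GRE Score vs CGPA (assumes the CSV columns "GRE Score" and "CGPA")
fig = sns.regplot(x="GRE Score", y="CGPA", data=df)
plt.title("GRE Score vs CGPA")
plt.show()
# Although there are exceptions, people with a higher CGPA usually have higher GRE scores, maybe
# because they are smart or hard-working.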
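# hue: variables that define subsets of the data, which will be drawn on separate facets in the grid.
# Assumes the CSV columns "CGPA", "LOR " (with a trailing space) and "Research".
fig = sns.lmplot(x="CGPA", y="LOR ", data=df, hue="Research")
plt.title("LOR vs CGPA")
plt.show()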
# LORs (Letter of Recommendation strength) are not strongly related to CGPA, which suggests that a
# person's LOR does not depend on that person's academic excellence.
# Having research experience is usually related to a good LOR, which might be explained by the fact
# that supervisors interact personally with students performing research, and this usually results
# in good LORs.
# GRE Score vs LOR (assumes the CSV columns "GRE Score", "LOR " with a trailing space, and "Research")
fig = sns.lmplot(x="GRE Score", y="LOR ", data=df, hue="Research")
plt.title("GRE Score vs LOR")
plt.show()
# GRE scores and LORs are also not strongly related. People with different kinds of LORs have all
# kinds of GRE scores.
# SOP vs CGPA
fig = sns.regplot(x="CGPA", y="SOP", data=df)
plt.title("SOP vs CGPA")
plt.show()
# CGPA and SOP are not strongly related, because the Statement of Purpose is not itself a measure
# of academic performance. However, people with a good CGPA tend to be more hard-working, so they
# have good things to say in their SOP, which might explain the slight shift towards higher CGPA
# along with good SOPs.
plt.show()
# SOP vs TOEFL (assumes the CSV column "TOEFL Score")
fig = sns.regplot(x="TOEFL Score", y="SOP", data=df)
plt.title("SOP vs TOEFL")
plt.show()
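import numpy as np
corr = df.corr()
print(corr)
# Mask the upper triangle (including the diagonal) so that each pair of columns is shown once.
dropSelf = np.zeros_like(corr, dtype=bool)
dropSelf[np.triu_indices_from(dropSelf)] = True
# The palette hues (220, 10) are assumed; only as_cmap=True survives in the source.
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f", mask=dropSelf)
plt.show()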
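# Split the data into features X and target y, then into training and test sets.
# Assumptions: the target is the last column of the CSV and is converted to 0/1 with a 0.5
# threshold (the assignment only states that it must be changed to 0 or 1); 20% of the rows
# are held out for testing.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
target = df.columns[-1]
X = df.drop(columns=[target])
y = (df[target] >= 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
print("Results...")
model = DecisionTreeClassifier(random_state=0)
# Fit the model on the training set
model.fit(X_train, y_train)
# Predict the test result
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
# Fit a random forest on all rows to estimate the feature importances
classifier = RandomForestRegressor()
classifier.fit(X, y)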
# X.columns: features in the dataset
feature_names = X.columns
print(feature_names)
importance_frame = pd.DataFrame()
# Two-dimensional array format: column names
importance_frame['Features'] = X.columns
importance_frame['Importance'] = classifier.feature_importances_
# One horizontal bar per feature
positions = range(1, len(importance_frame) + 1)
plt.barh(positions, importance_frame['Importance'], align='center')
plt.yticks(positions, importance_frame['Features'])
plt.xlabel('Importance')
# Clearly, CGPA is the most important factor for graduate admissions, followed by GRE Score.
plt.title('Feature Importances')
plt.show()
Output:
Conclusion: Thus we have studied different classification techniques.