Assignment 3 - LP1

Experiment No. 3
Aim:
Assignment on Classification technique
Every year many students take the GRE exam to get admission to foreign universities. The data set contains GRE Score (out of 340), TOEFL Score (out of 120), University Rating (out of 5), Statement of Purpose strength (out of 5), Letter of Recommendation strength (out of 5), Undergraduate GPA (out of 10), Research Experience (0 = no, 1 = yes), and Admitted (0 = no, 1 = yes). Admitted is the target variable. The data set is available on Kaggle (the last column of the dataset needs to be converted to 0 or 1): https://www.kaggle.com/mohansacharya/graduate-admissions

The counselor of the firm is supposed to check whether a student will get admission or not based on his/her GRE score and academic score. To help the counselor take appropriate decisions, build a machine learning classifier using a Decision Tree to predict whether a student will get admission or not.

A. Apply data pre-processing techniques (Label Encoding, Data Transformation, ...) if necessary.
B. Perform data preparation (Train-Test Split).
C. Apply a Machine Learning Algorithm.
D. Evaluate the Model.

Theory:
Classification: Classification may be defined as the process of predicting a class or category from observed values or given data points. The categorized output can take forms such as "Black" or "White", or "spam" or "not spam". Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (Y).

Building a Classifier in Python:

Step 1: Importing the necessary Python packages
Step 2: Importing the dataset
Step 3: Organizing the data into training & testing sets
Step 4: Model evaluation
Step 5: Finding accuracy
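
A minimal end-to-end sketch of these five steps is given below, assuming scikit-learn and its bundled Iris dataset purely for illustration (neither is mandated by the assignment):

# Sketch of the five steps; scikit-learn and the Iris data are illustrative assumptions
from sklearn.datasets import load_iris                      # Step 1: import packages
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                           # Step 2: import the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # Step 3

clf = DecisionTreeClassifier().fit(X_train, y_train)        # fit the model on the training set
y_pred = clf.predict(X_test)                                # Step 4: evaluate on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))          # Step 5: finding accuracy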

Classification Algorithms Include:

Naive Bayes, Logistic regression, K-nearest neighbours, (Kernel) SVM, Decision tree
1. Logistic Regression Algorithm: It is a Machine Learning classification algorithm that is used to
predict the probability of a categorical dependent variable. In logistic regression, the dependent
variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
Logistic regression model predicts P(Y=1) as a function of X.

Logistic Regression Algorithm Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation by (1 − y):

y / (1 − y); 0 for y = 0, and infinity for y = 1

o But we need a range between −infinity and +infinity; taking the logarithm of the equation, it becomes:

log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step
2. Fitting Logistic Regression to the Training set
3. Predicting the test result
4. Test accuracy of the result (creation of a Confusion matrix)
5. Visualizing the test set result.
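
A minimal sketch of these steps, assuming scikit-learn and a small synthetic binary dataset (both are illustrative assumptions, not part of the assignment):

# Logistic Regression sketch; synthetic data used purely for illustration
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Step 1: data pre-processing (here: generate a toy binary dataset and split it)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 2: fit Logistic Regression to the training set
model = LogisticRegression().fit(X_train, y_train)

# Step 3: predict the test results
y_pred = model.predict(X_test)

# Step 4: test the accuracy of the result (confusion matrix)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))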

2. Decision Tree Algorithm: Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms.

Decision Tree Algorithm Steps:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain possible values for the best attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final node is called a leaf node.

To choose the best attribute at each split, decision trees use a technique called the Attribute Selection Measure (ASM). Popular measures include:

1. Information Gain: Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute. It calculates how much information a feature
provides us about a class. According to the value of information gain, we split the node and build
the decision tree.

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

2. Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

where S = the total set of samples, P(yes) = probability of yes, and P(no) = probability of no.

3. Gini Index: The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should be preferred over one with a high Gini index.

Gini Index = 1 − Σj Pj²
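
As a short worked illustration of these measures (a hand-rolled sketch, not part of the assignment), the formulas above can be computed directly:

import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no)); the 0*log2(0) term is taken as 0
    total = sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)
    return -total if total else 0.0

def gini(probs):
    # Gini Index = 1 - sum_j Pj^2
    return 1 - sum(p * p for p in probs)

# A 50/50 split is maximally impure; a pure node has entropy 0 and Gini 0
print(entropy(0.5, 0.5), gini([0.5, 0.5]))   # 1.0 0.5
print(entropy(1.0, 0.0), gini([1.0, 0.0]))   # 0.0 0.0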

3. SVM Algorithm: Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.

SVM Algorithm Steps:


1. Importing the dataset
2. Splitting the dataset into training and test samples
3. Classifying the predictors and target
4. Initializing Support Vector Machine and fitting the training data
5. Predicting the classes for test set
6. Attaching the predictions to test set for comparing
7. Comparing the actual classes and predictions
8. Calculating the accuracy of the predictions
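
A minimal sketch of these eight steps, assuming scikit-learn's SVC and the bundled Iris data (both illustrative assumptions, not part of the assignment):

# SVM sketch; scikit-learn and the Iris data are illustrative assumptions
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Steps 1-3: import the dataset, split it, and separate predictors from the target
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: initialize the Support Vector Machine and fit the training data
svm = SVC(kernel='rbf').fit(X_train, y_train)

# Step 5: predict the classes for the test set
y_pred = svm.predict(X_test)

# Steps 6-7: attach the predictions to the test set and compare with the actual classes
comparison = X_test.copy()
comparison['actual'], comparison['predicted'] = y_test.values, y_pred
print(comparison.head())

# Step 8: calculate the accuracy of the predictions
print("Accuracy:", accuracy_score(y_test, y_pred))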

Applications of Classification Algorithms:

1. Sentiment Analysis
2. Email Spam Classification
3. Document Classification
4. Image Classification

Code:
#To load the dataset
import pandas as pd
import matplotlib.pyplot as plt

#seaborn: for data visualization and exploratory data analysis
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

#Read the data from the csv file and store it in a dataframe
df = pd.read_csv('Admission_Predict.csv')
print(df.head(5))

##########################################################################

#Drop the irrelevant column and check if there are any null values in the dataset
df = df.drop(['Serial No.'], axis=1)

print(df.isnull().sum())

#To see the distribution of the variables of graduate applicants.
#distplot(): plots the distribution of the observations
#KDE: Kernel Density Estimate, the probability density function of a continuous random variable
#Note: distplot() is deprecated in recent seaborn releases; histplot() is the modern equivalent

#Show GRE Score

fig = sns.distplot(df['GRE Score'], kde=False)

plt.title("Distribution of GRE Scores")

plt.show()

#Show TOEFL Score

fig = sns.distplot(df['TOEFL Score'], kde=False)

plt.title("Distribution of TOEFL Scores")

plt.show()

#Show University Ratings

fig = sns.distplot(df['University Rating'], kde=False)

plt.title("Distribution of University Rating")

plt.show()

#Show SOP Ratings
fig = sns.distplot(df['SOP'], kde=False)
plt.title("Distribution of SOP Ratings")
plt.show()

#Show CGPA

fig = sns.distplot(df['CGPA'], kde=False)

plt.title("Distribution of CGPA")

plt.show()

#It is clear from the distributions that students with varied merit apply to the university.

#Understanding the relation between different factors responsible for graduate admissions

#GRE Score vs TOEFL Score
#regplot() :Plot data and a linear regression model fit.

fig = sns.regplot(x="GRE Score", y="TOEFL Score", data=df)

plt.title("GRE Score vs TOEFL Score")

plt.show()

#People with higher GRE scores also have higher TOEFL scores, which is justified because both exams have verbal sections that, although not identical, are related

#GRE Score vs CGPA

fig = sns.regplot(x="GRE Score", y="CGPA", data=df)

plt.title("GRE Score vs CGPA")

plt.show()

#Although there are exceptions, people with higher CGPA usually have higher GRE scores, maybe because they are smart or hard-working

#LOR vs CGPA, showing whether Research is 0 or 1
#lmplot(): a 2D scatterplot with an optional overlaid regression line
#hue: variable that defines subsets of the data, drawn in different colours

#Note: the column name 'LOR ' has a trailing space in the dataset
fig = sns.lmplot(x="CGPA", y="LOR ", data=df, hue="Research")

plt.title("LOR vs CGPA")
plt.show()

#LORs (Letter of Recommendation strength) are not that related to CGPA, so it is clear that a person's LOR does not depend on that person's academic excellence.

#Having research experience is usually associated with a good LOR, which might be explained by the fact that supervisors interact personally with students performing research, and this usually results in good LORs.

#GRE Score vs LOR, showing whether Research is 0 or 1

fig = sns.lmplot(x="GRE Score", y="LOR ", data=df, hue="Research")


plt.title("GRE Score vs LOR")

plt.show()

#GRE scores and LORs are also not that related. People with different kinds of LORs have all kinds of
GRE scores

#SOP vs CGPA

fig = sns.regplot(x="CGPA", y="SOP", data=df)

plt.title("SOP vs CGPA")

plt.show()

#CGPA and SOP are not strongly related, even though the Statement of Purpose reflects academic performance. Since people with a good CGPA tend to be more hard-working, they have good things to say in their SOP, which might explain the slight trend towards better SOPs at higher CGPA.

#GRE Score vs SOP

fig = sns.regplot(x="GRE Score", y="SOP", data=df)

plt.title("GRE Score vs SOP")

plt.show()

#Similarly, GRE Score and SOP are only slightly related

#SOP vs TOEFL

fig = sns.regplot(x="TOEFL Score", y="SOP", data=df)

plt.title("SOP vs TOEFL")

plt.show()

#Correlation among variables

import numpy as np

#corr(): find the pairwise correlation of all columns in the dataframe

corr = df.corr()
print(corr)

#plt.subplots(): create a figure and a set of subplots

fig, ax = plt.subplots(figsize=(8, 8))

#Make a diverging palette between two HUSL colors
#cmap: colour map
colormap = sns.diverging_palette(220, 10, as_cmap=True)

#zeros_like(): returns an array of zeros with the same shape as the given array
#dtype=bool so the array can be used directly as a heatmap mask
dropSelf = np.zeros_like(corr, dtype=bool)

#np.triu_indices_from(dropSelf): indices of the upper triangle, masked to hide duplicate cells
dropSelf[np.triu_indices_from(dropSelf)] = True

sns.heatmap(corr, cmap=colormap, linewidths=.5, annot=True, fmt=".2f", mask=dropSelf)

plt.show()

from sklearn.model_selection import train_test_split

#Features: drop the target column (note the trailing space in 'Chance of Admit ')
X = df.drop(['Chance of Admit '], axis=1)
y = df['Chance of Admit ']

#Split the data for training & testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

#Decision Tree, Random Forest, SVR and Linear Regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#These models predict a future applicant's chance of admission.
models = [['DecisionTree :', DecisionTreeRegressor()],
          ['Linear Regression :', LinearRegression()],
          ['SVM :', SVR()]]

print("Results...")

#For loop for generating the model results
for name, model in models:
    #Fit the model on the training data
    model.fit(X_train, y_train)

    #Predict on the test set
    predictions = model.predict(X_test)

    #RMSE: difference between the actual and predicted values
    print(name, np.sqrt(mean_squared_error(y_test, predictions)))
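
#The Aim asks for a binary Decision Tree classifier (last column converted to 0/1).
#A minimal sketch of that variant follows; the 0.5 threshold on 'Chance of Admit '
#is an assumption for illustration, not part of the original code.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

#Convert the continuous target to 0/1: admitted if chance > 0.5 (assumed threshold)
y_class = (df['Chance of Admit '] > 0.5).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y_class, test_size=0.20, shuffle=False)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(Xc_train, yc_train)
yc_pred = clf.predict(Xc_test)

print("Accuracy:", accuracy_score(yc_test, yc_pred))
print("Confusion matrix:\n", confusion_matrix(yc_test, yc_pred))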

classifier = RandomForestRegressor()

classifier.fit(X,y)

#X.columns: the features in the dataset
feature_names = X.columns
print(feature_names)

#Initialize an empty DataFrame for the feature importances
importance_frame = pd.DataFrame()

#First column: the feature names
importance_frame['Features'] = X.columns

#classifier.feature_importances_: impurity-based importance of each feature for predicting admission
importance_frame['Importance'] = classifier.feature_importances_

#Sort the features in ascending order of importance, so the most important appears at the top of the bar chart
importance_frame = importance_frame.sort_values(by=['Importance'], ascending=True)

#Visualize the 7 feature importances
#barh(): plots horizontal bars
plt.barh([1,2,3,4,5,6,7], importance_frame['Importance'], align='center', alpha=0.5)

#yticks: set the feature labels on the y axis
plt.yticks([1,2,3,4,5,6,7], importance_frame['Features'])

plt.xlabel('Importance')

#Clearly, CGPA is the most important factor for graduate admissions, followed by GRE Score.

plt.title('Feature Importances')

plt.show()

Output:
Conclusion: Thus, we have studied different classification techniques and applied them to predict graduate admissions.
