Salary Estimation Using K-Nearest Neighbour

The document outlines a process for predicting whether a job applicant's salary is above or below 50K based on various features such as age, education, capital gain, and hours worked per week. It details steps including data collection, dataset loading, feature mapping, dataset segregation, and scaling to ensure equal contribution of features. The K-Nearest Neighbor algorithm is employed for classification, with emphasis on finding the optimal number of neighbors and validating the model's accuracy.

1. Finding the Problem - Application: Company HR

To Predict: whether a job applicant gets a salary above 50K or not, from previous records.
Input: Age, Education Number, Capital Gain & Hours/week
Output: Salary above/below 50K

2. Collecting Dataset

Based on Age, Education Number, Capital Gain and Hours per week, estimate whether the salary is above 50K or below 50K.

3. Load & Summarize Dataset

Load the CSV-format dataset from the directory with pandas and summarize details such as the number of rows and columns and the content.

import pandas
dataset = pandas.read_csv('dataset.csv')   # load CSV format dataset
dataset.shape                              # no. of rows & columns
dataset.head(5)                            # display first 5 rows of the dataset

4. Mapping Data from Text to Binary Numbers

Function: .map
Since the salary data is text of the kind <=50K or >50K, we need to map <=50K as 0 and >50K as 1, as sketched below.
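
A minimal sketch of this mapping step, assuming the label column is named 'income' (the actual column name depends on the CSV):

dataset['income'] = dataset['income'].map({'<=50K': 0, '>50K': 1})   # text labels -> 0/1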

5. Segregating Dataset into X & Y

iloc - helps us select values that belong to particular rows or columns.
SYNTAX: dataset.iloc[:, start_col:end_col]

X = dataset.iloc[:, :-1].values   # all columns except the last (features)
Y = dataset.iloc[:, -1].values    # the last column (label)

6. Splitting Dataset into Train & Test

Useful for validation.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

7. Feature Scaling

PROBLEM: Since the features have different scales, there is a chance that higher weightage is given to features with higher magnitude. This will impact the performance of the machine learning algorithm, and obviously we do not want our algorithm to be biased towards one feature.

SOLUTION: We scale our data to make all the features contribute equally to the result.

Types:

Standardization - the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Normalization - a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
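
A minimal standardization sketch with scikit-learn's StandardScaler, assuming the train/test split from step 6 (the fitted scaler sc is reused for new data in step 12):

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit on training data, then scale it
X_test = sc.transform(X_test)         # scale test data with the same mean/std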

8. Algorithm: K-Nearest Neighbor

Based on the Minkowski distance metric, we classify the data points:
p = 1 , Manhattan Distance
p = 2 , Euclidean Distance

Euclidean Distance - calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance - the distance between real vectors, calculated as the sum of their absolute differences.

Hamming Distance - used for categorical variables. If the value (x) and the value (y) are the same, the distance D is equal to 0; otherwise D = 1.
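
A sketch of the classifier setup matching these notes; metric = 'minkowski' with p = 2 gives Euclidean distance, and the n_neighbors value here is only a placeholder until step 9 finds the best K:

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)   # placeholder K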

9. Finding the Best Number of Neighbors (K-Value)

Choose the K value where we get the least mean error. From the error plot (figure not reproduced here), we can observe that for K values in the range 15 to 35 the mean error is low.
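
A sketch of the usual K-search loop, assuming the scaled split from the earlier steps; it records the mean error on the test set for each candidate K:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
error = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error.append(np.mean(pred != y_test))   # fraction of misclassified test points
best_k = error.index(min(error)) + 1        # K with the least mean error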

10. Training

Training our model on the pre-processed dataset.

model.fit(X_train, y_train)

11. Validation

Obtaining the accuracy of the model with a confusion matrix.
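
A minimal validation sketch using scikit-learn's metrics module:

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # rows = actual class, columns = predicted class
print(accuracy_score(y_test, y_pred))     # fraction of correct predictions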

12. Prediction

Observing how our model classifies our new data.

result = model.predict(sc.transform(newEmp))
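
For example, with a hypothetical new applicant (the values and the feature order Age, Education Number, Capital Gain, Hours/week are illustrative only):

newEmp = [[40, 10, 0, 45]]   # hypothetical applicant features
result = model.predict(sc.transform(newEmp))
print("Salary above 50K" if result[0] == 1 else "Salary 50K or below")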
