Here's a Visualization of the K-Nearest Neighbors Algorithm
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier that uses proximity
to classify, or make predictions about the grouping of, an individual data point.
In this case, we have data points of Class A and B. We want to predict what the question mark box (test data point) is. If we consider a k
value of 1 (1 nearest data point), we will obtain a prediction of Class A.
The choice of k therefore matters, because a different value of k can change the prediction. The diagram should give you a sense of what
the K-Nearest Neighbors algorithm does: it considers the k nearest neighbors (data points) when it predicts the class of the test point.
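The idea described above can be sketched from scratch in a few lines. This is a minimal illustration, not the implementation used later in this notebook: the function name knn_predict and the toy points are hypothetical, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one Class A point very close to the query,
# three Class B points somewhat farther away
X_toy = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
y_toy = np.array(["A", "B", "B", "B"])
query = np.array([1.0, 0.9])

pred_k1 = knn_predict(X_toy, y_toy, query, k=1)
pred_k3 = knn_predict(X_toy, y_toy, query, k=3)
print(pred_k1, pred_k3)
```

With k=1 only the single closest point votes, while with k=3 the two Class B neighbors outvote it, which is why the value of k has to be chosen with care.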
Let's download and import the customer dataset using the pandas read_csv() method.
Download Dataset
Our objective is to build a classifier to predict the class of unknown cases. We will use a specific type of classification called
K-nearest neighbors.
   region  tenure  age  marital  address  income  ed  employ  retire  gender  reside  custcat
0       2      13   44        1        9    64.0   4       5     0.0       0       2        1
1       3      11   33        1        7   136.0   5       5     0.0       0       6        4
2       3      68   52        1       24   116.0   1      29     0.0       1       2        3
3       2      33   33        0       12    33.0   2       0     0.0       1       1        1
4       2      23   30        1        9    30.0   1       2     0.0       0       4        3
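As a rough sketch of the loading step, the five rows shown above can be read with pandas read_csv() from an in-memory string. The variable names csv_text and sample_df are hypothetical, and the string stands in for the downloaded file, whose location is not given here.

```python
import io
import pandas as pd

# The five preview rows written out as CSV text (a stand-in for the downloaded file)
csv_text = """region,tenure,age,marital,address,income,ed,employ,retire,gender,reside,custcat
2,13,44,1,9,64.0,4,5,0.0,0,2,1
3,11,33,1,7,136.0,5,5,0.0,0,6,4
3,68,52,1,24,116.0,1,29,0.0,1,2,3
2,33,33,0,12,33.0,2,0,0.0,1,1,1
2,23,30,1,9,30.0,1,2,0.0,0,4,3
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.head())
print(sample_df.shape)
```

With the real file, the same call would take the file path or URL instead of the StringIO object.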
Data Exploration
Let's start with a descriptive exploration of our data.
df.describe()
          region       tenure          age      marital      address       income           ed       employ       retire       gender      reside
count  1000.0000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.00000
mean      2.0220    35.526000    41.684000     0.495000    11.551000    77.535000     2.671000    10.987000     0.047000     0.517000     2.33100
std       0.8162    21.359812    12.558816     0.500225    10.086681   107.044165     1.222397    10.082087     0.211745     0.499961     1.43579
min       1.0000     1.000000    18.000000     0.000000     0.000000     9.000000     1.000000     0.000000     0.000000     0.000000     1.00000
25%       1.0000    17.000000    32.000000     0.000000     3.000000    29.000000     2.000000     3.000000     0.000000     0.000000     1.00000
50%       2.0000    34.000000    40.000000     0.000000     9.000000    47.000000     3.000000     8.000000     0.000000     1.000000     2.00000
75%       3.0000    54.000000    51.000000     1.000000    18.000000    83.000000     4.000000    17.000000     0.000000     1.000000     3.00000
max       3.0000    72.000000    77.000000     1.000000    55.000000  1668.000000     5.000000    47.000000     1.000000     1.000000     8.00000
df['custcat'].value_counts()
3 281
1 266
4 236
2 217
Name: custcat, dtype: int64
That is, 281 Plus Service (class 3), 266 Basic Service (class 1), 236 Total Service (class 4), and 217 E-Service (class 2) customers.
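To make the counts easier to read, the numeric codes can be mapped to names before counting. The sketch below assumes the code-to-name mapping implied by matching the counts to the summary line above; the names custcat_labels and custcat are hypothetical, and the series is a synthetic stand-in for df['custcat'] with the same class counts.

```python
import pandas as pd

# Assumed mapping, inferred by matching each count to the summary sentence
custcat_labels = {1: "Basic Service", 2: "E-Service", 3: "Plus Service", 4: "Total Service"}

# Synthetic stand-in for df['custcat'] with the reported class counts
custcat = pd.Series([3] * 281 + [1] * 266 + [4] * 236 + [2] * 217, name="custcat")

# Replace each code with its name, then count occurrences (sorted by frequency)
labeled_counts = custcat.map(custcat_labels).value_counts()
print(labeled_counts)
```

On the real DataFrame the same idea would read df['custcat'].map(custcat_labels).value_counts().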
df.hist(column='income', bins=50)
array([[<AxesSubplot:title={'center':'income'}>]], dtype=object)
df.columns
df.hist(column='tenure', bins=50)
array([[<AxesSubplot:title={'center':'tenure'}>]], dtype=object)
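X = df.drop(columns=['custcat']).values  # assumption: features are all columns except the target; the cell defining X is missing here
y = df['custcat'].values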
Normalization
Data standardization gives the data zero mean and unit variance. It is good practice, especially for algorithms such as KNN that are
based on the distance between data points.
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit(X).transform(X.astype(float))
type(X)
numpy.ndarray
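To see what standardization does, the short sketch below computes z-scores by hand and compares them with StandardScaler. The array X_demo is a small illustrative matrix with two columns on very different scales; it is not the notebook's X.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: two columns on very different scales
X_demo = np.array([[44.0, 64.0], [33.0, 136.0], [52.0, 116.0], [33.0, 33.0], [30.0, 30.0]])

# Manual z-score: subtract each column's mean, divide by its population standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler performs the same column-wise computation
X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Without this step, the column with the larger numeric range would dominate the distance calculation and therefore the choice of neighbors.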
Let's split our dataset into train and test sets. Around 80% of the entire dataset will be used for training and 20% for testing.
We know the true outcome of each data point in the test set, which makes it well suited for evaluation. Because this data is not used to
train the model, the model has no knowledge of the outcome of these data points, so this is truly out-of-sample testing.
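An 80/20 split like the one described can be sketched with scikit-learn's train_test_split. The arrays X_demo and y_demo are random stand-ins for the standardized features and the custcat labels, and random_state=4 is an arbitrary choice made only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the standardized features and the class labels (1 to 4)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 11))
y_demo = rng.integers(1, 5, size=1000)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
```

Passing the notebook's own X and y in place of the stand-ins would produce the X_train, X_test, y_train and y_test used in the following cells.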
Classification
from sklearn.neighbors import KNeighborsClassifier

k = 4
# Train the model on the training set
model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
model
KNeighborsClassifier(n_neighbors=4)
y_pred = model.predict(X_test)
Evaluation
from sklearn.metrics import accuracy_score

print("Train set accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test set accuracy:", accuracy_score(y_test, y_pred))
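For reference, accuracy_score for classification is the fraction of predictions that match the true labels. The sketch below checks this on a small pair of hypothetical label arrays (y_true_demo and y_pred_demo are made up for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true_demo = np.array([1, 3, 2, 4, 3, 1, 2, 4])
y_pred_demo = np.array([1, 3, 3, 4, 1, 1, 2, 2])

# Fraction of positions where prediction and truth agree
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, sklearn_accuracy)
```

Comparing train and test accuracy side by side, as the cell above does, is a quick way to spot a model that fits the training data much better than unseen data.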
import numpy as np

Ks = 20
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train the model with n neighbors and predict on the test set
    model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mean_acc[n - 1] = accuracy_score(y_test, y_pred)
    # Standard error of the accuracy estimate
    std_acc[n - 1] = np.std(y_pred == y_test) / np.sqrt(y_pred.shape[0])
mean_acc
import matplotlib.pyplot as plt

plt.plot(range(1, Ks), mean_acc, 'g')
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1, Ks), mean_acc - 3 * std_acc, mean_acc + 3 * std_acc, alpha=0.10, color="green")
plt.legend(('Accuracy', '+/- 1xstd', '+/- 3xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax() + 1)
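The shaded bands in the plot are built from std_acc, which is the standard error of the accuracy estimate on the test set. The sketch below shows, on a hypothetical vector of per-sample outcomes, that the formula used in the loop matches the closed form sqrt(p * (1 - p) / n) for 0/1 data; the numbers are made up and are not results from this dataset.

```python
import numpy as np

# Hypothetical per-sample correctness: True where the prediction matched the label
correct = np.array([True] * 64 + [False] * 136)

accuracy = correct.mean()
# Same formula as in the loop: standard deviation of the 0/1 outcomes over sqrt(n)
std_err = np.std(correct) / np.sqrt(correct.shape[0])
# For 0/1 outcomes this equals sqrt(p * (1 - p) / n)
closed_form = np.sqrt(accuracy * (1 - accuracy) / correct.size)

print(accuracy, std_err, closed_form)
```

Because the standard error shrinks with the square root of the test set size, small differences in accuracy between neighboring values of k can fall well inside the bands.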
Thank you
Author
Moazzam Ali