KNN Algorithm in Machine Learning

The document provides an introduction to the K-nearest neighbors (KNN) machine learning algorithm. It explains that KNN is a simple supervised learning method used for classification problems. It works by finding the K closest training examples in the feature space and assigning the data point to the most common class among its K neighbors. The document discusses why KNN is useful, what it is, how to choose the K value, when it should be used, and how the KNN algorithm makes predictions by calculating distances to neighbors.

Introduction to KNN (K-Nearest Neighbor)

Hello and welcome to this K-Nearest Neighbors algorithm tutorial. My name is Richard Kirchner and I'm with the Simplilearn team. Today we're going to cover the K-Nearest Neighbors algorithm, often referred to as KNN. KNN is really a fundamental place to start in machine learning: it's the basis of a lot of other techniques, and the logic behind it is easy to understand and to incorporate into other forms of machine learning. So here's what's in it for you today: why do we need KNN, what is KNN, how do we choose the factor K, when do we use KNN, how does the KNN algorithm work, and then we'll dive into my favorite part, the use case: predicting whether a person will have diabetes or not. That is a very common and popular data set for testing out models and learning how to use different models in machine learning.
Why do we need KNN?
By now we all know that machine learning models make predictions by learning from the past data available. We have our input values, our machine learning model builds on those inputs of what we already know, and we use that to create a predicted output. "Is that a dog?" asks the little kid watching the black cat cross their path. "No, dear, you can differentiate between a cat and a dog based on their characteristics." Cats have sharp claws that they use to climb, smaller ears, and they meow and purr, and they don't love to play around as much. Dogs have dull claws, bigger ears, they bark, and they love to run around; you usually don't see a cat running around people the way dogs do (though maybe you have a cat that does that). So we can evaluate how sharp the claws are, and we can evaluate the length of the ears, and we can usually sort out cats from dogs based on even just those two characteristics. Now tell me whether this one is a cat or a dog. Not a hard question; usually little kids know cats and dogs by sight. But now imagine a place where there aren't many cats or dogs. If we look at the sharpness of the claws and the length of the ears, and we see that this animal has smaller ears and sharper claws than the other animals, its features are more like a cat's, so it must be a cat: sharp claws, short ears, and it goes in the cat group. Because KNN is based on feature similarity, we can do classification using a KNN classifier. So we have our input value, the picture of the black cat; it goes into our trained model, and the model predicts that this is a cat. So what is KNN? What is the KNN algorithm?
What is KNN?
K-Nearest Neighbors, which is what KNN stands for, is one of the simplest supervised machine learning algorithms, mostly used for classification: we want to know whether this is a dog or not a dog, a cat or not a cat. It classifies a data point based on how its neighbors are classified. KNN stores all available cases and classifies new cases based on a similarity measure. And here we've gone from cats and dogs right into wine, another favorite of mine. Here you see a measurement of sulfur dioxide versus chloride level, with the different wines they've tested plotted on that graph according to how much sulfur dioxide and how much chloride each contains. The K in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process. So if we add a new glass of wine, red or white, we want to know who its neighbors are. In this case we'll set K equal to five (we'll talk about K in just a minute). A data point is classified by the majority of votes from its five nearest neighbors; here the unknown point would be classified as red, since four out of five neighbors are red. So how do we choose K? How do we know K should equal five? What value do we put in there?
How do we choose the factor 'K'?
Let's talk about how we choose the factor K. The KNN algorithm is based on feature similarity, and choosing the right value of K is a process called parameter tuning, which is important for better accuracy. At K equals three, we can classify the question mark in the middle as either a square or, in this case, a triangle: if we set K to three, we look at the three nearest neighbors and say this is a square, while if we set K to seven, we classify it as a triangle, depending on the other data around it. You can see that as K changes, and depending on where that point sits, the answer can change drastically. So how do we choose the factor K? You'll run into this throughout machine learning: choosing these factors is the part where you ask yourself, did I choose the right K, did I set my values right in whatever machine learning tool I'm using, so that I don't have a huge bias in one direction or the other. In terms of KNN, if you choose K too low, the result is too noisy: the prediction depends on just the couple of points right next to the new one, and you might get a skewed answer. And if your K is too big, it's going to take forever to process, so you'll run into processing and resource issues. The most common approach (there are other options for choosing K) is to use the square root of n, where n is the total number of values you have. In most cases you also want your K value to be odd: if K is even, as with the squares and triangles here, you can end up with a tie between two classes with equal votes, so an odd K helps the vote resolve cleanly. So you usually take the square root of n, and if that comes out even you add one to it or subtract one from it, and that's where you get your K value. That is the most common method, and it's pretty solid; it works very well.
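
As a quick illustration, here is a minimal sketch of that rule of thumb in Python (the helper name and the example call are mine, not from the tutorial):

    import math

    def choose_k(n_samples):
        # Rule-of-thumb K: square root of n, nudged to an odd number
        k = int(math.sqrt(n_samples))
        if k % 2 == 0:   # an even K can produce tied votes
            k -= 1       # make it odd (adding 1 works too)
        return max(k, 1)

    print(choose_k(154))  # sqrt(154) is about 12.4 -> 12, made odd -> 11

Called with 154 samples (the size of the 20% test split we end up with later in the use case), it lands on the same K of 11 the tutorial uses.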
When do we use KNN?
We can use KNN when the data is labeled: you need labels on it, such as a group of pictures labeled dog or cat. The data should also be noise-free. You can see here that when we have a class entry like "underweight, 140, 23, Hello Kitty, normal," that's pretty confusing; there's a variety of data coming in, so it's very noisy, and that would cause an issue. And the data set should be small. We're usually working with smaller data sets; you might push toward a gigabyte of data if it's really clean and doesn't have a lot of noise, because KNN is a lazy learner, i.e. it doesn't learn a discriminative function from the training set. So if you have very complicated data and a large amount of it, you're not going to use KNN. But it's a really great place to start: even with large data you can pull out a small sample and get an idea of what it looks like using KNN, and on smaller data sets it works really well.
How does the KNN algorithm work?
How does the KNN algorithm work? Consider a data set having two variables, height in centimeters and weight in kilograms, where each point is classified as Normal or Underweight. On the basis of the given data, we have to classify a new point as Normal or Underweight using KNN. So new data comes in at 57 kilograms and 170 centimeters: is that going to be Normal or Underweight? To find the nearest neighbors, we'll calculate the Euclidean distance. According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is

    d = sqrt((x - a)^2 + (y - b)^2)

You can remember that from the two edges of a right triangle: since we know the x side and the y side, we're computing the third edge, the hypotenuse. Let's calculate it to understand clearly. We have our unknown point, placed in red, among the other scattered data points. The distance d1 is the square root of (170 - 167)^2 plus (57 - 51)^2, which is about 6.7; distance d2 comes out to about 13, and distance d3 to about 13.4. Similarly, we calculate the Euclidean distance of the unknown data point, where x1 and y1 equal 57 and 170, from all the points in the data set whose class we know. Because we're dealing with a small amount of data, that's not hard to do; it's quick for a computer, and it's not complicated math. Now let's find the nearest neighbors at K equals 3. We can see the three closest neighbors put it at Normal, and that's pretty self-evident when you look at the graph: we're just voting, Normal, Normal, Normal, three votes for Normal. The majority of neighbors point toward Normal, hence, as per the KNN algorithm, the class of (57, 170) should be Normal. So, a recap of KNN: a positive integer K is specified along with a new sample; we select the K entries in our database which are closest to the new sample; we find the most common classification of those entries; and that is the classification we give to the new sample. As you can see, it's pretty straightforward: we're just looking for the closest things that match what we've got.
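
To make that recap concrete, here is a minimal sketch of the whole procedure in plain Python. The (51, 167) point and the distances 6.7, 13, and 13.4 come from the worked example above; the remaining points and their labels are illustrative assumptions of mine:

    import math
    from collections import Counter

    # (weight_kg, height_cm) -> class label
    training_data = [
        ((51, 167), "Underweight"),  # d1 ~ 6.7 in the worked example
        ((62, 182), "Normal"),       # d2 ~ 13.0
        ((69, 176), "Normal"),       # d3 ~ 13.4
        ((58, 169), "Normal"),       # assumed extra points
        ((60, 172), "Normal"),
        ((56, 174), "Underweight"),
    ]

    def euclidean(p, q):
        # d = sqrt((x - a)^2 + (y - b)^2)
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    def knn_classify(new_point, data, k=3):
        # Sort training points by distance to the new point,
        # then let the K nearest labels vote
        neighbors = sorted(data, key=lambda item: euclidean(new_point, item[0]))
        labels = [label for _, label in neighbors[:k]]
        return Counter(labels).most_common(1)[0][0]

    print(knn_classify((57, 170), training_data, k=3))  # prints: Normal

With these points, the three nearest neighbors vote two Normal to one Underweight, so the new point comes out Normal, matching the walkthrough.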
Use case - Predict whether a person will have diabetes or not
So let's take a look at what that looks like in a use case in Python. Let's dive into the predict-diabetes use case. The objective: predict whether a person will be diagnosed with diabetes or not. We have a data set of 768 people who were or were not diagnosed with diabetes. Let's go ahead and open that file and take a look at the data. It's in a simple spreadsheet format; the data itself is comma separated, a very common way to receive data. You can see here we have columns A through I: that's eight columns, each with a particular attribute, and then the ninth column, the outcome, which is whether they have diabetes. As a data scientist, the first thing you should be looking at is insulin: if someone is taking insulin, they already have diabetes, that's why they're taking it, and that could cause an issue in some machine learning setups; but for a very basic setup this works fine for doing KNN. The next thing you notice is that it didn't take much to open the file. I can scroll down to the bottom of the data: there are 768 rows, so it's a pretty small data set. At that size I can easily fit it into the RAM on my computer, look at it, and manipulate it without really taxing a regular desktop computer; you don't even need an enterprise setup to run a lot of this. So let's start with importing all the tools we need.
Before that, of course, we need to discuss what IDE I'm using. You can certainly use any editor for Python, but for doing very basic visual work I like to use Anaconda, which is great for doing demos with the Jupyter Notebook. Just a quick view of the Anaconda Navigator, whose new release is really nice: under Home I can choose my application, and we're going to be using Python 3.6 (I have a couple of different versions on this particular machine). If I go under Environments, I can create a unique environment for each version, which is nice, and there's even a button there for installing different packages: if I click on it and open a terminal, I can use a simple pip install to add the packages I'm working with. Let's go back to Home and launch our notebook. Like on the old cooking shows, I've already prepared a lot of my stuff, so we don't have to wait for it to launch; it takes a few minutes to open a browser window, in this case Chrome, since that's the default I use. And since the script is pre-done, you'll see I have a number of windows open at the top, including the one we're working in. Since we're working on KNN to predict whether a person will have diabetes or not, let's go ahead and put that title in there.
I'm also going to go up here and click on Cell; actually, we first want to insert a cell below, and then I'll go back up to the top cell and change its cell type to Markdown. That means it's not going to run as Python; it's a markup language, so when I run that first cell it comes up in nice big letters, which is a handy reminder of what we're working on. By now you should be familiar with doing all of our imports: we're going to import pandas as pd and import numpy as np. Pandas gives us the pandas dataframe and numpy gives us the number array, both very powerful tools to use in here. So those are our two general Python tools. Then you can see over here we have our train_test_split: by now you should be familiar with splitting the data, where we use part of it for training our particular model and then test the remaining data to see how good the model is. From preprocessing we take the StandardScaler, so we don't have a bias of really large numbers; remember, in this data the number of pregnancies isn't going to get very large, while the amount of insulin goes up to 256, and 256 versus 6 will skew results, so we want to rescale the columns onto a comparable range. Then the actual tool, the KNeighborsClassifier we're going to use. And finally, the last three are all about testing our model, how good is it: the confusion matrix, the F1 score, and the accuracy score. So we have our two general Python modules and our six imports specific to the scikit-learn setup, and then we do need to go ahead and run this cell so everything is actually imported. There we go.
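
Reconstructed from that walkthrough, the import cell looks roughly like this (the module paths are the standard scikit-learn ones; the video shows the cell rather than reading every path aloud):

    import pandas as pd
    import numpy as np

    # Splitting data into training and testing sets
    from sklearn.model_selection import train_test_split
    # Rescaling features so large columns don't dominate
    from sklearn.preprocessing import StandardScaler
    # The KNN classifier itself
    from sklearn.neighbors import KNeighborsClassifier
    # Tools for evaluating the trained model
    from sklearn.metrics import confusion_matrix, f1_score, accuracy_score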
Then we move on to the next step: loading the database. We're going to use pandas (remember, pandas as pd) and take a look at the data in Python; we looked at it in a simple spreadsheet, but usually I like to also pull it up in the notebook so we can see what we're doing. So here's our dataset = pd.read_csv(...), a pandas command; the diabetes file I just put in the same folder where my IPython script is, so if you put it in a different folder you'll need the full path there. We can also do a quick length of the data set with len, a simple Python command; let's go ahead and print that. If you put len(dataset) on its own line in a notebook, it prints automatically, but in most other setups you want the print in front, so I keep the print statement there. Then we want to take a look at the actual data, and since we're in pandas we can simply do dataset.head(); again, let's add the print around it. If you put a bunch of these in a row, say dataset1.head() and then dataset2.head(), only the last one prints out, which is why I always used to keep the print statement in there; but because most projects only use one pandas dataframe, either way works just fine. And you can see when we hit the Run button, we have the 768 lines, which we knew, and our pregnancies column, with a row label automatically added on the left. Remember, head only shows the first five lines, so we have rows zero through four. A quick look at the data shows it matches what we looked at before: we have pregnancies, glucose, blood pressure, all the way to age, and then the outcome on the end.
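
As a sketch, that cell comes out to just a few lines ('diabetes.csv' is my assumption for the file name; the video only says the file sits next to the notebook):

    # Load the comma-separated data file from the notebook's folder
    dataset = pd.read_csv('diabetes.csv')

    print(len(dataset))     # 768 rows
    print(dataset.head())   # first five rows, pregnancies through outcome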
We're going to do a couple of things in this next step. We're going to create a list of columns that can't be zero: there's no such thing as zero skin thickness, zero blood pressure, or zero glucose; with any of those you'd be dead. So a zero there isn't a real measurement; it means they didn't have the data, and we're going to start replacing that missing information. First we create a list: as you can see, it holds the values we talked about, glucose, blood pressure, skin thickness, and so on. Listing the columns you need to transform is a nice pattern when you're working with columns, and a very common thing to do. For this particular setup we certainly could use some of the pandas tools that handle missing values directly, but we're going to do it as dataset[column] = dataset[column].replace(...). This is still pandas, and there are a lot of different options here, but np.nan, numpy's NaN, stands for "not a number": the value is none, it doesn't exist. So the first thing we do is replace each zero with a numpy NaN, which says "there's no data there." If it's a zero, the person is, well, hopefully not dead; they just didn't get the measurement. The next thing we do is compute the mean, as an integer, from the column while skipping the NaNs; skipna is a pandas option on mean. So we figure out the mean of that column, and then we take the column and replace all the np.nan values with that mean. Why do it in two steps? You could skip the zeros and replace them directly, but this way makes it explicit that we first switch zeros to a nonexistent value and then compute the mean without them. The mean is the average person, so if we don't know a value because the data is missing, one of the standard tricks is to replace it with the average, the most typical value for that column. This way you can still use the rest of the values in that row for your computation, and it more or less takes those missing values out of the equation.
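
Here's a minimal sketch of that cleaning loop. The video names glucose, blood pressure, and skin thickness; I'm assuming insulin and BMI round out the list, and that the column headers use this capitalization, since both are standard for this data set:

    # Columns where a literal 0 means "measurement missing"
    zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness',
                         'Insulin', 'BMI']

    for column in zero_not_accepted:
        # Mark zeros as missing values
        dataset[column] = dataset[column].replace(0, np.nan)
        # Column mean computed while skipping the missing entries
        mean = int(dataset[column].mean(skipna=True))
        # Fill the gaps with that average value
        dataset[column] = dataset[column].replace(np.nan, mean)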
Let's go ahead and run it. It doesn't actually output anything; we're still preparing our data. If you wanted to see the effect, nothing changes in the first few rows, so it won't show up there, but we certainly could look at a single column. Let's do that: let's print the data set's glucose column. If I run this, it prints all the different glucose levels going down, and thankfully we don't see anything that looks like missing data, at least in the rows that show; with too many lines, Jupyter Notebook skips a bunch in the middle of the printout and goes on to the end. Let me go ahead and remove that cell; we'll just zero that out. And of course, before proceeding any further, we need to split the data set into our training and testing data, so that we have something to train with and something to test on. You'll notice we did a little something here with the pandas code: we've added .iloc onto the data set. What this says, in pandas, is that within the data set we want the index location covering all rows (that's the colon) but only columns 0 to 8. Remember the ninth column, the one printed as outcome? That's not part of the training data; that's part of the answer. It's the ninth column, but it's indexed as 8, since 0 to 8 spans nine columns; and in the slice notation 0:8 the end is not included, so X actually gets columns 0 through 7. Then we go down to y, which is our answer, and we want just the last column, column 8; you can grab it with this same notation. Then, if you remember, we imported train_test_split, part of scikit-learn, and we simply put in our X and our y. We set random_state equal to 0; you don't have to seed it, that's just a seed number for the shuffle (I'd have to look up the default). And the test size, test_size, is 0.2, which simply means we're going to take 20% of the data and put it aside so that we can test it later.
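
Sketched out, that cell looks like this (variable names follow the walkthrough):

    # Features are columns 0-7; column 8 is the outcome label
    X = dataset.iloc[:, 0:8]
    y = dataset.iloc[:, 8]

    # Hold out 20% of the rows for testing; random_state=0 seeds the shuffle
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0, test_size=0.2)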
That's all that is, and again we're going to run it. Not very exciting so far: we haven't had any printout other than looking at the data. But a lot of this work is prepping the data; once you've prepped it, the actual lines of model code are quick and easy. We're almost there, but before actually running our KNN we need to go ahead and scale the data. If you remember, we're fitting the data with a standard scaler, which means that instead of one column running from, say, five to three hundred and the next column running from one to six, we standardize all the columns onto one comparable scale centered around zero. That's what the standard scaler does: it keeps things standardized. And we only want to fit the scaler on the training set, but we want to make sure the testing set, the X_test going in, is also transformed, so it's processed the same way. So here we go with our standard scaler: we're going to call it sc_X, and we assign the StandardScaler to that variable. Then our X_train equals sc_X.fit_transform, so we're fitting the scaler on the X_train variable and transforming it; and our X_test we only transform. So we've fit and transformed X_train, while X_test isn't part of fitting the transformer: it just gets transformed, that's all. And again, we're going to go ahead and run this.
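
The corresponding cell, as a sketch:

    # Fit the scaler on the training features only, then apply the
    # same transform to the test features so both share one scale
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)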
If you look at this, we've now gone through all three of those steps: we've taken care of replacing our zeros in the key columns that shouldn't be zero, replacing them with the means of those columns so they fit right in with our data model; we've split the data, so now we have our test data and our training data; and we've scaled the input data. Note that we don't train or transform the y part: y_train and y_test never need scaling; it's only the data going in that we transform. Then we define the model using the KNeighborsClassifier and fit the training data to the model. We do all that data prep, and you can see down here we only need a couple of lines of code to actually build our model and train it. That's one of the cool things about Python and how far we've come; it's such an exciting time to be in machine learning because there are so many automated tools. Before we do this, let's do a quick length check: len(y) gives us 768, and if we import math we can do math.sqrt on it; actually, we want the test set, so let's do import math and math.sqrt(len(y_test)). When I run that we get 12.409, and I want to show you where the next number comes from. Twelve is an even number, and if you're ever voting on things, remember that the neighbors all vote: you don't want an even number of neighbors voting, so we want something odd. Let's just take one away and make it 11, and let me delete that scratch cell out of here (this is one of the reasons I love Jupyter Notebook: you can flip around and do all kinds of things on the fly). So now we're creating our classifier, and it's going to be the KNeighborsClassifier with n_neighbors equal to 11 (remember, we did 12 minus 1 to get an odd number of neighbors), p equal to 2, the power parameter of the Minkowski distance, where 2 gives the Euclidean distance, and the metric set to Euclidean. There are other ways of measuring the distance, all kinds of them, but Euclidean is the most common one and it works quite well.
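
As a sketch, the scratch calculation and the classifier cell look like this:

    import math

    # Rule-of-thumb K: square root of the number of test samples
    print(math.sqrt(len(y_test)))  # 12.409..., so use 12 - 1 = 11

    # 11 neighbors; p=2 with the Minkowski metric gives Euclidean distance
    classifier = KNeighborsClassifier(n_neighbors=11, p=2,
                                      metric='euclidean')
    classifier.fit(X_train, y_train)

    # Predict the held-out 20% of the data
    y_pred = classifier.predict(X_test)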
It's important to evaluate the model, so let's use the confusion matrix to do that. The confusion matrix is a wonderful tool, and then we'll jump into the F1 score and finally the accuracy score, which is probably the most commonly quoted number when you go into a meeting. So let's go ahead and paste that in: we set cm equal to confusion_matrix(y_test, y_pred), those are the two values we put in, and let me run that and print it out. The way you interpret this is that the predicted classes run across the top and the actual classes run down the side. The diagonal down the middle is the important part: it's where the prediction and the actual value agreed, on 94 cases in the first row and 32 in the second. The other two numbers, the 13 and the 15, are what was wrong. (If you were looking at three different classes instead of two, you'd end up with a third row and column, and the diagonal down the middle would still be the agreement.) So in the first case, 94 people who don't have diabetes were predicted correctly, while the prediction flagged another 13 of those people as having diabetes or being at high risk; and of the people who do have diabetes, 32 were predicted correctly, but another 15 were classified incorrectly. So you can see how the classification comes through in the confusion matrix.
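
Sketched out, the evaluation cells look like this. The printed numbers are the ones reported in the walkthrough (the placement of 13 versus 15 off the diagonal is my reading of it); your exact values may differ with a different split or library version:

    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    # [[94 13]
    #  [15 32]]   rows = actual class, columns = predicted class

    print(f1_score(y_test, y_pred))        # about 0.69
    print(accuracy_score(y_test, y_pred))  # about 0.818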
Then we're going to go ahead and print the F1 score. Let me just run that, and you see we got 0.69 for our F1 score. The F1 score takes into account both sides of the balance, false positives and false negatives, whereas the accuracy score is what most people think of: it looks at just how many we got right out of the total. So if you're a data scientist talking to other data scientists, they're going to ask you what the F1 score is; if you're talking to the general public or the decision-makers in the business, they're going to ask what the accuracy is. The accuracy number here reads better than the F1 score, but the F1 score is more telling: it lets us know that there are more false positives than we would like. Still, 82% is not too bad for a quick first look at this data, running scikit-learn and KNN, the K-Nearest Neighbors classifier, on it. So we have created a model using KNN which can predict whether a person will have diabetes or not, or at the very least whether they should go get a checkup and have their glucose checked regularly. The printed accuracy score of 0.818 we can pretty much round off and say we have an accuracy of about 82%, which tells us the model is a pretty fair fit.
To pull it all together, let's make sure we covered everything we went over. We covered why we need KNN, looking at cats and dogs (great if you have a pet door and want to figure out whether it's a cat or a dog coming in, so you don't let the dog out). We covered the Euclidean distance, the simple distance computed from the two sides of a right triangle: the square root of the sum of the squares of the two sides. We covered choosing the value of K, at least the main rule of thumb people use for choosing it, and how KNN works. Then finally we built a full KNN classifier for diabetes prediction. Thank you for joining us today. For more information, visit www.subscriptorium.com
