
Supervised Learning: Classification
Nearest Neighbor Part-2
Consider the following dataset. Apply k-NN and determine the classification of a sample point with sepal length = 5.2 and sepal width = 3.1.
If k = 1: setosa
If k = 2: setosa
If k = 3: setosa
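The slide's data table is not reproduced here, but the same calculation can be sketched with R's built-in iris data as a stand-in (a hypothetical substitute; the values on the slide may differ):

```r
# Sketch of the nearest-neighbor classification above, using the
# built-in iris data in place of the slide's table.
library(class)

train  <- iris[, c("Sepal.Length", "Sepal.Width")]
labels <- iris$Species
sample_point <- data.frame(Sepal.Length = 5.2, Sepal.Width = 3.1)

# Classify the sample point for k = 1, 2, 3
for (k in 1:3) {
  pred <- knn(train = train, test = sample_point, cl = labels, k = k)
  cat("k =", k, "->", as.character(pred), "\n")
}
```

With larger k, the prediction is decided by a majority vote among the k closest training points rather than the single nearest one.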
Diagnosing Breast Cancer with
the k-NN Algorithm
• We will utilize the Wisconsin Breast Cancer
Diagnostic dataset.
• The breast cancer data includes 569 examples
of cancer biopsies, each with 32 features.
• One feature is an identification number,
another is the cancer diagnosis, and 30 are
numeric-valued laboratory measurements.
• The diagnosis is coded as "M" to indicate
malignant or "B" to indicate benign.
• Download the wisc_bc_data.csv file and save
it to your R working directory.
• Save the Wisconsin breast cancer data to the
wbcd data frame:
> wbcd <- read.csv ("wisc_bc_data.csv",
stringsAsFactors = FALSE)
• If we want to find the structure of wbcd,
execute:
> str(wbcd)
• The first variable is an integer variable named id. As
this is simply a unique identifier (ID) for each
patient in the data, it does not provide useful
information, and we will need to exclude it from
the model.
> wbcd <- wbcd[-1]
• The next variable indicates whether the example is
from a benign or malignant mass. The table()
output indicates that 357 masses are benign while
212 are malignant:
> table(wbcd$diagnosis)
• We will need to recode the diagnosis variable.
> wbcd$diagnosis<- factor(wbcd$diagnosis,
levels = c("B", "M"), labels = c("Benign",
"Malignant"))
• Now, when we look at the prop.table() output,
we notice that the values have been labeled
Benign and Malignant with 62.7 percent and
37.3 percent of the masses, respectively:
> round(prop.table(table(wbcd$diagnosis)) *
100, digits = 1)
• The remaining 30 features are all numeric.
>summary(wbcd[c("radius_mean",
"area_mean", "smoothness_mean")])
• Here we can see that area, because its values are on a much larger scale, will have a far greater impact than smoothness in the distance calculation.
• To normalize these features, we need to create a
normalize() function.
> normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
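As a quick sanity check, the normalize() function can be applied to a few toy vectors (hypothetical values, not taken from the dataset):

```r
# Min-max normalization, as defined above
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

normalize(c(1, 2, 3, 4, 5))    # 0.00 0.25 0.50 0.75 1.00

# Features on very different scales end up in the same 0-1 range,
# so an area-like feature can no longer dominate a smoothness-like
# feature in the distance calculation. Both vectors below map to
# the same normalized values:
normalize(c(250, 500, 1000))   # area-like magnitudes
normalize(c(0.05, 0.10, 0.20)) # smoothness-like magnitudes
```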
• We can now apply the normalize() function to the
numeric features in our data frame.
• The lapply() function takes a list and applies a
specified function to each list element.
> wbcd_n <- as.data.frame(lapply(wbcd[2:31],
normalize))

• To confirm that the transformation was applied correctly, let's look at one variable's summary statistics:
> summary(wbcd_n$area_mean)

• We will use the first 469 records for the training dataset and the remaining 100 to simulate new patients.
• We will split the wbcd_n data frame into wbcd_train and wbcd_test:
> wbcd_train <- wbcd_n[1:469, ]
> wbcd_test <- wbcd_n[470:569, ]
• When we constructed our normalized training and
test datasets, we excluded the target variable,
diagnosis.
• For training the k-NN model, we will need to store
these class labels in factor vectors, split between the
training and test datasets:
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]

• To classify our test instances, we will use a k-NN implementation from the class package, which provides a set of basic R functions for classification.
> install.packages("class")
• To load the package during any session in which
you wish to use the functions, execute
> library(class)

• Now we can use the knn() function to classify the test data:
> wbcd_test_pred <- knn(train = wbcd_train, test
= wbcd_test, cl = wbcd_train_labels, k = 21)
• The knn() function returns a factor vector of
predicted labels for each of the examples in the test
dataset, which we have assigned to
wbcd_test_pred.
• The next step of the process is to evaluate how well
the predicted classes in the wbcd_test_pred vector
match up with the known values in the
wbcd_test_labels vector.
• To do this, we can use the CrossTable() function in
the gmodels package.
> install.packages("gmodels")

• Load the package using,


> library(gmodels)
> CrossTable(x = wbcd_test_labels, y =
wbcd_test_pred, prop.chisq=FALSE)
Algorithm
1. Read the given dataset wbcd
2. Display the structure and analyze it
3. Remove id column
4. Find the number of B and M samples
5. Recode the B and M labels to “Benign” and
“Malignant”
6. Generate the summary of "radius_mean", "area_mean", "smoothness_mean"
7. Apply normalization (as value of area is much larger
than smoothness)
8. Use lapply() to apply normalize() to each numeric column, excluding 'diagnosis'
9. Check the values of area_mean after normalization
10. Split the dataset into training and testing phase
11. Apply knn
12. Apply the CrossTable() function to evaluate how well the predicted classes in the wbcd_test_pred vector match up with the known values in the wbcd_test_labels vector
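The twelve steps above can be collected into one script, mirroring the commands shown earlier (assumes wisc_bc_data.csv is in the working directory and the class and gmodels packages are installed):

```r
library(class)
library(gmodels)

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)   # step 1: read data
str(wbcd)                                                        # step 2: inspect structure
wbcd <- wbcd[-1]                                                 # step 3: drop id column
table(wbcd$diagnosis)                                            # step 4: count B and M
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))      # step 5: recode labels
summary(wbcd[c("radius_mean", "area_mean",
               "smoothness_mean")])                              # step 6: compare scales

normalize <- function(x) (x - min(x)) / (max(x) - min(x))        # step 7: min-max scaling
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))           # step 8: all numeric features
summary(wbcd_n$area_mean)                                        # step 9: confirm 0-1 range

wbcd_train <- wbcd_n[1:469, ]                                    # step 10: train/test split
wbcd_test  <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels  <- wbcd[470:569, 1]

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)            # step 11: k-NN
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)                                   # step 12: evaluate
```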
