Supervised Learning: Classification, Part 2
Nearest Neighbor, Part 2
Consider the following dataset and apply k-NN to classify a sample point
with sepal length = 5.2 and sepal width = 3.1:
• If k = 1: setosa
• If k = 2: setosa
• If k = 3: setosa
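As a sketch of how this classification could be reproduced in R (assuming the slide's dataset resembles the sepal measurements in R's built-in iris data; the slide's actual table is not shown here), the knn() function from the class package can be applied directly:

```r
# Sketch only: uses R's built-in iris data as a stand-in for the slide's dataset.
library(class)

train <- iris[, c("Sepal.Length", "Sepal.Width")]   # two predictor features
labels <- iris$Species                              # class labels
sample_point <- data.frame(Sepal.Length = 5.2, Sepal.Width = 3.1)

# Classify the sample point for k = 1, 2, and 3
for (k in 1:3) {
  pred <- knn(train = train, test = sample_point, cl = labels, k = k)
  cat("k =", k, "->", as.character(pred), "\n")
}
```

With the iris data, the nearest neighbors of (5.2, 3.1) fall in the setosa cluster, matching the answers above.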
Diagnosing Breast Cancer with
the k-NN Algorithm
• We will utilize the Wisconsin Breast Cancer
Diagnostic dataset.
• The breast cancer data includes 569 examples
of cancer biopsies, each with 32 features.
• One feature is an identification number,
another is the cancer diagnosis, and 30 are
numeric-valued laboratory measurements.
• The diagnosis is coded as "M" to indicate
malignant or "B" to indicate benign.
• Download the wisc_bc_data.csv file and save
it to your R working directory.
• Save the Wisconsin breast cancer data to the
wbcd data frame:
> wbcd <- read.csv("wisc_bc_data.csv",
stringsAsFactors = FALSE)
• To examine the structure of wbcd,
execute:
> str(wbcd)
• The first variable is an integer variable named id. As
this is simply a unique identifier (ID) for each
patient in the data, it does not provide useful
information, and we will need to exclude it from
the model.
> wbcd <- wbcd[-1]
• The next variable indicates whether the example is
from a benign or malignant mass. The table()
output indicates that 357 masses are benign while
212 are malignant:
> table(wbcd$diagnosis)
• Many R classifiers require the target feature to be coded as a factor, so we will need to recode the diagnosis variable.
> wbcd$diagnosis<- factor(wbcd$diagnosis,
levels = c("B", "M"), labels = c("Benign",
"Malignant"))
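To see what this recoding does on a small scale, the same factor() call can be tried on a toy vector (the vector here is made up for illustration):

```r
# Toy example: recode a hypothetical vector of diagnosis codes
codes <- c("B", "M", "B", "B", "M")
diagnosis <- factor(codes, levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))

diagnosis         # values now display as Benign / Malignant
table(diagnosis)  # Benign: 3, Malignant: 2
```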
• Now, when we look at the prop.table() output,
we notice that the values have been labeled
Benign and Malignant with 62.7 percent and
37.3 percent of the masses, respectively:
> round(prop.table(table(wbcd$diagnosis)) *
100, digits = 1)
• The remaining 30 features are all numeric.
>summary(wbcd[c("radius_mean",
"area_mean", "smoothness_mean")])
• The summary output shows that the features are measured on very
different scales: area_mean ranges into the thousands, while
smoothness_mean never exceeds 1. Without rescaling, area will have
a much larger impact than smoothness on the distance calculation.
• To normalize these features, we need to create a
normalize() function.
> normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
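To confirm the function behaves as expected, it can be tested on a couple of simple vectors:

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

normalize(c(1, 2, 3, 4, 5))        # 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))   # same result: the scale doesn't matter
```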
• We can now apply the normalize() function to the
numeric features in our data frame.
• The lapply() function takes a list and applies a
specified function to each list element.
> wbcd_n <- as.data.frame(lapply(wbcd[2:31],
normalize))
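The effect of this lapply() step can be sketched on a small synthetic data frame standing in for wbcd[2:31] (the column names and values below are made up for illustration):

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Two features on wildly different scales, like area_mean vs. smoothness_mean
df <- data.frame(area = c(100, 550, 1000),
                 smoothness = c(0.05, 0.10, 0.15))

# lapply() applies normalize() to each column; the result is re-wrapped
# into a data frame
df_n <- as.data.frame(lapply(df, normalize))
df_n  # both columns now run from 0 to 1, so neither dominates the distance
```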
• We will use the first 469 records for the training dataset
and the remaining 100 to simulate new patients.
• we will split the wbcd_n data frame into wbcd_train
and wbcd_test:
> wbcd_train <- wbcd_n[1:469, ]
> wbcd_test <- wbcd_n[470:569, ]
• When we constructed our normalized training and
test datasets, we excluded the target variable,
diagnosis.
• For training the k-NN model, we will need to store
these class labels in factor vectors, split between the
training and test datasets:
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]
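With the data prepared, a natural next step (sketched here; it assumes the wbcd_train, wbcd_test, and label objects built above, along with the knn() function from the class package; k = 21 is one common choice, roughly the square root of the 469 training records) is to classify the test records and compare predictions against the true labels:

```r
# Sketch of the next step: classify the 100 test records with k-NN.
# Assumes the objects created in the preceding steps exist.
library(class)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

# Cross-tabulate predicted vs. actual labels to evaluate accuracy
table(wbcd_test_pred, wbcd_test_labels)
```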