Supervised Learning: Classification
Nearest Neighbor Part-2
Consider the following dataset. Apply k-NN and determine the classification of a
sample point with sepal length = 5.2 and sepal width = 3.1.
If k = 1: setosa
If k = 2: setosa
If k = 3: setosa
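The slide's table is not reproduced here, but a minimal sketch using R's built-in
iris data (an assumption; the slide's dataset may differ) illustrates the same
kind of classification with the knn() function from the class package:
> library(class)
> train <- iris[, c("Sepal.Length", "Sepal.Width")]
> test <- data.frame(Sepal.Length = 5.2, Sepal.Width = 3.1)
> # predicted class of the sample point using its 3 nearest neighbors
> knn(train = train, test = test, cl = iris$Species, k = 3)
On iris this should likewise return setosa for small values of k.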
Diagnosing Breast Cancer with
the k-NN Algorithm
• We will utilize the Wisconsin Breast Cancer
Diagnostic dataset.
• The breast cancer data includes 569 examples
of cancer biopsies, each with 32 features.
• One feature is an identification number,
another is the cancer diagnosis, and 30 are
numeric-valued laboratory measurements.
• The diagnosis is coded as "M" to indicate
malignant or "B" to indicate benign.
• Download the wisc_bc_data.csv file and save
it to your R working directory.
• Save the Wisconsin breast cancer data to the
wbcd data frame:
> wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
• To examine the structure of wbcd, execute:
> str(wbcd)
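# expect a data frame of 569 observations of 32 variables, as described above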
• The first variable is an integer variable named id. As
this is simply a unique identifier (ID) for each
patient in the data, it does not provide useful
information, and we will need to exclude it from
the model.
> wbcd <- wbcd[-1]
• The next variable indicates whether the example is
from a benign or malignant mass. The table()
output indicates that 357 masses are benign while
212 are malignant:
> table(wbcd$diagnosis)
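# expected output (counts stated above):
#   B   M
# 357 212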
• Many R classifiers require the target feature to be coded as a factor, so we
will need to recode the diagnosis variable:
> wbcd$diagnosis<- factor(wbcd$diagnosis,
levels = c("B", "M"), labels = c("Benign",
"Malignant"))
• Now, when we look at the prop.table() output,
we notice that the values have been labeled
Benign and Malignant with 62.7 percent and
37.3 percent of the masses, respectively:
> round(prop.table(table(wbcd$diagnosis)) *
100, digits = 1)
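# expected output (percentages stated above):
#    Benign Malignant
#      62.7      37.3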
• The remaining 30 features are all numeric.
>summary(wbcd[c("radius_mean",
"area_mean", "smoothness_mean")])
• From the summary we can see that area_mean spans a much wider range of values
than smoothness_mean, so the impact of area will be much larger than that of
smoothness in the distance calculation.
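A quick illustration (the numbers below are made up for illustration only, not
taken from the dataset):
> # with unnormalized features, a typical area difference swamps a typical
> # smoothness difference in the Euclidean distance
> sqrt((1000 - 900)^2 + (0.10 - 0.15)^2)   # ~100, driven almost entirely by area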
• To normalize these features, we need to create a
normalize() function.
> normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
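• As a quick sanity check (not from the original slides), applying normalize()
to a simple vector shows that it rescales values to the [0, 1] range:
> normalize(c(1, 2, 3, 4, 5))
# returns: 0.00 0.25 0.50 0.75 1.00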
• We can now apply the normalize() function to the
numeric features in our data frame.
• The lapply() function takes a list and applies a
specified function to each list element.
> wbcd_n <- as.data.frame(lapply(wbcd[2:31],
normalize))

• To confirm that the transformation was applied correctly, let's look at one
variable's summary statistics:
> summary(wbcd_n$area_mean)
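# after normalization, area_mean should range from a minimum of 0 to a maximum of 1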

• We will use the first 469 records for the training dataset and the remaining
100 records to simulate new patients, so we will split the wbcd_n data frame
into wbcd_train and wbcd_test:
> wbcd_train <- wbcd_n[1:469, ]
> wbcd_test <- wbcd_n[470:569, ]
• When we constructed our normalized training and
test datasets, we excluded the target variable,
diagnosis.
• For training the k-NN model, we will need to store
these class labels in factor vectors, split between the
training and test datasets:
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]

• To classify our test instances, we will use a k-NN implementation from the
class package, which provides a set of basic R functions for classification.
> install.packages("class")
• To load the package during any session in which
you wish to use the functions, execute
> library(class)

• Now we can use the knn() function to classify the test data. Since the
training data contains 469 instances, we choose k = 21, an odd number roughly
equal to its square root (an odd k avoids tie votes between the two classes):
> wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
    cl = wbcd_train_labels, k = 21)
• The knn() function returns a factor vector of
predicted labels for each of the examples in the test
dataset, which we have assigned to
wbcd_test_pred.
• The next step of the process is to evaluate how well
the predicted classes in the wbcd_test_pred vector
match up with the known values in the
wbcd_test_labels vector.
• To do this, we can use the CrossTable() function in
the gmodels package.
> install.packages("gmodels")

• Load the package using:
> library(gmodels)
> CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
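Beyond the cross table, a minimal sketch (assuming the objects defined above)
computes overall accuracy directly and compares a few alternative values of k:
> # fraction of test predictions that match the true labels
> mean(wbcd_test_pred == wbcd_test_labels)
> # try several k values and report accuracy for each
> for (k in c(1, 5, 11, 15, 21, 27)) {
    pred <- knn(train = wbcd_train, test = wbcd_test,
                cl = wbcd_train_labels, k = k)
    cat("k =", k, "accuracy =", mean(pred == wbcd_test_labels), "\n")
  }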
Algorithm
1. Read the given dataset wbcd
2. Display the structure and analyze it
3. Remove id column
4. Find the number of B and M samples
5. Recode the B and M labels to "Benign" and "Malignant"
6. Generate the summary of "radius_mean", "area_mean", and "smoothness_mean"
7. Apply normalization (as value of area is much larger
than smoothness)
8. Use lapply() to apply the normalize() function to every numeric column
(all columns excluding diagnosis)
9. Check the values of area_mean after normalization
10. Split the dataset into training and testing sets
11. Apply knn()
12. Apply the CrossTable() function to evaluate how well the predicted classes
in the wbcd_test_pred vector match up with the known values in the
wbcd_test_labels vector (the full script is consolidated below)
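Putting the steps together, the complete workflow can be consolidated into one
script (a sketch assembled from the commands shown above; file name and column
positions as assumed earlier):
> wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
> str(wbcd)                   # 569 obs. of 32 variables
> wbcd <- wbcd[-1]            # drop the id column
> table(wbcd$diagnosis)       # counts of B and M
> wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
    labels = c("Benign", "Malignant"))
> round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
> summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
> normalize <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
  }
> wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
> summary(wbcd_n$area_mean)   # should now range from 0 to 1
> wbcd_train <- wbcd_n[1:469, ]
> wbcd_test <- wbcd_n[470:569, ]
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]
> library(class)
> wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
    cl = wbcd_train_labels, k = 21)
> library(gmodels)
> CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)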
