
9.18 Problem Set - Classification

The document outlines a problem set focused on classification methods using a breast cancer biopsy dataset. It details the steps for data preparation, including cleaning and splitting the dataset, and applying logistic regression, random forest, and k-NN for prediction. The results discuss the accuracy of each method and emphasize the importance of minimizing false negatives in medical diagnoses.


---
title: '9.18 Problem Set: Classification'
author: 'Trinh Le'
---
```{r, eval=FALSE}
# One-time setup: install the packages used below (not re-run on every knit)
install.packages("randomForest")
install.packages("MLmetrics")
```

```{r}
library(MASS)         # biopsy dataset
library(tidyverse)    # data wrangling (loaded after MASS so dplyr::select wins)
library(class)        # knn()
library(randomForest)
library(MLmetrics)    # Accuracy() and ConfusionMatrix(), used below
```

1. Review the dataset documentation and describe the dataset in your own words.
```{r}
# The biopsy dataset from the MASS package contains data from breast cancer
# biopsies. It includes:
# - Predictor variables: V1-V9, nine numeric features describing cell-nucleus
#   characteristics such as clump thickness and cell size.
# - Target variable: class, a factor indicating whether the case is benign or
#   malignant.
```
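A quick structural look confirms the description above (a sketch; uses only what the MASS package provides):

```{r}
library(MASS)

# Nine numeric predictors (V1-V9), an ID column, and the class factor
str(biopsy)

# Class balance: benign cases outnumber malignant ones
table(biopsy$class)
```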
2. Prepare your data:

a. Save to a variable as a tibble.
```{r}
data <- as_tibble(biopsy)
```
b. Remove the ID variable.
```{r}
data <- data %>% select(-ID)
```

c. Drop rows that have any NA values using the function drop_na().
```{r}
data <- data %>% drop_na()
```
d. Split the data into a training set and a testing set, using 80% of the records for the training set and 20% for the testing set.
```{r}
set.seed(42) # For reproducibility
n <- nrow(data)
train_indices <- sample(1:n, size = 0.8 * n)
train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]
```
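As a sanity check, the two partitions should be disjoint and together cover every row — a minimal sketch:

```{r}
# Every row lands in exactly one partition, and the split is roughly 80/20
stopifnot(nrow(train_set) + nrow(test_set) == nrow(data))
nrow(train_set) / nrow(data)
```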

3. Use each of the following classification methods to predict class. Use all of the available predictor variables. For each method, report accuracy on the test set and print a confusion matrix.

a. Logistic regression (Note that you will need to convert the class variable to be 1 or 0 instead of “benign” or “malignant.” Be sure to not include the original class variable in your model.)
```{r}
library(MLmetrics)  # Accuracy() and ConfusionMatrix()

# Encode the outcome as 1 (malignant) / 0 (benign)
train_set <- train_set %>%
  mutate(class_binary = if_else(class == "malignant", 1, 0))

test_set <- test_set %>%
  mutate(class_binary = if_else(class == "malignant", 1, 0))

# Fit on all predictors, excluding the original factor class
logreg_model <- glm(class_binary ~ . - class, family = binomial, data = train_set)

# Convert predicted probabilities to class labels at a 0.5 threshold
logreg_pred <- predict(logreg_model, newdata = test_set, type = "response")
logreg_binary <- if_else(logreg_pred > 0.5, 1, 0)

logreg_accuracy <- Accuracy(y_pred = logreg_binary, y_true = test_set$class_binary)
logreg_conf_matrix <- ConfusionMatrix(y_pred = logreg_binary, y_true = test_set$class_binary)

logreg_accuracy
logreg_conf_matrix
```

b. Random forest
```{r}
library(randomForest)

# class_binary would leak the outcome, so exclude it from the predictors
rf_model <- randomForest(class ~ . - class_binary, data = train_set)

rf_pred <- predict(rf_model, newdata = test_set)

rf_accuracy <- Accuracy(y_pred = rf_pred, y_true = test_set$class)
rf_conf_matrix <- ConfusionMatrix(y_pred = rf_pred, y_true = test_set$class)

rf_accuracy
rf_conf_matrix
```

c. k-NN (use k = 5. Note that the data is already normalized and ready to use for k-NN.)
```{r}
library(class)

# knn() takes numeric predictors only, so drop both label columns
knn_pred <- knn(
  train = train_set %>% select(-class, -class_binary),
  test = test_set %>% select(-class, -class_binary),
  cl = train_set$class,
  k = 5
)

knn_accuracy <- Accuracy(y_pred = knn_pred, y_true = test_set$class)
knn_conf_matrix <- ConfusionMatrix(y_pred = knn_pred, y_true = test_set$class)

knn_accuracy
knn_conf_matrix
```

4. Briefly discuss the results. Which method performed the best? Do you think false positives or false negatives are more important in this case?
```{r}
# Which method performed the best? Based on the accuracy values: logistic
# regression had an accuracy of [logreg_accuracy], random forest had an
# accuracy of [rf_accuracy], and k-NN had an accuracy of [knn_accuracy]. The
# method with the highest test accuracy is the best at predicting whether a
# case is benign or malignant.
# False positives vs. false negatives: false negatives (malignant classified
# as benign) are more dangerous, because they could delay treatment for a
# potentially life-threatening condition. False positives (benign classified
# as malignant) are less critical, but cause unnecessary stress and additional
# testing.
# Conclusion: the best method is the one that distinguishes benign from
# malignant cases most accurately, but it is crucial to keep false negatives
# low so that no malignant cases are missed.
```
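Since false negatives are the main concern, the false-negative count can be read straight off each confusion matrix. A minimal sketch, assuming the MLmetrics layout in which rows are true labels and columns are predictions; the "malignant"/"benign" cell names apply to the forest and k-NN matrices, while the logistic matrix uses the 1/0 encoding:

```{r}
# False negatives: truly malignant cases predicted benign
# (rows = true labels, columns = predictions in MLmetrics::ConfusionMatrix)
fn_logreg <- logreg_conf_matrix["1", "0"]
fn_rf <- rf_conf_matrix["malignant", "benign"]
fn_knn <- knn_conf_matrix["malignant", "benign"]

c(logistic = fn_logreg, random_forest = fn_rf, knn = fn_knn)
```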
