9.18 Problem Set - Classification
```{r}
library(MASS)
```
```{r}
library(tidyverse)
library(class)
library(randomForest)
```
1. Review the dataset documentation and describe the dataset in your own words.
```{r}
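# The biopsy dataset from the MASS package records 699 breast tumor biopsies. It includes:
# - ID: a sample code number, which is not used for prediction.
# - Predictor variables: 9 numerical attributes (V1 to V9) describing cell
#   characteristics, each scored on a scale from 1 to 10. V6 has 16 missing values.
# - Target variable: class, a factor indicating whether the case is benign or malignant.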
```
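The description above can be checked directly against the data. The chunk below is a minimal sketch that only inspects `biopsy` as shipped with MASS and does not modify anything used later.

```{r}
library(MASS)  # provides the biopsy data frame

# Number of rows (biopsies) and columns (ID, V1..V9, class)
dim(biopsy)

# Column names and types
str(biopsy)

# Missing values per column (this is what drop_na() will act on)
colSums(is.na(biopsy))

# Class balance: counts of benign and malignant cases
table(biopsy$class)
```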
2. Prepare your data:
a. Save to a variable as a tibble.
```{r}
library(MASS)
library(tidyverse)
data <- as_tibble(biopsy)
```
b. Remove the ID variable.
```{r}
# MASS also exports a select() function, so name dplyr's version explicitly
data <- data %>% dplyr::select(-ID)
```
c. Drop rows that have any NA values using the function drop_na().
```{r}
data <- data %>% drop_na()
```
d. Split the data into a training set and a testing set, using 80% of the records for the
training set and 20% for the testing set.
```{r}
set.seed(42) # For reproducibility
n <- nrow(data)
train_indices <- sample(seq_len(n), size = floor(0.8 * n)) # whole number of training rows
train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]
```
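Because the split is random rather than stratified, it can help to confirm the set sizes and the class proportions in each piece. The chunk below is a sketch that rebuilds the split under separate, hypothetical names (`check_data`, `check_train`, `check_test`) so that it does not overwrite `train_set` or `test_set`.

```{r}
library(MASS)
library(tidyverse)

# Rebuild the cleaned data under a separate name
check_data <- as_tibble(biopsy) %>%
  dplyr::select(-ID) %>%
  drop_na()

# Same seed and split rule as the step above
set.seed(42)
n_check <- nrow(check_data)
check_idx <- sample(seq_len(n_check), size = floor(0.8 * n_check))
check_train <- check_data[check_idx, ]
check_test <- check_data[-check_idx, ]

# The two pieces should add up to the full data set
c(train = nrow(check_train), test = nrow(check_test), total = n_check)

# Share of benign and malignant cases in each piece
prop.table(table(check_train$class))
prop.table(table(check_test$class))
```

If the two sets show very different class proportions, a stratified split would be one alternative to consider.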
3. Use each of the following classification methods to predict class. Use all of the
available predictor variables. For each method, report accuracy on the test set and print
a confusion matrix.
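Since every method needs the same two outputs, one option is a small shared helper. The function `evaluate_predictions()` below is hypothetical (it is not part of any package), and the toy labels are made up purely to show its behavior.

```{r}
# Hypothetical helper: accuracy and confusion matrix for predicted vs. actual classes
evaluate_predictions <- function(pred, actual) {
  # Use the same levels as the actual labels so the table is always square,
  # even if one class is never predicted
  pred <- factor(pred, levels = levels(actual))
  list(
    accuracy = mean(pred == actual),
    confusion = table(Predicted = pred, Actual = actual)
  )
}

# Made-up labels, for illustration only
toy_actual <- factor(c("benign", "benign", "malignant", "malignant"),
                     levels = c("benign", "malignant"))
toy_pred <- c("benign", "malignant", "malignant", "malignant")

toy_result <- evaluate_predictions(toy_pred, toy_actual)
toy_result$accuracy
toy_result$confusion
```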
a. Logistic regression (Note that you will need to convert the class variable to be 1 or 0
instead of “benign” or “malignant.” Be sure to not include the original class variable in
your model.)
```{r}
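# Recode class as 1/0 in a copy and drop the original class column from the model data
logreg_train <- train_set %>%
  mutate(class_binary = if_else(class == "malignant", 1, 0)) %>% dplyr::select(-class)
logreg_fit <- glm(class_binary ~ ., data = logreg_train, family = binomial)
logreg_prob <- predict(logreg_fit, newdata = test_set, type = "response")
logreg_pred <- ifelse(logreg_prob > 0.5, "malignant", "benign") # 0.5 probability cutoff
(logreg_accuracy <- mean(logreg_pred == test_set$class))
(logreg_conf_matrix <- table(Predicted = logreg_pred, Actual = test_set$class))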
```
b. Random forest
```{r}
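library(randomForest)
set.seed(42) # random forests are stochastic; fix the seed for reproducibility
rf_fit <- randomForest(class ~ ., data = dplyr::select(train_set, V1:V9, class)) # V1..V9 only
rf_pred <- predict(rf_fit, newdata = test_set)
(rf_accuracy <- mean(rf_pred == test_set$class))
(rf_conf_matrix <- table(Predicted = rf_pred, Actual = test_set$class))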
```
c. k-NN (use k = 5. Note that the data is already normalized and ready to use for k-NN.)
```{r}
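library(class)
set.seed(42) # knn() breaks ties at random
knn_pred <- knn(train = dplyr::select(train_set, V1:V9), test = dplyr::select(test_set, V1:V9),
                cl = train_set$class, k = 5) # predictors only; labels go in cl
(knn_accuracy <- mean(knn_pred == test_set$class))
(knn_conf_matrix <- table(Predicted = knn_pred, Actual = test_set$class))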
```
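The note in part (c) says the data is already normalized. A quick way to see what that means here is to compare the range of each predictor, since k-NN distances are dominated by whichever feature has the widest scale. This is only a sanity check on the raw `biopsy` data, not a substitute for scaling in general.

```{r}
library(MASS)  # provides the biopsy data frame

# Minimum and maximum of each predictor, ignoring missing values
predictor_ranges <- sapply(biopsy[, paste0("V", 1:9)], range, na.rm = TRUE)
rownames(predictor_ranges) <- c("min", "max")
predictor_ranges
```

If all nine columns share the same bounds, no single predictor outweighs the others in the distance calculation.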
4. Briefly discuss the results. Which method performed the best? Do you think false
positives or false negatives are more important in this case?
```{r}
# Which method performed the best? Based on the test-set accuracy values:
# - Logistic regression had an accuracy of [logreg_accuracy].
# - Random forest had an accuracy of [rf_accuracy].
# - k-NN had an accuracy of [knn_accuracy].
# The method with the highest accuracy is the best overall at predicting whether a case
# is benign or malignant. Accuracy alone does not show which kind of error a method
# makes, so the confusion matrices should be compared as well.
#
# False positives vs. false negatives: False negatives (malignant classified as benign)
# are more dangerous because they could delay treatment for a potentially
# life-threatening condition. False positives (benign classified as malignant) are less
# critical but can cause unnecessary stress and additional testing.
#
# Conclusion: Accuracy identifies the best method overall, but it is crucial to reduce
# false negatives so that no malignant cases are missed.
```
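The concern about false negatives can be made concrete by measuring the false negative rate and seeing how it responds to the probability cutoff. The chunk below is a sketch under stated assumptions: `false_negative_rate()` is a hypothetical helper, the `fn_` objects are rebuilt here so nothing above is overwritten, and the 0.3 cutoff is an arbitrary value chosen only for comparison with the default 0.5.

```{r}
library(MASS)
library(tidyverse)

# Hypothetical helper: share of truly malignant cases that were predicted benign
false_negative_rate <- function(pred, actual, positive = "malignant") {
  is_positive <- actual == positive
  sum(pred != positive & is_positive) / sum(is_positive)
}

# Rebuild the cleaned data and the 80/20 split under separate names
fn_data <- as_tibble(biopsy) %>%
  dplyr::select(-ID) %>%
  drop_na()
set.seed(42)
fn_idx <- sample(seq_len(nrow(fn_data)), size = floor(0.8 * nrow(fn_data)))
fn_test <- fn_data[-fn_idx, ]
fn_train <- fn_data[fn_idx, ] %>%
  mutate(class_binary = if_else(class == "malignant", 1, 0)) %>%
  dplyr::select(-class)

# Logistic regression on V1..V9, then predicted probabilities of malignancy
fn_fit <- glm(class_binary ~ ., data = fn_train, family = binomial)
fn_prob <- predict(fn_fit, newdata = fn_test, type = "response")

# False negative rate at the default cutoff and at a lower, illustrative cutoff
fn_cutoffs <- c(0.5, 0.3)
fn_rates <- sapply(fn_cutoffs, function(cutoff) {
  pred <- ifelse(fn_prob > cutoff, "malignant", "benign")
  false_negative_rate(pred, fn_test$class)
})
setNames(fn_rates, paste0("cutoff_", fn_cutoffs))
```

Lowering the cutoff can only keep the false negative rate the same or reduce it, at the price of more false positives, which matches the trade-off described in the discussion.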