Classification
Classification is a type of supervised learning where the aim is to assign a label or category
to input data based on a set of features. In R, classification models are developed and
evaluated using statistical and machine learning methods. Below is an overview of the
theoretical concepts involved in classification.
Fundamentals of Classification
Objective: Predict a discrete target variable (class label) based on input features.
Inputs: A dataset with:
    o Features: Predictor variables (e.g., age, income).
    o Class Labels: Target categories to predict (e.g., "yes" or "no").
Output: A model that maps features to class labels.
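To make these terms concrete, here is a minimal sketch in base R. The data frame, its column names, and the income threshold are all made up for illustration. The hand-written rule only stands in for a fitted model, to show that a classifier is a mapping from features to class labels.

```r
# Toy data: two features and one class label (all values are made up).
toy <- data.frame(
  age    = c(22, 35, 47, 51, 63),
  income = c(18, 42, 55, 61, 30),   # in thousands, hypothetical
  bought = factor(c("no", "yes", "yes", "yes", "no"), levels = c("no", "yes"))
)

features <- toy[, c("age", "income")]  # predictor variables
labels   <- toy$bought                 # discrete target variable (class label)

# A "model" is any function that maps features to class labels.
# This hand-written threshold rule stands in for a fitted classifier.
rule_model <- function(x) {
  factor(ifelse(x$income > 40, "yes", "no"), levels = c("no", "yes"))
}

predicted <- rule_model(features)
print(data.frame(features, actual = labels, predicted = predicted))
```

A learning algorithm such as a decision tree would derive a rule of this kind from the training data instead of having it written by hand.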
Common classification algorithms include decision trees, random forests, and support
vector machines.
Steps to Build a Classification Model in R
1. Data Preparation: Load, clean, and split the dataset.
2. Exploratory Data Analysis (EDA): Explore the data to understand relationships and distributions.
3. Model Training: Fit the model to the training data.
4. Model Evaluation: Use metrics like accuracy, precision, recall, and ROC-AUC.
5. Hyperparameter Tuning: Optimize the model for better performance.
6. Validation: Test the model on unseen data.
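The six steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the session recorded in this document. It uses the built-in iris data reduced to two classes, a plain random split from base R rather than caTools::sample.split(), and pruning on rpart's internally cross-validated complexity parameter as one simple form of tuning. The names iris2, iris_train, iris_test, and best_cp are chosen here for illustration.

```r
library(rpart)  # recommended package shipped with standard R installations

# 1. Data preparation: built-in iris data, reduced to a two-class problem.
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(123)                                    # for reproducibility
train_idx  <- sample(nrow(iris2), size = round(0.7 * nrow(iris2)))
iris_train <- iris2[train_idx, ]
iris_test  <- iris2[-train_idx, ]

# 2. EDA: class balance and feature summaries.
print(table(iris_train$Species))
print(summary(iris_train[, 1:4]))

# 3. Model training: fit a classification tree.
fit <- rpart(Species ~ ., data = iris_train, method = "class")

# 4-5. Evaluation and tuning: rpart cross-validates the complexity
# parameter (cp) internally; prune to the cp with the lowest CV error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# 6. Validation: score the pruned tree on the held-out rows.
pred     <- predict(pruned, iris_test, type = "class")
conf_mat <- table(Predicted = pred, Actual = iris_test$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

A random split does not preserve class proportions. When the classes are imbalanced, a stratified split such as the one caTools::sample.split() provides is usually the safer choice.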
IMPLEMENTATION
OUTPUT
> colSums(is.na(data3))
   LungCap.cc.    Age..years. Height.inches.
             0              0              0
         Smoke         Gender      Caesarean
             0              0              0
> data3$Smoke <- as.factor(data3$Smoke)
> set.seed(123) # For reproducibility
> library(caTools)
> split <- sample.split(data3$Smoke, SplitRatio = 0.7)
> train_data <- subset(data3, split == TRUE)
> test_data <- subset(data3, split == FALSE)
> library(rpart)
> tree_model <- rpart(Smoke ~ LungCap.cc. + Age..years. + Height.inches.,
+ data = train_data,
+ method = "class")
> # Visualize the tree
> library(rpart.plot)
> rpart.plot(tree_model)
> predictions <- predict(tree_model, test_data, type = "class")
> table(Predicted = predictions, Actual = test_data$Smoke)
         Actual
Predicted  no yes
      no  189  20
      yes   5   3
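> # Store the table under a name that does not shadow caret's confusionMatrix() function
> conf_mat <- table(Predicted = predictions, Actual = test_data$Smoke)
> # Accuracy: correctly classified cases (the diagonal) divided by all cases
> accuracy <- sum(diag(conf_mat)) / sum(conf_mat)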
> print(accuracy)
[1] 0.8847926
> # Using the caret package for detailed metrics
> library(caret)
Loading required package: ggplot2
Use suppressPackageStartupMessages() to eliminate
package startup messages
Loading required package: lattice
> confusionMatrix(predictions, test_data$Smoke)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  189  20
       yes   5   3

               Accuracy : 0.8848
                 95% CI : (0.8346, 0.924)
    No Information Rate : 0.894
    P-Value [Acc > NIR] : 0.71613

                  Kappa : 0.1469

            Sensitivity : 0.9742
            Specificity : 0.1304
         Pos Pred Value : 0.9043
         Neg Pred Value : 0.3750
             Prevalence : 0.8940
         Detection Rate : 0.8710
   Detection Prevalence : 0.9631
      Balanced Accuracy : 0.5523

       'Positive' Class : no
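The statistics reported by confusionMatrix() can be reproduced by hand from the four cell counts, which helps when interpreting them. The following is a minimal sketch in base R. The variable names (cm, tp, fn, fp, tn, nir) are chosen here for illustration and are not part of the original session. Because the reported positive class is "no", a true positive here is a non-smoker predicted as a non-smoker.

```r
# Confusion matrix from the output above (rows = predicted, columns = actual).
# matrix() fills column by column: actual "no" first, then actual "yes".
cm <- matrix(c(189, 5, 20, 3), nrow = 2,
             dimnames = list(Predicted = c("no", "yes"),
                             Actual    = c("no", "yes")))

tp <- cm["no",  "no"]    # predicted no,  actually no  ("no" is the positive class)
fn <- cm["yes", "no"]    # predicted yes, actually no
fp <- cm["no",  "yes"]   # predicted no,  actually yes
tn <- cm["yes", "yes"]   # predicted yes, actually yes

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)           # share of actual "no" found
specificity <- tn / (tn + fp)           # share of actual "yes" found
ppv         <- tp / (tp + fp)           # Pos Pred Value
npv         <- tn / (tn + fn)           # Neg Pred Value
nir         <- max(colSums(cm)) / sum(cm)  # No Information Rate (majority class share)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              nir = nir, balanced = balanced), 4))
```

Read this way, the accuracy of 0.8848 is slightly below the no-information rate of 0.894, the accuracy obtained by always predicting "no". The specificity of 0.1304 shows that the tree identifies only 3 of the 23 smokers in the test set. The balanced accuracy of 0.5523 reflects this better than raw accuracy does.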