Classification

The document outlines the process of performing classification in R Studio, focusing on predicting discrete target variables based on input features. It details the steps for building a classification model, including data preparation, exploratory data analysis, model training, evaluation, and validation, along with required R packages like caret and randomForest. An implementation example is provided, demonstrating the use of decision trees to classify smoking status based on lung capacity, age, and height, with an achieved accuracy of approximately 88.48%.

Uploaded by

productionsankit
LAB PRACTICAL-9

Objective: Perform classification in RStudio.

Classification is a type of supervised learning where the aim is to assign a label or category
to input data based on a set of features. In R, classification models are developed and
evaluated using statistical and machine learning methods. Below is an overview of the
theoretical concepts involved in classification.
Fundamentals of Classification
• Objective: Predict a discrete target variable (class label) based on input features.
• Inputs: A dataset with:
   o Features: Predictor variables (e.g., age, income).
   o Class Labels: Target categories to predict (e.g., "yes" or "no").
• Output: A model that maps features to class labels.
Common classification algorithms include decision trees, random forests, and support vector machines.
Steps to Build a Classification Model in R
1. Data Preparation: Load, clean, and split the dataset.
2. Exploratory Data Analysis (EDA): Explore the data to understand relationships and distributions.
3. Model Training: Fit the model to the training data.
4. Model Evaluation: Use metrics like accuracy, precision, recall, and ROC-AUC.
5. Hyperparameter Tuning: Optimize the model for better performance.
6. Validation: Test the model on unseen data.
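
The six steps above can be sketched end to end in a few lines. This is a minimal illustration, not the practical's own program: it assumes the built-in iris dataset as a stand-in, and the variable names (train, test, fit, pruned) are chosen here for illustration.

```r
# Assumption: the built-in iris data stands in for a real dataset.
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)  # make the random split repeatable

# 1. Data preparation: hold out 30% of the rows for testing
n <- nrow(iris)
test_idx <- sample(n, size = round(0.3 * n))
train <- iris[-test_idx, ]
test <- iris[test_idx, ]

# 2. EDA: class balance and feature ranges in the training data
print(table(train$Species))
print(summary(train[, 1:4]))

# 3. Model training: fit a classification tree
fit <- rpart(Species ~ ., data = train, method = "class")

# 4. Model evaluation: confusion matrix and accuracy on the held-out rows
pred <- predict(fit, test, type = "class")
conf_mat <- table(Predicted = pred, Actual = test$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)

# 5. Hyperparameter tuning: pick the complexity parameter (cp) with the
#    lowest cross-validated error and prune the tree to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# 6. Validation: score the pruned tree on the held-out rows
pruned_pred <- predict(pruned, test, type = "class")
pruned_accuracy <- mean(pruned_pred == test$Species)
print(pruned_accuracy)
```

In a stricter setup the data used to compare tuned models would be kept separate from the final test set; the sketch reuses one held-out set for brevity.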

R packages required for Classification

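1. caret: Provides a unified interface to numerous machine learning algorithms for classification and regression tasks.

2. randomForest: Implements the Random Forest algorithm for classification and regression tasks.

3. rpart: Builds decision trees for classification and regression tasks.

4. caTools: Provides sample.split(), used in the program code to split the data into training and test sets.

5. rpart.plot: Plots the decision trees built by rpart.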
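
Before running the program, it can help to check which of these packages are already installed. The following is a small sketch; the package list is taken from this practical (caTools and rpart.plot appear in the program code), and the variable names are illustrative.

```r
# Packages used in this practical
pkgs <- c("caret", "randomForest", "rpart", "caTools", "rpart.plot")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Suggest an install command for anything that is missing
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Install with: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```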

IMPLEMENTATION

1. Import the .csv data file into RStudio.


Type the following command in the console: data3 <- read.csv(file.choose(), header = TRUE)
A dialog box will appear. Choose the dataset file in the dialog box, and the data frame data3 will then be available in the Global Environment in RStudio.
All subsequent operations can then be performed on data3 in RStudio.
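
The column names used in the program (for example LungCap.cc. and Age..years.) look like the result of read.csv() sanitising the header row into valid R names. The sketch below illustrates this with a temporary file and a fixed path instead of file.choose(). The raw header text is an assumption made for illustration; the actual CSV header is not shown in this practical.

```r
# Hypothetical raw headers; the data rows reuse the first two observations
# shown in the str() output of this practical
csv_text <- c(
  "LungCap(cc),Age( years),Height(inches),Smoke",
  "6.47,6,62.1,no",
  "10.12,18,74.7,yes"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# A fixed path replaces file.choose(), so the script runs without a dialog box
demo <- read.csv(path, header = TRUE)

# read.csv() replaces characters that are not valid in R names with "."
print(names(demo))

# Convert the target to a factor, as the program code does
demo$Smoke <- as.factor(demo$Smoke)
str(demo)
```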
PROGRAM CODE:
# Inspect the data
str(data3)              # View structure of the dataset
summary(data3)          # Summary statistics
colSums(is.na(data3))   # Count missing values per column

# The target must be a factor for classification
data3$Smoke <- as.factor(data3$Smoke)

# Split into 70% training and 30% test data, preserving the class ratio
set.seed(123)  # For reproducibility
library(caTools)
split <- sample.split(data3$Smoke, SplitRatio = 0.7)
train_data <- subset(data3, split == TRUE)
test_data <- subset(data3, split == FALSE)

# Fit a classification tree
library(rpart)
tree_model <- rpart(Smoke ~ LungCap.cc. + Age..years. + Height.inches.,
                    data = train_data,
                    method = "class")

# Visualize the tree
library(rpart.plot)
rpart.plot(tree_model)

# Predict on the test data and tabulate predictions against actual labels
predictions <- predict(tree_model, test_data, type = "class")
table(Predicted = predictions, Actual = test_data$Smoke)
confusionMatrix <- table(Predicted = predictions, Actual = test_data$Smoke)

# Accuracy
accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
print(accuracy)

# Using the caret package for detailed metrics
# (the table above shares its name with caret's function; the call still
# works because R skips non-function objects when looking up a function)
library(caret)
confusionMatrix(predictions, test_data$Smoke)
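
sample.split() is documented as preserving the relative ratio of the labels when it splits the data. The base-R sketch below illustrates that idea of a stratified split; it is not the caTools implementation. The helper stratified_train_idx is hypothetical, and the toy label vector only reproduces the class counts (648 "no", 77 "yes") reported by summary(data3).

```r
# Toy labels with the class counts reported in this practical
labels <- factor(c(rep("no", 648), rep("yes", 77)))

# Hypothetical helper: draw about `ratio` of the row indices within each class
stratified_train_idx <- function(y, ratio = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(ratio * length(idx)))]
  }), use.names = FALSE)
}

set.seed(123)  # for a repeatable draw
train_idx <- stratified_train_idx(labels)

# Class counts in the training and test portions
train_counts <- table(labels[train_idx])
test_counts <- table(labels[-train_idx])
print(train_counts)
print(test_counts)
```

With a 0.7 ratio this leaves 194 "no" and 23 "yes" rows for testing, the same column totals as the confusion matrix in the OUTPUT section.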

OUTPUT

> str(data3) # View structure of the dataset
'data.frame': 725 obs. of 6 variables:
 $ LungCap.cc.   : num 6.47 10.12 9.55 11.12 4.8 ...
 $ Age..years.   : int 6 18 16 14 5 11 8 11 15 11 ...
 $ Height.inches.: num 62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
 $ Smoke         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
 $ Gender        : chr "male" "female" "female" "male" ...
 $ Caesarean     : chr "no" "no" "yes" "no" ...
> summary(data3) # Summary statistics
  LungCap.cc.      Age..years.     Height.inches.
 Min.   : 0.507   Min.   : 3.00   Min.   :45.30
 1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90
 Median : 8.000   Median :13.00   Median :65.40
 Mean   : 7.863   Mean   :12.33   Mean   :64.84
 3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30
 Max.   :14.675   Max.   :19.00   Max.   :81.80
 Smoke        Gender             Caesarean
 no :648   Length:725         Length:725
 yes: 77   Class :character   Class :character
           Mode  :character   Mode  :character

> colSums(is.na(data3))
   LungCap.cc.    Age..years. Height.inches.
             0              0              0
         Smoke         Gender      Caesarean
             0              0              0
> data3$Smoke <- as.factor(data3$Smoke)
> set.seed(123) # For reproducibility
> library(caTools)
> split <- sample.split(data3$Smoke, SplitRatio = 0.7)
> train_data <- subset(data3, split == TRUE)
> test_data <- subset(data3, split == FALSE)
> library(rpart)
> tree_model <- rpart(Smoke ~ LungCap.cc. + Age..years. + Height.inches.,
+ data = train_data,
+ method = "class")
> # Visualize the tree
> library(rpart.plot)
> rpart.plot(tree_model)
> predictions <- predict(tree_model, test_data, type = "class")
> table(Predicted = predictions, Actual = test_data$Smoke)
         Actual
Predicted  no yes
      no  189  20
      yes   5   3
> confusionMatrix <- table(Predicted = predictions, Actual = test_data$Smoke)
> # Accuracy
> accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
> print(accuracy)
[1] 0.8847926
> # Using the caret package for detailed metrics
> library(caret)
Loading required package: ggplot2
Use suppressPackageStartupMessages() to eliminate package startup messages
Loading required package: lattice
> confusionMatrix(predictions, test_data$Smoke)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  189  20
       yes   5   3

               Accuracy : 0.8848
                 95% CI : (0.8346, 0.924)
    No Information Rate : 0.894
    P-Value [Acc > NIR] : 0.71613

                  Kappa : 0.1469

 Mcnemar's Test P-Value : 0.00511

            Sensitivity : 0.9742
            Specificity : 0.1304
         Pos Pred Value : 0.9043
         Neg Pred Value : 0.3750
             Prevalence : 0.8940
         Detection Rate : 0.8710
   Detection Prevalence : 0.9631
      Balanced Accuracy : 0.5523

       'Positive' Class : no
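
The reported statistics can be recomputed by hand from the four counts in the confusion matrix, which helps when reading the caret output. The sketch below uses only those counts; the variable names are illustrative. Since caret treats "no" as the positive class here, the figures for the minority class "yes" appear in the output under Specificity and Neg Pred Value.

```r
# Counts copied from the confusion matrix (rows = predicted, columns = actual);
# matrix() fills column by column
cm <- matrix(c(189, 5, 20, 3), nrow = 2,
             dimnames = list(Predicted = c("no", "yes"),
                             Actual = c("no", "yes")))

accuracy <- sum(diag(cm)) / sum(cm)                    # 192 / 217
nir <- max(colSums(cm)) / sum(cm)                      # 194 / 217, always predicting "no"
recall_yes <- cm["yes", "yes"] / sum(cm[, "yes"])      # 3 / 23
precision_yes <- cm["yes", "yes"] / sum(cm["yes", ])   # 3 / 8

print(round(c(accuracy = accuracy,
              no_information_rate = nir,
              recall_yes = recall_yes,
              precision_yes = precision_yes), 4))
```

The accuracy (0.8848) is slightly below the no-information rate (0.894), so on this test set the tree does no better than always predicting "no", and it identifies only 3 of the 23 smokers. To have caret report the smoker class directly, confusionMatrix() accepts a positive argument, for example positive = "yes".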
