Final Report

This project aims to develop a predictive model for cancer diagnosis using machine learning techniques on a dataset containing demographic, lifestyle, and health variables. Several machine learning models are trained and evaluated on the dataset, including random forest, decision tree, KNN, and logistic regression. The most effective model is selected based on evaluation metrics like accuracy, sensitivity, specificity, precision, and F1 score. The significance of the project is that the developed model could help medical practitioners, researchers, and policymakers make informed decisions to improve cancer prevention and outcomes.

Uploaded by

Salma Shaheen
Copyright
© All Rights Reserved

Classification

Group Project
(Multivariate Statistics
STAT8031 - Fall 2023
- Section 1)

Group Members:
Salma Shaheen
Student ID | 8913789
Asim Javed
Student ID | 8783167
Shivani Baireddy
Student ID | 8971642
Sai Kumar Reddy
Student ID | 8977581
Abstract
This project focuses on the development of a predictive model for cancer diagnosis using the Cancer
dataset, encompassing diverse demographic, lifestyle, and health-related variables. Our primary
objective is to enhance early detection and gain insights into factors influencing cancer incidence,
aligning with the broader goal of improving healthcare outcomes and patient well-being. Leveraging
machine learning techniques such as Random Forest, Decision Tree, KNN, and Logistic Regression,
we conducted data cleaning, exploratory data analysis, and model evaluation, utilizing metrics such
as accuracy, sensitivity, specificity, precision, and F1 score.

Particular emphasis is placed on two aspects: Model Deployment and Operationalization, and
Long-Term Maintenance and Adaptability. In pursuit of a reliable predictive tool for cancer, we
favor a model that is easy to deploy in real-world settings. We also consider the long-term
sustainability of the models and their adaptability to evolving healthcare scenarios. The project's
significance lies not only in offering predictive insights but also in ensuring that the developed
models can be integrated into practice and remain useful for medical practitioners, researchers,
and policymakers. These models aim to support informed decision-making for cancer prevention
and intervention strategies.

Table of Contents

1. Introduction
2. Methodologies
3. Conclusion
4. Appendix

1. Introduction
The project aims to develop a predictive model for cancer diagnosis based on a comprehensive
dataset. The primary goal is to enhance early detection and provide valuable insights into factors
influencing cancer incidence. This aligns with the broader objective of improving healthcare
outcomes and patient well-being.

The dataset, named 'cancer_main,' comprises diverse demographic, lifestyle, and health-related
variables for a cohort of individuals. It includes information on cancer diagnosis, gender, age,
employment status, and other relevant factors. The dataset serves as the foundation for training and
evaluating machine learning models to predict and understand patterns of cancer occurrence.

The project employs a range of machine learning techniques, including Random Forest, Decision
Tree, KNN, and Logistic Regression models. Data cleaning and exploratory data analysis have been
conducted to ensure the dataset's quality and understand its characteristics. Model performance is
evaluated through metrics such as accuracy, sensitivity, specificity, precision, and F1 score to identify
the most effective predictive model.

The significance of this project lies in its potential to contribute to the field of healthcare by providing
a reliable tool for cancer prediction. By leveraging machine learning methodologies on a rich dataset,
the project aims to offer valuable insights into the complex interactions between various factors and
cancer occurrence. The outcomes of this research can inform medical practitioners, researchers, and
policymakers to make informed decisions for cancer prevention and intervention strategies.

1.1 Explanation of Dataset


This dataset represents a synthetic collection of responses gathered from a university-
conducted survey, aimed at studying the potential risk factors for lung cancer. The survey
includes a variety of demographic, lifestyle, and health-related questions. The dataset is
purely fictional and created for educational and research purposes only.

Data Description
The dataset consists of the following columns, each representing a different aspect of the
respondents' profiles and responses:

1 ID: A unique 5-digit identifier assigned to each respondent. These are randomly
generated and hold no real-world significance.
2 Gender: The gender of the respondent indicated as either 'Female' or 'Male'.
3 Age: The age of the respondent, ranging from 18 to 90 years.
4 Marital Status: Marital status of the respondent, categorized as 'Married', 'Single',
'Widowed', or 'Separated'.
5 Children: The number of children the respondent has, ranging from 0 to 5.
6 Smoker: Indicates whether the respondent smokes, with options 'Yes' or 'No'.
7 Employed: Employment status of the respondent, indicated as 'Yes' for employed
and 'No' for unemployed.

8 Years Worked: The total number of years the respondent has been employed,
ranging from 0 to 40 years. This value is set to 0 for those not employed.
9 Income Level: The self-assessed income level of the respondent, categorized as
'High', 'Medium', or 'Low'. Unemployed respondents automatically fall under the 'Low'
category.
10 Social Media: Indicates whether the respondent uses social media platforms, with
options 'Yes' or 'No'.
11 Online Gaming: Denotes whether the respondent engages in online gaming, with
options 'Yes' or 'No'.
12 Cancer: Indicates whether the respondent has been diagnosed with lung cancer,
with options 'Yes' or 'No'. This field is artificially manipulated based on a combination
of factors such as age, smoking status, employment duration, and lifestyle choices to
simulate potential risk factors for lung cancer.

The primary response variable in this dataset is "Cancer". It indicates whether an
individual has been diagnosed with lung cancer. Understanding the factors
contributing to the likelihood of cancer is crucial for preventive healthcare and
personalized interventions.

The predictors chosen for classification are:

• Gender
• Age
• Marital Status
• Children
• Smoker
• Employed
• Years Worked
• Income Level
• Social Media
• Online Gaming

These predictors were selected based on their potential relevance to the occurrence of
cancer and existing literature on similar studies.

1.2 Descriptive Analytics of Data
In preparation for the analysis, the data were cleaned using the 'tidyverse' package in R.
The dataset, denoted 'cancer_main', was inspected for missing values, and the summary
showed that there are none.
Additionally, a visual exploration of the numerical variables 'Age,' 'Children,' and 'Years
Worked' was conducted using the 'GGally' package. The resulting pairs plot, Figure 1.1,
provided insights into the distribution of these variables and aided in the detection of
potential outliers.

Figure 1.1
After visualizing the numerical variables using the pairs plot, potential outliers were
identified programmatically with the Interquartile Range (IQR) method: any data point more
than 1.5 times the IQR below the first quartile or above the third quartile was flagged as a
potential outlier. No outliers were found. The code for this step is given in the Appendix.
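
The IQR rule described above can be written as a small reusable function. This is only an illustrative sketch: the function name iqr_outliers and the toy vector are assumptions made for the example and are not part of the project code.

```r
# Hypothetical helper: return the values of x that lie outside
# [Q1 - k * IQR, Q3 + k * IQR]. k = 1.5 is the conventional threshold.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_value <- q[2] - q[1]
  lower <- q[1] - k * iqr_value
  upper <- q[2] + k * iqr_value
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy ages (made up for illustration); 150 is an obvious outlier.
toy_age <- c(18, 25, 33, 41, 47, 54, 58, 63, 70, 89, 150)
toy_outliers <- iqr_outliers(toy_age)
print(toy_outliers)
```

Applied to a column such as Age, an empty result would correspond to the "no outliers" finding reported here.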
The dataset contains 1000 respondents, of whom 776 are recorded as diagnosed with cancer
and 224 are not, a ratio of 97:28. Because the data are synthetic, this ratio reflects how the
dataset was constructed rather than real-world prevalence, but the imbalance matters for
modeling: a classifier that always predicts 'Yes' would already be correct for 77.6% of the
full dataset. The sample includes 515 females and 485 males. The mean age is approximately
54 to 55 years and the median is 54, so the age distribution is roughly symmetric; the
maximum age is 89. The mean number of years worked is 10, and the median number of
children is 3.
The code for these summaries is given in the Appendix.

2. Methodologies
In addressing the predictive modeling task for cancer presence, our analysis employs
a multifaceted approach encompassing diverse statistical techniques. This section
delineates the methodologies utilized, each tailored to extract valuable insights from
the dataset. Initially, we delve into logistic regression, a well-established method for
modeling binary outcomes. Logistic regression provides a foundation for
understanding the probabilistic relationship between predictor variables and the
likelihood of cancer presence. Subsequently, we explore alternative techniques, such
as K-Nearest Neighbors (KNN), Decision Tree and Random Forests, each
contributing distinct strengths to our analytical arsenal.

For every model, the dataset was partitioned in an 80:20 ratio: one subset (cancer) was
used to train the model, and the other (cancer_hidden) was held out to test its
performance.

The following subsections detail the implementation steps, considerations, and
motivations underlying each methodology.
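
As a rough illustration of the 80:20 partition, the sketch below splits a simulated data frame using base R only. The data are simulated to match the class counts reported earlier (776 'Yes', 224 'No'); the seed, the toy Age column, and the choice to stratify by class are assumptions, since the report does not state how the split was drawn.

```r
set.seed(8031)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the real data: 1000 rows, 776 "Yes" and 224 "No".
n <- 1000
toy <- data.frame(
  Age    = sample(18:89, n, replace = TRUE),
  Cancer = factor(rep(c("Yes", "No"), times = c(776, 224)),
                  levels = c("No", "Yes"))
)

# Stratified 80/20 split: sample 80% of the row indices within each class
# so both subsets keep roughly the same Yes/No proportions.
train_idx <- unlist(lapply(split(seq_len(n), toy$Cancer), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

cancer        <- toy[train_idx, ]   # training set
cancer_hidden <- toy[-train_idx, ]  # held-out test set

print(table(cancer$Cancer))
print(table(cancer_hidden$Cancer))
```

The rsample package mentioned later in the report offers a similar split; base R is used here only to keep the sketch dependency-free.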

2.1 Logistic Regression


Logistic Regression Overview:

Logistic Regression is a statistical method used for modeling the probability of a binary
outcome. It's particularly suitable for classification problems, where the response
variable is categorical with two levels (in this case, the presence or absence of
cancer). Logistic Regression models the log-odds of the probability as a linear
combination of predictor variables.

Data Preparation:
The analysis focuses on predicting the presence of cancer based on various features.
• The dataset was split into two parts: cancer (for training) and cancer_hidden
(for prediction and evaluation).
Variables Selection:
The 'ID' column was removed from both datasets as it does not contribute to the
modeling. For the Logistic Regression model on the cancer dataset, we selected the
following variables as predictors:

• Gender
• Age
• Marital Status
• Children
• Smoker
• Employed
• Years Worked
• Income Level
• Social Media
• Online Gaming

The choice of these variables was based on their potential relevance to cancer
prediction. Age, marital status, smoking habits, and other factors are commonly
associated with health outcomes, making them suitable predictors for this analysis.

Data Cleaning:
The data was considered clean as it was already in categorical form.
Libraries such as caret, tidyverse, and rsample were loaded for further analysis.
• Response Variable Transformation: The response variable 'Cancer' was
transformed into a factor with labels 'No' and 'Yes'.
Model Training and Evaluation:

• A logistic regression model was built using the glm function.
• The model predicts 'Cancer' based on features like 'Gender,' 'Age,' 'Marital
Status,' 'Children,' 'Smoker,' 'Employed,' 'Years Worked,' 'Income Level,' 'Social
Media,' and 'Online Gaming.'
• The family parameter was set to "binomial" to indicate binary outcome.
• The summary of the logistic regression model was examined to understand the
significance of each predictor.
• Predictions were made on the hidden dataset (cancer_hidden) using the trained
logistic regression model.
• Predictions were converted to a factor, and model performance was evaluated
using confusion matrix metrics.
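
The steps above can be sketched as follows. This is a minimal example on simulated data, not the project's actual code: the data frame, the two predictors, the coefficients used to simulate the outcome, and the 0.5 classification cut-off are all assumptions made for illustration (the report does not state the cut-off that was used).

```r
set.seed(1)

# Simulated data (NOT the project dataset): two predictors, binary outcome
n <- 500
toy <- data.frame(
  Age    = sample(18:89, n, replace = TRUE),
  Smoker = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Generate the outcome from an assumed log-odds model
log_odds <- -2 + 0.03 * toy$Age + 1.4 * (toy$Smoker == "Yes")
toy$Cancer <- factor(ifelse(runif(n) < plogis(log_odds), "Yes", "No"),
                     levels = c("No", "Yes"))

train <- toy[1:400, ]
test  <- toy[401:500, ]

# family = binomial models the log-odds of the second factor level ("Yes")
fit <- glm(Cancer ~ Age + Smoker, data = train, family = binomial)
print(summary(fit)$coefficients)

# Predicted probabilities of "Yes", converted to class labels at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are observed classes
conf_mat <- table(Predicted = pred, Actual = test$Cancer)
print(conf_mat)
```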

Experimental Results:
Logistic Regression Model Summary:
The logistic regression model was built to predict the presence of cancer based on
various features. The following key coefficients were obtained:

Predictor                      Coefficient
Gender (Male)                  -0.15
Age                             0.03
Marital Status (Separated)      0.31
Marital Status (Single)         1.08
Marital Status (Widowed)        0.03
Children                        0.32
Smoker (Yes)                    1.44
Employed (Yes)                  1.44
Years Worked                    0.03
Income Level (Low)              0.96
Income Level (Medium)          -0.17
Social Media (Yes)              1.26
Online Gaming (Yes)            -1.27

In the logistic regression model, each coefficient is the change in the log-odds of having
cancer relative to the reference level of that predictor (or per unit increase for numeric
predictors), holding the other predictors fixed. The main patterns are:
• Gender (Male): The negative coefficient (-0.15) indicates slightly lower odds of
cancer for men than for women.
• Age: Each additional year increases the log-odds of having cancer by 0.03.
• Marital Status (Separated, Single, Widowed): Relative to married respondents,
being single has the largest positive coefficient (1.08), being separated a smaller
one (0.31), and being widowed almost none (0.03).
• Children: Each additional child increases the log-odds of having cancer by 0.32.
• Smoker (Yes): Smoking has one of the largest positive coefficients (1.44),
indicating substantially higher odds of cancer for smokers.
• Employed (Yes): Being employed is associated with higher odds of cancer than
being unemployed (1.44).
• Years Worked: Each additional year worked increases the log-odds of having
cancer by 0.03.
• Income Level (Low, Medium): Relative to the high-income reference level, low
income is associated with higher odds of cancer (0.96) and medium income with
slightly lower odds (-0.17).
• Social Media (Yes): Using social media is associated with higher odds of cancer
(1.26).
• Online Gaming (Yes): Engaging in online gaming is associated with lower odds of
cancer (-1.27).
Because the dataset is synthetic, these associations reflect how the Cancer field was
generated and should not be read as causal.
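
Coefficients on the log-odds scale are easier to read once converted to odds ratios with exp(). The sketch below does this for a few of the coefficients in the table above; the vector name coefs is just an illustrative choice.

```r
# Coefficients (log-odds scale) copied from the coefficient table
coefs <- c(
  "Age"                 = 0.03,
  "Children"            = 0.32,
  "Smoker (Yes)"        = 1.44,
  "Online Gaming (Yes)" = -1.27
)

# exp() turns an additive change in log-odds into a multiplicative
# change in odds
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))
```

For example, exp(1.44) is about 4.22, so the fitted odds of cancer for a smoker are roughly four times those of a non-smoker with the same values of the other predictors, while exp(0.03) is about 1.03 per additional year of age.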
Prediction Performance on Testing Dataset:
The model's predictions on the hidden dataset resulted in a confusion matrix with the
following statistics:
Accuracy 79.2%
Sensitivity (Recall) 37.50%
Specificity 87.06%
Precision (Positive Predictive Value) 35.29%
F1 Score 0.3636

• Accuracy: The proportion of all instances classified correctly. The model
classified approximately 79.2% of instances correctly, only slightly above the
77.6% share of the majority class in the full dataset.
• Sensitivity: The proportion of actual positive instances that the model identified.
The model captured only 37.50% of the actual positive cases, so it misses most
instances of the positive class.
• Specificity: The proportion of actual negative instances that the model identified.
A specificity of 87.06% indicates that most negative cases are classified correctly.
• Precision: The proportion of positive predictions that were correct. When the
model predicted a positive case, it was correct about 35.29% of the time.
• F1 Score: The harmonic mean of precision and recall. A value of 0.3636 is low,
reflecting that both precision and recall are weak for the positive class.
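
The metric definitions above can be checked with a few lines of arithmetic. The counts below are hypothetical: they were chosen because they reproduce the reported percentages, but the report itself does not list the raw confusion-matrix counts.

```r
# Hypothetical confusion-matrix counts (an assumption, see lead-in)
tp <- 12   # actual positive, predicted positive
fn <- 20   # actual positive, predicted negative
fp <- 22   # actual negative, predicted positive
tn <- 148  # actual negative, predicted negative

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)          # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              f1 = f1), 4))
```

Which class counts as "positive" depends on how the confusion matrix was built. In caret's confusionMatrix, the first factor level is treated as the positive class unless the positive argument is supplied, so it is worth confirming that the reported sensitivity refers to the 'Yes' class.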
Conclusion:
The Logistic Regression model achieved an accuracy of 79.2% on the cancer dataset,
but its sensitivity (37.50%), precision (35.29%), and F1 score (0.3636) show that it
identifies the positive class poorly. Its main advantage is interpretability: the
coefficients show the direction and size of each predictor's association with the
likelihood of cancer. Comparison with the other classification methods below gives a
fuller picture of its performance.

2.2 KNN Model:


KNN (K-Nearest Neighbors) Model Overview:

The k-Nearest Neighbors (KNN) algorithm is a non-parametric, lazy learning algorithm
used for classification. It assigns an unseen data point to the most common class
among its k nearest neighbors based on a distance metric.
Variables Selection:

All available predictor variables were used in the KNN model to determine the
proximity of data points in the multidimensional space.
Data Cleaning:
• The data was considered clean as it was already in categorical form.
• The caret library, already loaded, was used to train the model.
• The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.

Model Training:

The KNN model was trained using 10-fold cross-validation for robust performance
assessment. The training and evaluation steps were as follows:
• The model parameters, such as the number of neighbors (k) and the distance
metric, were optimized during the training process.
• The model predicts 'Cancer' based on features like 'Gender,' 'Age,' 'Marital
Status,' 'Children,' 'Smoker,' 'Employed,' 'Years Worked,' 'Income Level,' 'Social
Media,' and 'Online Gaming.'
• The summary of the KNN model was examined to compare cross-validated
performance across the candidate values of k.
• Predictions were made on the hidden dataset (cancer_hidden) using the trained
KNN model.
• Predictions were converted to a factor, and model performance was evaluated
using confusion matrix metrics.
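
To make the procedure concrete, the sketch below implements KNN classification from scratch in base R and selects k by 10-fold cross-validation. It is not the caret implementation used in the project: the simulated data, the function name knn_predict, the Euclidean distance, and the candidate values of k are all assumptions made for illustration.

```r
set.seed(42)

# Simulated numeric data (NOT the project dataset): two standardised
# predictors and a binary class that depends on their sum.
n <- 200
x <- scale(cbind(Age = rnorm(n, 54, 15), YearsWorked = rnorm(n, 10, 6)))
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.8) > 0, "Yes", "No"),
            levels = c("No", "Yes"))

# Classify each row of test_x by majority vote among its k nearest
# training rows (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k) {
  preds <- apply(test_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
  factor(preds, levels = levels(train_y))
}

# 10-fold cross-validation: each row is held out exactly once.
candidate_k <- c(3, 5, 7, 9)
folds <- sample(rep(1:10, length.out = n))
cv_accuracy <- sapply(candidate_k, function(k) {
  mean(sapply(1:10, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  }))
})
names(cv_accuracy) <- paste0("k=", candidate_k)
print(round(cv_accuracy, 3))

# Keep the k with the highest cross-validated accuracy
best_k <- candidate_k[which.max(cv_accuracy)]
print(best_k)
```

Because KNN relies on distances, numeric predictors are standardised here and categorical predictors would need to be encoded as numbers (for example, as indicator columns) before use.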

Experimental Results:
k-Nearest Neighbors (KNN) Model Results
The KNN model was trained and evaluated on the cancer dataset. The key findings and
performance metrics are summarized below.
Optimal Model Selection:

• The KNN model was tuned across different values of 'k' (number of neighbors),
and the optimal model was selected based on the highest cross-validated accuracy.
• The final selected model had 'k' set to 7, with a cross-validated accuracy of
approximately 74.08%.
Model Performance on Testing Data:
The KNN model was applied to the hidden dataset ('cancer_hidden') to make
predictions. The confusion matrix below summarizes the model's performance on the
testing data:
Accuracy 81.19%
Sensitivity (Recall) 37.50%
Specificity (True Negative Rate) 89.41%
Precision (Positive Predictive Value) 40.00%
F1 Score 0.3871

• The model achieves an accuracy of 81.19% on the testing data.
• The sensitivity of 37.50% and specificity of 89.41% show that the model
identifies negative cases far more reliably than positive ones.
• The F1 value of 0.3871 indicates a poor balance between precision and recall.
Conclusion:

The KNN model, with optimal 'k' set to 7, achieves 81.19% accuracy on the testing
dataset, although its sensitivity of 37.50% shows that most positive cases are missed.

2.3 Predictive Modeling for Cancer Diagnosis using Decision Trees

Model Overview:

In this study, a predictive model was developed using decision trees to diagnose
cancer. The primary goal was to leverage machine learning techniques to accurately
predict the presence or absence of cancer based on various input variables.

Variables Selection:

The selection of variables was a crucial step in building an effective predictive model.
Relevant features were carefully chosen to ensure that the model captures essential
information for cancer diagnosis. The variables were preprocessed to maintain data
integrity and enhance the model's predictive capabilities.
Data Cleaning:
• The data was considered clean as it was already in categorical form.
• The libraries rpart, rsample, and rpart.plot were loaded.
• The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.

Model Training:

The model was trained as follows:
• The predictive model was trained using the decision tree algorithm implemented
in the 'rpart' package.
• The 'caret' library was used to manage model training.
• The decision tree model was trained on the cancer dataset, with the presence or
absence of cancer as the response variable.

Experimental Results:
Decision Tree:
The decision tree generated for the cancer dataset, as illustrated in Figure 2.1,
exemplifies the adaptability of the algorithm to discern various ways of partitioning the
data into branch-like segments. This tree vividly demonstrates the algorithm's
capability to accommodate both continuous and categorical variables as objects of
analysis. The structure of the decision tree provides insights into the relationships and
decision boundaries within the cancer dataset, showcasing its ability to effectively
handle a diverse set of features for the classification task.
Figure 2.1 shows a decision tree that predicts whether a person has cancer, with a
split on whether the person's age is less than 62.
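
A decision tree chooses each split by how much it reduces class impurity. The sketch below computes the Gini impurity before and after a split on age < 62 for a small made-up sample; the data and the function names gini and split_impurity are illustrative assumptions, and the numbers do not come from the project dataset.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two groups produced by a logical condition
split_impurity <- function(labels, condition) {
  n <- length(labels)
  left  <- labels[condition]
  right <- labels[!condition]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Toy data (made up for illustration)
age    <- c(25, 34, 41, 48, 55, 60, 63, 67, 72, 80)
cancer <- c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes")

parent_gini <- gini(cancer)
child_gini  <- split_impurity(cancer, age < 62)
print(c(parent = parent_gini, after_split = child_gini))
```

In this toy sample the impurity falls from 0.48 to about 0.27, which is the kind of reduction a tree-growing algorithm looks for when it picks a split point.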
Model Performance on Testing Data:
The model was applied to the hidden dataset ('cancer_hidden') to make predictions.
The confusion matrix below summarizes the model's performance on the testing data:
Accuracy 80.02%
Sensitivity (Recall) 56.25%
Specificity (True Negative Rate) 84.71%
Precision (Positive Predictive Value) 40.91%
F1 Score 0.4737

Figure 2.1

• The model achieves an accuracy of 80.02% on the testing data.
• The sensitivity of 56.25% is the highest of the four models, while the specificity
of 84.71% is the lowest.
• The F1 value of 0.4737, the harmonic mean of the reported precision and recall,
indicates a moderate balance between the two.
Conclusion:
In summary, the decision tree-based predictive model presents a robust and
interpretable approach to cancer diagnosis. Its ability to effectively handle a diverse
set of features and provide transparent insights into decision-making processes
positions it as a valuable tool in the realm of medical decision support systems. As
with any model, ongoing refinement and validation are crucial for ensuring its
applicability and reliability in real-world clinical settings.

2.4 Random Forest Modelling:
Random Forest Model Overview:

A Random Forest is an ensemble learning method that constructs a multitude of
decision trees during training and outputs the mode of the classes (classification) or
the mean prediction (regression) of the individual trees. It is a powerful and versatile
machine learning algorithm that combines the predictions of multiple decision trees to
improve overall accuracy and robustness.

Variables Selection:

All available predictor variables were used in the Random Forest model; each tree
considers a random subset of them at every split.
Data Cleaning:
• The data was considered clean as it was already in categorical form.
• The random forest library, already loaded, was used to train the model.
• The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.

Model Training and Evaluation:

• The Random Forest model, employing 200 decision trees, was trained to
predict cancer outcomes from the features in the dataset.
• The model was evaluated with 10-fold cross-validation, and variable importance
was assessed by the "impurity" method.
• The trained model was used to make predictions on the test dataset
(cancer_hidden).

Experimental Results:
Random forest Model Summary
In order to optimize the performance of the Random Forest model, various tuning
parameters were explored through a cross-validated approach. The table below
illustrates the comparative results across different combinations of mtry and split rule,
highlighting the key metrics, including accuracy and Kappa.

mtry   Split Rule    Accuracy (%)   Kappa (%)
2      Gini          78.31          13.94
2      Extra Trees   76.98           4.68
8      Gini          77.31          24.34
8      Extra Trees   76.42          23.99
14     Gini          76.53          22.20
14     Extra Trees   76.43          26.50

• In a Random Forest, mtry is the number of randomly selected predictors
considered at each split when building the individual trees. It controls how
diverse the trees are and therefore how well the ensemble captures different
patterns in the data. The value of mtry cannot exceed the number of predictor
columns supplied to the model. A value of 14 exceeds the number of original
predictors, most likely because categorical predictors are expanded into
indicator (dummy) columns when the model is fitted through a formula
interface, which increases the column count. A larger mtry lets each split
choose among more candidate columns, at the cost of less diverse trees.
• Based on the cross-validated accuracy, the selected Random Forest model
for the cancer dataset uses the following parameters: mtry = 2, split rule =
Gini, and minimum node size = 1. The model achieves an accuracy of 78.31%
and a Kappa value of 13.94%. This configuration does not have the highest
Kappa in the table, which suggests that its accuracy advantage comes largely
from predicting the majority class.
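
The Kappa column in the table above corrects accuracy for the agreement expected by chance, which matters when one class dominates. The sketch below computes Cohen's kappa from a 2x2 confusion matrix; the function name cohen_kappa and the counts are hypothetical and chosen only to show how high accuracy can coexist with low kappa on imbalanced data.

```r
# Cohen's kappa from a square confusion matrix.
# Rows are predictions, columns are observed classes.
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical counts for an imbalanced two-class problem
cm <- matrix(c(6,   8,
               26, 162),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Pos", "Neg"),
                             Actual    = c("Pos", "Neg")))

accuracy    <- sum(diag(cm)) / sum(cm)
kappa_value <- cohen_kappa(cm)
print(round(c(accuracy = accuracy, kappa = kappa_value), 4))
```

With these counts the accuracy is about 0.83 but kappa is only about 0.18, because a model that mostly predicts the majority class agrees with the truth often by chance alone.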

Model Performance on Testing Data:


The Random Forest model was applied to the hidden dataset ('cancer_hidden') to
make predictions. The confusion matrix below summarizes the model's performance
on the testing data:
Accuracy 83.17%
Sensitivity (Recall) 18.75%
Specificity (True Negative Rate) 95.29%
Precision (Positive Predictive Value) 42.86%
F1 Score 0.2609

• The model achieves an accuracy of 83.17% on the testing data, the highest of
the four models.
• The sensitivity of 18.75% and specificity of 95.29% show that the model
classifies almost all negative cases correctly but identifies fewer than one in five
positive cases.
• The F1 value of 0.2609 indicates a poor balance between precision and recall.
Conclusion:

In summary, the Random Forest model handles complex relationships and provides
insights into variable importance, and it achieved the highest test accuracy (83.17%)
of the four models. Its sensitivity of 18.75% is the lowest, however, so it is a weak
choice when identifying positive cases is the priority.

3. Conclusion
The choice of the "best" model depends on the specific goals of the analysis.
• If identifying positive cases is crucial, the Decision Tree is preferred because it
has the highest recall of the four models (56.25%). If overall accuracy is the
priority, the Random Forest is the better choice.
• Based on accuracy alone, the Random Forest model is the best performer in this
analysis, with 83.17% on the testing data.

Model Deployment and Operationalization:


When considering the deployment and operationalization of the models, it is essential to
acknowledge that the Decision Tree model may have an advantage due to its inherent
simplicity. Its straightforward decision rules make it easier to deploy in real-world settings,
requiring less computational resources and potentially minimizing deployment complexity.
On the other hand, the Random Forest model, while offering superior accuracy, may entail a
more intricate deployment process and higher resource demands. Organizations should
weigh the practicality of each model's deployment against the specific operational
constraints and resource availability.

Long-Term Maintenance and Adaptability:


Assessing the long-term maintenance and adaptability of the models reveals valuable
insights for sustainable implementation. The Decision Tree model, with its simplicity and
interpretability, may have an advantage in terms of long-term maintenance. Its
straightforward structure allows for easier updates and adaptation to changing data patterns
or the inclusion of new features. Conversely, the Random Forest model, while providing
robust accuracy, may require more frequent retraining and maintenance due to its ensemble
nature and complexity. Organizations should consider the trade-off between model accuracy
and the sustainability of long-term maintenance when making a decision.

4. Appendix
In the appendix, the provided R code covers the data cleaning, descriptive analytics, and
modeling steps for cancer prediction. It includes finding missing values, visualizing
distributions, detecting outliers, and implementing the logistic regression, KNN, decision
tree, and random forest models. Lines beginning with # are comments that explain what
the code does in that part.

Coding for Cleaning and Descriptive Analytics:


# Import the dataset
# Assumes the dataset has already been imported into a data frame named Cancer
library(tidyverse)
data <- Cancer

#### Data Cleaning
# Check for missing values in each column
missing_values <- colSums(is.na(data))
print(missing_values)

# Remove duplicate rows
data <- distinct(data)
data

# 2. Distribution Analysis
# Plot the distribution of Age using ggplot2
ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age", x = "Age", y = "Frequency")

### Identify outliers in Age
# Create a boxplot to visualize outliers in Age
ggplot(data, aes(y = Age)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Outlier Detection in Age", y = "Age")

# Calculate IQR for Age
Q1 <- quantile(data$Age, 0.25)
Q3 <- quantile(data$Age, 0.75)
iqr_value <- Q3 - Q1  # named iqr_value to avoid masking stats::IQR()

# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value

# Identify outliers in Age
outliers <- data$Age[data$Age < lower_bound | data$Age > upper_bound]

# Print the identified outliers
print("Identified Outliers:")
print(outliers)

#Plot the distribution of Children using ggplot2


ggplot(data, aes(x = Children)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
labs(title = "Distribution of Children", x = "Children", y = "Frequency")
#Create a boxplot to visualize outliers in Age
ggplot(data, aes(y = Children)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Outlier Detection in Children", y = "Children")

# Calculate IQR for Age


Q1 <- quantile(data$Children, 0.25)
Q3 <- quantile(data$Children, 0.75)
IQR <- Q3 - Q1

# Define lower and upper bounds for outliers


lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify outliers in Children


outliers <- data$Children[data$Children < lower_bound | data$Children > upper_bound]

# Print the identified outliers


print("Identified Outliers:")
print(outliers)

#Plot of the distribution of years of work using ggplot2


ggplot(data, aes(x = `Years Worked`)) +
geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
labs(title = "Distribution of Years Worked", x = "Years Worked", y = "Frequency")
data$`Years Worked`
# Create a boxplot to visualize outliers in Years Worked
ggplot(data, aes(y = `Years Worked`)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Outlier Detection in Years Worked", y = "Years Worked")

# Calculate IQR for Years Worked


Q1 <- quantile(data$`Years Worked`, 0.25)
Q3 <- quantile(data$`Years Worked`, 0.75)
IQR <- Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify outliers in Years Worked


outliers <- data$`Years Worked`[data$`Years Worked` < lower_bound |
data$`Years Worked` > upper_bound]

# Print the identified outliers


print("Identified Outliers:")
print(outliers)
###Now for Categorical Variables:
# The dataset is stored in the data frame named 'data'
# Load necessary libraries
library(ggplot2)

# Function to create bar plots for categorical variables


create_bar_plot <- function(variable, title) {
ggplot(data, aes(x = !!sym(variable), fill = !!sym(variable))) +
geom_bar() +
labs(title = title, x = variable, y = "Frequency") +
theme_minimal()
}
data$`Online Gaming`
# Create bar plots for categorical variables
create_bar_plot("Gender", "Distribution of Gender")
create_bar_plot("Marital Status", "Distribution of Marital Status")
create_bar_plot("Smoker", "Distribution of Smoker")
create_bar_plot("Employed", "Distribution of Employment Status")
create_bar_plot("Income Level", "Distribution of Income Level")
create_bar_plot("Social Media", "Distribution of Social Media Usage")
create_bar_plot("Online Gaming", "Distribution of Online Gaming Participation")
create_bar_plot("Cancer", "Distribution of Cancer Diagnosis")

#Categorical and numerical

library(dplyr)
library(ggplot2)

# Visualize the distribution of Age by Cancer status


ggplot(data, aes(x = Cancer, y = Age, fill = Cancer)) +
geom_boxplot() +
labs(title = "Distribution of Age by Cancer Status", x = "Cancer Status", y = "Age") +

theme_minimal()

# Compare the means of Age by Cancer status


means_by_cancer <- data %>%
group_by(Cancer) %>%
summarize(mean_Age = mean(Age))

print(means_by_cancer)
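
The 1.5 * IQR rule applied above to Age, Children, and Years Worked repeats the same steps three times, so it can be wrapped in one helper. The following is a minimal sketch under stated assumptions, not part of the original analysis: the function name iqr_outliers, its argument k, and the sample values are hypothetical.

```r
# Hypothetical helper: return the values of x that fall outside the
# k * IQR fences (k = 1.5 matches the rule used in the listing above)
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                          # ignore missing values
  q1 <- quantile(x, 0.25, names = FALSE)     # first quartile
  q3 <- quantile(x, 0.75, names = FALSE)     # third quartile
  iqr_value <- q3 - q1                       # avoids masking stats::IQR
  lower_bound <- q1 - k * iqr_value
  upper_bound <- q3 + k * iqr_value
  x[x < lower_bound | x > upper_bound]
}

# Illustrative values only (not taken from the Cancer dataset)
ages <- c(25, 30, 32, 35, 38, 40, 41, 45, 95)
print(iqr_outliers(ages))
```

Assuming the data frame from the listing above, the same helper could be applied to all three numeric columns at once with lapply(data[c("Age", "Children", "Years Worked")], iqr_outliers).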

Coding for Modelling


########Methodologies
# Import the datasets
cancer = Cancer
cancer_hidden = Cancer_Hidden

# Remove the 'ID' column from the 'cancer' dataset

cancer <- subset(cancer, select = -c(ID))


# Remove the 'ID' column from the 'cancer_hidden' data set

cancer_hidden <- subset(cancer_hidden, select = -c(ID))

#Cleaning

# Data is already clean because it is in categorical form

# Import libraries for modelling

library(caret)
library(tidyverse)
library(rsample)
cancer$Cancer = factor(cancer$Cancer, labels = c("No", "Yes"))
cancer$Cancer
cancer_hidden$Cancer = factor(cancer_hidden$Cancer, labels = c("No", "Yes"))

####1. Logistic Regression model:


#training the model
cancer_model <- glm(Cancer ~ Gender + Age + `Marital Status` + Children + Smoker +
Employed + `Years Worked` + `Income Level` + `Social Media` + `Online Gaming`,
data = cancer,
family = binomial)
#Checking the summary
summary(cancer_model)

# Make predictions on the hidden dataset
cancer_prediction <- predict(cancer_model, newdata = cancer_hidden, type = "response")
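# Fix the factor levels so they match the reference even if only one class is predicted
cancer_prediction_factor <- factor(ifelse(cancer_prediction > 0.5, "Yes", "No"),
levels = c("No", "Yes"))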

# Evaluate the performance of the predictions


confusionMatrix(cancer_prediction_factor, cancer_hidden$Cancer)

#####2. KNN Model


# For reproducibility
set.seed(123)
#Training the KNN Model
KNN_cancer_model <- train(Cancer ~ .,
data = cancer,
method = "knn",
trControl = trainControl(method = "cv", number = 10))
#Checking the summary of Model
KNN_cancer_model

# KNN Predictions
KNN_cancer_prediction <- predict(KNN_cancer_model, newdata = cancer_hidden)
# Evaluate the performance by checking the confusion matrix
confusionMatrix(KNN_cancer_prediction, cancer_hidden$Cancer)

###### Classification
####3. Decision Tree
#Loading the libraries

library(rpart.plot)
library(rpart)
# Train a basic single decision tree model with response Cancer and
# every other variable included as a predictor

tree_model <- rpart(Cancer ~ ., data = cancer, method = "class")


tree_model

# Print the model summary

summary(tree_model)

# Plot the decision tree

rpart.plot(tree_model)

# Predict using the tree_model on the cancer_hidden data

TM_prediction = predict(tree_model, cancer_hidden)
TM_prediction
predicted_prob_yes <- TM_prediction[, "Yes"]
predicted_prob_yes
binary_predictions <- ifelse(predicted_prob_yes > 0.5, 1, 0)
binary_predictions

# Convert the binary predictions to a factor with the same labels as the response


TM_predictions_factor <- factor(binary_predictions, levels = c(0, 1), labels = c("No", "Yes"))

#Making sure that the levels are same for predictions and actual values
levels(cancer_hidden$Cancer)
levels(TM_predictions_factor)

# Confusion matrix
conf_matrix <- confusionMatrix(TM_predictions_factor, cancer_hidden$Cancer)
conf_matrix

####4. Random forest model

#Train the RF Model


RF_model= train(Cancer ~ .,
data = cancer,
method = "ranger",
trControl = trainControl(method = "cv",
number = 10,
verboseIter = TRUE,
classProbs = TRUE),
num.trees = 200,
importance = "impurity")
#the details of RF Model
RF_model

# Predict outcomes for the cancer_hidden dataset


RF_predictions = predict(RF_model, newdata = cancer_hidden)
RF_predictions

# Create a confusion matrix


confusionMatrix(RF_predictions, cancer_hidden$Cancer)
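
The metrics used to compare the models (accuracy, sensitivity, specificity, precision, and F1 score) all follow from the four counts in a confusion matrix. The sketch below is an illustration rather than part of the original analysis: the helper name classification_metrics and the example labels are hypothetical. It treats "Yes" as the positive class. Note that caret's confusionMatrix() uses the first factor level ("No" here) as the positive class unless its positive argument is set, so the sensitivity and specificity it reports may be swapped relative to this sketch.

```r
# Hypothetical helper: derive the evaluation metrics from predicted and
# actual labels, treating `positive` as the class of interest
classification_metrics <- function(predicted, actual, positive = "Yes") {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  tn <- sum(predicted != positive & actual != positive)  # true negatives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision <- tp / (tp + fp)
  c(accuracy = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = sensitivity,
    specificity = specificity,
    precision = precision,
    f1 = 2 * precision * sensitivity / (precision + sensitivity))
}

# Illustrative labels only (not results from the Cancer dataset)
actual <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"),
                 levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"),
                    levels = c("No", "Yes"))
metrics <- classification_metrics(predicted, actual)
print(round(metrics, 3))
```

Assuming the objects from the listing above, the same helper could be called on each model's predictions, for example classification_metrics(RF_predictions, cancer_hidden$Cancer), to place the four models side by side.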
