Final Report
Group Project
Multivariate Statistics (STAT8031 - Fall 2023 - Section 1)
Group Members:
Salma Shaheen
Student ID | 8913789
Asim Javed
Student ID | 8783167
Shivani Baireddy
Student ID | 8971642
Sai Kumar Reddy
Student ID | 8977581
Abstract
This project focuses on the development of a predictive model for cancer diagnosis using the Cancer
dataset, encompassing diverse demographic, lifestyle, and health-related variables. Our primary
objective is to enhance early detection and gain insights into factors influencing cancer incidence,
aligning with the broader goal of improving healthcare outcomes and patient well-being. Leveraging
machine learning techniques such as Random Forest, Decision Tree, KNN, and Logistic Regression,
we conducted data cleaning, exploratory data analysis, and model evaluation, utilizing metrics such
as accuracy, sensitivity, specificity, precision, and F1 score.
Table of Contents
1. Introduction
2. Methodologies
3. Conclusion
4. Appendix
1. Introduction
The project aims to develop a predictive model for cancer diagnosis based on a comprehensive
dataset. The primary goal is to enhance early detection and provide valuable insights into factors
influencing cancer incidence. This aligns with the broader objective of improving healthcare
outcomes and patient well-being.
The dataset, named 'cancer_main,' comprises diverse demographic, lifestyle, and health-related
variables for a cohort of individuals. It includes information on cancer diagnosis, gender, age,
employment status, and other relevant factors. The dataset serves as the foundation for training and
evaluating machine learning models to predict and understand patterns of cancer occurrence.
The project employs a range of machine learning techniques, including Random Forest, Decision
Tree, KNN, and Logistic Regression models. Data cleaning and exploratory data analysis have been
conducted to ensure the dataset's quality and understand its characteristics. Model performance is
evaluated through metrics such as accuracy, sensitivity, specificity, precision, and F1 score to identify
the most effective predictive model.
The significance of this project lies in its potential to contribute to the field of healthcare by providing
a reliable tool for cancer prediction. By leveraging machine learning methodologies on a rich dataset,
the project aims to offer valuable insights into the complex interactions between various factors and
cancer occurrence. The outcomes of this research can inform medical practitioners, researchers, and
policymakers to make informed decisions for cancer prevention and intervention strategies.
1.1 Data Description
The dataset consists of the following columns, each representing a different aspect of the
respondents' profiles and responses:
1 ID: A unique 5-digit identifier assigned to each respondent. These are randomly
generated and hold no real-world significance.
2 Gender: The gender of the respondent indicated as either 'Female' or 'Male'.
3 Age: The age of the respondent, ranging from 18 to 90 years.
4 Marital Status: Marital status of the respondent, categorized as 'Married', 'Single',
'Widowed', or 'Separated'.
5 Children: The number of children the respondent has, ranging from 0 to 5.
6 Smoker: Indicates whether the respondent smokes, with options 'Yes' or 'No'.
7 Employed: Employment status of the respondent, indicated as 'Yes' for employed
and 'No' for unemployed.
8 Years Worked: The total number of years the respondent has been employed,
ranging from 0 to 40 years. This value is set to 0 for those not employed.
9 Income Level: The self-assessed income level of the respondent, categorized as
'High', 'Medium', or 'Low'. Unemployed respondents automatically fall under the 'Low'
category.
10 Social Media: Indicates whether the respondent uses social media platforms, with
options 'Yes' or 'No'.
11 Online Gaming: Denotes whether the respondent engages in online gaming, with
options 'Yes' or 'No'.
12 Cancer: Indicates whether the respondent has been diagnosed with lung cancer,
with options 'Yes' or 'No'. This field is artificially manipulated based on a combination
of factors such as age, smoking status, employment duration, and lifestyle choices to
simulate potential risk factors for lung cancer.
The primary response variable in this dataset is "Cancer". It indicates whether an
individual has been diagnosed with cancer or not. Understanding the factors
contributing to the likelihood of cancer is crucial for preventive healthcare and
personalized interventions.
Predictors
Gender
Age
Marital Status
Children
Smoker
Employed
Years Worked
Income Level
Social Media
Online Gaming
These predictors were selected based on their potential relevance to the occurrence of
cancer and existing literature on similar studies.
1.2 Descriptive Analytics of Data
In preparation for the analysis, a comprehensive data cleaning process was undertaken
using the 'tidyverse' package in R. The dataset, denoted as 'Cancer_main,' was inspected
for any missing values. The summary revealed that there are no missing values in the
dataset.
Additionally, a visual exploration of the numerical variables 'Age,' 'Children,' and 'Years
Worked' was conducted using the 'GGally' package. The resulting pairs plot, Figure 1.1,
provided insights into the distribution of these variables and aided in the detection of
potential outliers.
Figure 1.1
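A minimal sketch of the missing-value check and the pairs plot follows (the column names are assumptions that should match the cleaned dataset):

# Confirm there are no missing values, then draw a pairs plot of the numeric variables
library(tidyverse)
library(GGally)
colSums(is.na(cancer_main))
ggpairs(cancer_main, columns = c("Age", "Children", "Years.Worked"))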
After visualizing the numerical variables using the pairs plot, an additional step was taken to
programmatically identify potential outliers. The Interquartile Range (IQR) method was
employed, where any data points falling outside a certain threshold (e.g., 1.5 times the IQR)
were considered potential outliers. No such outliers were found during this process.
The coding of this part is detailed in the Appendix.
The dataset contains 1,000 individuals: 776 diagnosed with cancer and 224 without the
condition, a ratio of 97:28 (roughly 3.5 to 1), reflecting the high prevalence of cancer in
this simulated cohort. The gender distribution is 515 females and 485 males. The average
age is approximately 54-55 years, with a maximum of 89; the median age of 54 is close to
the mean, indicating a nearly symmetric age distribution. The mean number of years worked
is 10, suggesting an average duration of employment or professional activity, and the
median number of children is 3. Overall, the dataset provides valuable insight into the
demographic and health characteristics of the population, emphasizing the prominence of
cancer and its potential impact on various aspects of life.
In-depth coding details and implementation can be referred to in the Appendix section.
2. Methodologies
In addressing the predictive modeling task for cancer presence, our analysis employs
a multifaceted approach encompassing diverse statistical techniques. This section
delineates the methodologies utilized, each tailored to extract valuable insights from
the dataset. Initially, we delve into logistic regression, a well-established method for
modeling binary outcomes. Logistic regression provides a foundation for
understanding the probabilistic relationship between predictor variables and the
likelihood of cancer presence. Subsequently, we explore alternative techniques, such
as K-Nearest Neighbors (KNN), Decision Tree and Random Forests, each
contributing distinct strengths to our analytical arsenal.
Before fitting any model, we partitioned the dataset into an 80:20 ratio, allocating one
subset for training each model (cancer) and the other for testing its performance
(cancer_hidden); a minimal sketch of this split appears below.
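The split can be performed with the rsample package; in this sketch the seed and object names are illustrative assumptions:

library(rsample)
set.seed(2023)                                    # illustrative seed
cancer_split  <- initial_split(cancer_main, prop = 0.8, strata = Cancer)
cancer        <- training(cancer_split)           # 80% training set
cancer_hidden <- testing(cancer_split)            # 20% testing set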
2.1 Logistic Regression Modelling:
Logistic Regression is a statistical method used for modeling the probability of a binary
outcome. It's particularly suitable for classification problems, where the response
variable is categorical with two levels (in this case, the presence or absence of
cancer). Logistic Regression models the log-odds of the probability as a linear
combination of predictor variables.
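Concretely, if p denotes the probability of a cancer diagnosis and x1, ..., xk are the
predictors, the model assumes

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

so each coefficient bj is the change in the log-odds of cancer for a one-unit change in
xj, holding the other predictors fixed.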
Data Preparation:
The analysis focuses on predicting the presence of cancer based on various features.
The dataset was split into two parts: cancer (for training) and cancer_hidden
(for prediction and evaluation).
Variables Selection:
The 'ID' column was removed from both datasets as it does not contribute to the
modeling. For the Logistic Regression model on the cancer dataset, we selected the
following variables as predictors:
Gender
Age
Marital Status
Children
Smoker
Employed
Years Worked
Income Level
Social Media
Online Gaming
The choice of these variables was based on their potential relevance to cancer
prediction. Age, marital status, smoking habits, and other factors are commonly
associated with health outcomes, making them suitable predictors for this analysis.
Data Cleaning:
No further cleaning was required, as the categorical fields were already consistently coded and the summary confirmed no missing values.
Libraries such as caret, tidyverse, and rsample were loaded for further analysis.
Response Variable Transformation: The response variable 'Cancer' was
transformed into a factor with labels 'No' and 'Yes'.
Model Training and Evaluation:
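The model was fit on the training split with R's glm() function; a minimal sketch of the call (object names assumed to match the appendix code) is:

# Logistic regression with all predictors; the binomial family gives the logit link
cancer_model <- glm(Cancer ~ ., data = cancer, family = binomial)
summary(cancer_model)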
Experimental Results:
Logistic Regression Model Summary:
The logistic regression model was built to predict the presence of cancer based on
various features. The following key coefficients were obtained:
Predictor | Coefficient
Gender (Male) | -0.15
Age | 0.03
Marital Status (Separated) | 0.31
Marital Status (Single) | 1.08
Marital Status (Widowed) | 0.03
Children | 0.32
Smoker (Yes) | 1.44
Employed (Yes) | 1.44
Years Worked | 0.03
Income Level (Low) | 0.96
Income Level (Medium) | -0.17
Social Media (Yes) | 1.26
Online Gaming (Yes) | -1.27
In the logistic regression model predicting the likelihood of having cancer, several key
predictors exhibit distinctive impacts.
Gender (Male): Men are less likely to have cancer compared to women, as
indicated by the negative coefficient.
Age: With each additional year, the log-odds of having cancer increase slightly
by 0.03, reflecting the age-related risk.
Marital Status (Separated, Single, Widowed): Being separated or single is
associated with an increased likelihood of having cancer, especially being single
which has a substantial impact. Widowed individuals also show a slightly
elevated risk.
Children: Having more children is associated with an increased log-odds of
having cancer, with the coefficient of 0.32 indicating the magnitude of this effect.
Smoker (Yes): Smoking significantly raises the likelihood of having cancer, as
evidenced by the high positive coefficient.
Employed (Yes): Being employed is associated with an increased likelihood of
having cancer compared to being unemployed.
Years Worked: Each additional year worked contributes slightly to the log-odds
of having cancer, with a coefficient of 0.03.
Income Level (Low, Medium): Having a low income is associated with an
increased likelihood of having cancer, while a medium income level is
associated with a decreased likelihood compared to other income levels.
Social Media (Yes): Using social media is linked to a significantly increased
likelihood of having cancer.
Online Gaming (Yes): Interestingly, engaging in online gaming is associated
with a decreased likelihood of having cancer.
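Because these coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. For example, exp(1.44) ≈ 4.2, so smokers have roughly four times the odds of a cancer diagnosis as non-smokers, holding the other predictors fixed. In R this is a one-liner:

# Convert all fitted coefficients to odds ratios
exp(coef(cancer_model))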
Prediction Performance on Testing Dataset:
The model's predictions on the hidden dataset resulted in a confusion matrix with the
following statistics:
Accuracy 79.2%
Sensitivity (Recall) 37.50%
Specificity 87.06%
Precision (Positive Predictive Value) 35.29%
F1 Score 0.3636
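For reference, these metrics derive from the confusion matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives) as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (Recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F1 = 2 * Precision * Recall / (Precision + Recall)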
Accuracy: The model correctly classified 79.2% of the cases in the testing dataset.
Sensitivity (Recall): A recall of 37.50% indicates that the model identified only
about a third of the actual positive cases.
Specificity: A specificity of 87.06% reflects the model's ability to identify true
negatives, showcasing the model's proficiency in distinguishing negative cases.
Precision: It measures the accuracy of positive predictions. In this case, the
model's positive predictions were accurate approximately 35.29% of the time. It
implies that when the model predicted a positive case, it was correct about
35.29% of the time.
F1 Score: The F1 score is the harmonic mean of precision and recall, providing
a balanced assessment: F1 = 2 * (0.3529 * 0.3750) / (0.3529 + 0.3750) ≈ 0.3636.
This value suggests a moderate balance between precision and recall, indicating
a trade-off between accurately identifying positive cases and avoiding false positives.
Conclusion:
The Logistic Regression model achieved reasonable overall accuracy on the cancer
dataset, although its low sensitivity means many positive cases were missed. The interpretability of
Logistic Regression is advantageous in understanding the impact of individual
predictors on the likelihood of cancer. Further investigation, including feature
importance analysis and comparison with other classification methods, will contribute
to a comprehensive understanding of the model's performance.
2.2 K-Nearest Neighbors (KNN) Modelling:
Variables Selection:
All available predictor variables were used in the KNN model to determine the
proximity of data points in the multidimensional space.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The caret library, loaded earlier, was used to train the model.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
Model Training:
The KNN model was trained using 10-fold cross-validation for robust performance
assessment; a minimal sketch of the training call appears after this list.
The model parameters, such as the number of neighbors (k) and the distance
metric, were optimized during the training process.
The model predicts 'Cancer' based on the features 'Gender,' 'Age,' 'Marital
Status,' 'Children,' 'Smoker,' 'Employed,' 'Years Worked,' 'Income Level,' 'Social
Media,' and 'Online Gaming.'
The summary of the KNN model was examined to understand the significance
of each predictor.
Predictions were made on the hidden dataset (cancer_hidden) using the trained
KNN model.
Predictions were converted to a factor, and model performance was evaluated
using confusion matrix metrics.
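A minimal sketch of the training call with caret (the seed and tuning range are illustrative assumptions):

# Tune the number of neighbors k with 10-fold cross-validation
set.seed(2023)
KNN_cancer_model <- train(
  Cancer ~ ., data = cancer,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10          # candidate values of k; k = 7 was ultimately selected
)
KNN_cancer_model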
Experimental Results:
k-Nearest Neighbors (KNN) Model Results
The k-Nearest Neighbors (KNN) model was trained and evaluated on the
cancer dataset. The key findings and performance metrics are summarized
below:
Optimal Model Selection:
The KNN model was tuned across different values of 'k' (number of neighbors),
and the optimal model was selected based on the highest accuracy.
The final selected model had 'k' set to 7, yielding an accuracy of approximately
74.08%.
Model Performance on Testing Data:
The KNN model was applied to the hidden dataset ('cancer_hidden') to make
predictions. The confusion matrix below summarizes the model's performance on the
testing data:
Accuracy 81.19%
Sensitivity (Recall) 37.50%
Specificity (True Negative Rate) 89.41%
Precision (Positive Predictive Value) 40.00%
F1 Score 0.3870
The KNN model, with optimal 'k' set to 7, exhibits promising performance on the
testing dataset.
2.3 Decision Tree Modelling:
Model Overview:
In this study, a predictive model was developed using decision trees to diagnose
cancer. The primary goal was to leverage machine learning techniques to accurately
predict the presence or absence of cancer based on various input variables.
Variables Selection:
The selection of variables was a crucial step in building an effective predictive model.
Relevant features were carefully chosen to ensure that the model captures essential
information for cancer diagnosis. The variables were preprocessed to maintain data
integrity and enhance the model's predictive capabilities.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The rpart, rsample, and rpart.plot libraries were loaded.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
Model Training:
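A minimal sketch of the fitting call (the exact arguments are assumptions; the resulting tree is shown in Figure 2.1):

# Fit a single classification tree with Cancer as the response and every other
# variable as a predictor, then plot it
tree_model <- rpart(Cancer ~ ., data = cancer, method = "class")
rpart.plot(tree_model)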
Experimental Results:
Decision Tree:
The decision tree generated for the cancer dataset, as illustrated in Figure 2.1,
exemplifies the adaptability of the algorithm to discern various ways of partitioning the
data into branch-like segments. This tree vividly demonstrates the algorithm's
capability to accommodate both continuous and categorical variables as objects of
analysis. The structure of the decision tree provides insights into the relationships and
decision boundaries within the cancer dataset, showcasing its ability to effectively
handle a diverse set of features for the classification task.
Figure 2.1 shows a decision tree that reasons about whether a person will develop
cancer, beginning with a split on whether their age is less than 62.
Model Performance on Testing Data:
The model was applied to the hidden dataset ('cancer_hidden') to make predictions.
The confusion matrix below summarizes the model's performance on the testing data:
Accuracy 80.02%
Sensitivity (Recall) 56.25%
Specificity (True Negative Rate) 84.71%
Precision (Positive Predictive Value) 40.91%
F1 Score 0.4875
Figure 2.1
2.4 Random Forest Modelling:
Random Forest Model Overview:
Variables Selection:
All available predictor variables were used in the Random Forest model to determine
the proximity of data points in the multidimensional space.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The random forest library, loaded earlier, was used to train the model.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
The Random Forest model, employing 200 decision trees, was trained to
predict cancer outcomes based on diverse features within the dataset.
Utilizing 10-fold cross-validation for robust evaluation, the model demonstrated
notable strength in handling complex relationships and provided insights into
variable importance, as assessed by the "impurity" method.
The trained model was then used to make predictions on the test dataset
(cancer_hidden); a minimal sketch of the training call follows.
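This sketch assumes the ranger engine via caret, inferred from the split-rule and node-size tuning parameters reported below; the seed and object name are illustrative:

# Random forest with 200 trees, 10-fold CV, and impurity-based variable importance
set.seed(2023)
rf_model <- train(
  Cancer ~ ., data = cancer,
  method = "ranger",
  num.trees = 200,
  importance = "impurity",
  trControl = trainControl(method = "cv", number = 10)
)
rf_model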
Experimental Results:
Random Forest Model Summary:
In order to optimize the performance of the Random Forest model, various tuning
parameters were explored through a cross-validated approach. The table below
illustrates the comparative results across different combinations of mtry and split rule,
highlighting the key metrics, including accuracy and Kappa.
mtry | Split Rule | Accuracy (%) | Kappa (%)
2 | Gini | 78.31 | 13.94
2 | Extra Trees | 76.98 | 4.68
8 | Gini | 77.31 | 24.34
8 | Extra Trees | 76.42 | 23.99
14 | Gini | 76.53 | 22.20
14 | Extra Trees | 76.43 | 26.50
Note: in the context of Random Forest, mtry refers to the number of randomly
selected predictors considered at each split when building the individual trees in
the ensemble, and it plays a crucial role in the model's ability to capture diverse
patterns within the dataset. Although the dataset contains only 11 predictors,
caret expands the categorical predictors into dummy (indicator) variables before
training, so the design matrix has more than 11 columns; this helps explain why
mtry values as large as 14 appear in the tuning grid. Larger mtry values let each
split consider a wider subset of these columns, which can help capture complex
patterns or interactions that a smaller subset might miss. The grid itself can be
written out explicitly, as sketched below.
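A sketch of that grid (caret may also generate an equivalent grid by default):

# Candidate parameter combinations evaluated during cross-validation
rf_grid <- expand.grid(
  mtry          = c(2, 8, 14),
  splitrule     = c("gini", "extratrees"),
  min.node.size = 1
)
# Passed to train() via tuneGrid = rf_grid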
Based on the cross-validated performance, the optimal Random Forest model
for the cancer dataset is achieved with the following parameters: mtry = 2,
split rule = Gini, and minimum node size = 1. The model exhibits an
accuracy of 78.31% and a Kappa value of 13.94%.
Model Performance on Testing Data:
The model was applied to the hidden dataset ('cancer_hidden') to make predictions;
the confusion matrix yielded the following metrics:
Specificity (True Negative Rate) 95.29%
Precision (Positive Predictive Value) 42.86%
F1 Score 0.2660
In summary, the Random Forest model, with its ability to handle complex
relationships and provide insights into variable importance, proves to be a valuable
tool for predicting cancer outcomes. The achieved results, along with a thorough
understanding of the model's strengths and limitations, contribute to its applicability
and reliability in the context of cancer prediction.
3. Conclusion
The choice of the "best" model depends on the specific goals of the analysis.
If identifying positive cases is crucial, the Decision Tree might be preferred due to
its comparatively high recall (56.25%). If overall accuracy is the priority, the Random
Forest might be the better choice.
Based on the accuracy metric alone, the Random Forest model achieved the
highest accuracy among the models listed. Therefore, in terms of accuracy, the
Random Forest model is considered the best performer in this specific analysis.
4. Appendix
In the appendix, the provided R code showcases the data cleaning, descriptive analytics,
and modeling processes for cancer prediction. It includes steps for finding missing values,
visualizing distributions, detecting outliers, and implementing the logistic regression, KNN,
decision tree, and random forest models. The code facilitates a comprehensive
understanding of the dataset and the application of various machine learning techniques for
predictive modeling in cancer diagnosis. Lines beginning with # are comments explaining
what each part of the code does.
# Outlier detection with the IQR method (shown here for Age; the same steps were
# applied to the other numeric variables)
Q1 <- quantile(cancer_main$Age, 0.25)
Q3 <- quantile(cancer_main$Age, 0.75)
IQR <- Q3 - Q1
# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- cancer_main$Age[cancer_main$Age < lower_bound | cancer_main$Age > upper_bound]
# Print the identified outliers
print("Identified Outliers:")
print(outliers)
library(dplyr)
library(ggplot2)
# Group means by cancer status (assumed reconstruction; the original listing ended a
# ggplot call with theme_minimal() before printing the summary below)
means_by_cancer <- cancer_main %>% group_by(Cancer) %>% summarise(mean_age = mean(Age))
print(means_by_cancer)
#Cleaning
library(caret)
library(tidyverse)
library(rsample)
cancer$Cancer = factor(cancer$Cancer, labels = c("No", "Yes"))
cancer$Cancer
cancer_hidden$Cancer = factor(cancer_hidden$Cancer, labels = c("No", "Yes"))
# Fit the logistic regression model (assumed call; see Section 2.1)
cancer_model <- glm(Cancer ~ ., data = cancer, family = binomial)
# Make predictions on the hidden dataset and evaluate with a confusion matrix
cancer_prediction <- predict(cancer_model, newdata = cancer_hidden, type = "response")
cancer_prediction_factor <- as.factor(ifelse(cancer_prediction > 0.5, "Yes", "No"))
confusionMatrix(cancer_prediction_factor, cancer_hidden$Cancer)
# Train the KNN model with 10-fold cross-validation (assumed call; see Section 2.2)
KNN_cancer_model <- train(Cancer ~ ., data = cancer, method = "knn",
                          trControl = trainControl(method = "cv", number = 10))
# KNN predictions on the hidden dataset
KNN_cancer_prediction <- predict(KNN_cancer_model, newdata = cancer_hidden)
# Evaluate the performance by checking the confusion matrix
confusionMatrix(KNN_cancer_prediction, cancer_hidden$Cancer)
###### Classification
#### 3. Decision Tree
# Loading the libraries
library(rpart.plot)
library(rpart)
# Training a basic single decision tree model with response Cancer and every
# other variable included as a predictor (assumed call)
tree_model <- rpart(Cancer ~ ., data = cancer, method = "class")
summary(tree_model)
rpart.plot(tree_model)
TM_prediction = predict(tree_model, cancer_hidden)
TM_prediction
predicted_prob_yes <- TM_prediction[, "Yes"]
predicted_prob_yes
# Convert predicted probabilities to class labels
TM_predictions_factor <- factor(ifelse(predicted_prob_yes > 0.5, "Yes", "No"),
                                levels = levels(cancer_hidden$Cancer))
# Making sure that the levels are the same for predictions and actual values
levels(cancer_hidden$Cancer)
levels(TM_predictions_factor)
# Confusion Matrix
conf_matrix <- confusionMatrix(TM_predictions_factor, cancer_hidden$Cancer)
conf_matrix