Final Report
Group Project
Multivariate Statistics (STAT8031 - Fall 2023 - Section 1)
Group Members:
Salma Shaheen
Student ID | 8913789
Asim Javed
Student ID | 8783167
Shivani Baireddy
Student ID | 8971642
Sai Kumar Reddy
Student ID | 8977581
Abstract
This project focuses on the development of a predictive model for cancer diagnosis using the Cancer
dataset, encompassing diverse demographic, lifestyle, and health-related variables. Our primary
objective is to enhance early detection and gain insights into factors influencing cancer incidence,
aligning with the broader goal of improving healthcare outcomes and patient well-being. Leveraging
machine learning techniques such as Random Forest, Decision Tree, KNN, and Logistic Regression,
we conducted data cleaning, exploratory data analysis, and model evaluation, utilizing metrics such
as accuracy, sensitivity, specificity, precision, and F1 score.
Table of Contents
1. Introduction
2. Methodologies
3. Conclusion
4. Appendix
1. Introduction
The project aims to develop a predictive model for cancer diagnosis based on a comprehensive
dataset. The primary goal is to enhance early detection and provide valuable insights into factors
influencing cancer incidence. This aligns with the broader objective of improving healthcare
outcomes and patient well-being.
The dataset, named 'cancer_main,' comprises diverse demographic, lifestyle, and health-related
variables for a cohort of individuals. It includes information on cancer diagnosis, gender, age,
employment status, and other relevant factors. The dataset serves as the foundation for training and
evaluating machine learning models to predict and understand patterns of cancer occurrence.
The project employs a range of machine learning techniques, including Random Forest, Decision
Tree, KNN, and Logistic Regression models. Data cleaning and exploratory data analysis have been
conducted to ensure the dataset's quality and understand its characteristics. Model performance is
evaluated through metrics such as accuracy, sensitivity, specificity, precision, and F1 score to identify
the most effective predictive model.
The significance of this project lies in its potential to contribute to the field of healthcare by providing
a reliable tool for cancer prediction. By leveraging machine learning methodologies on a rich dataset,
the project aims to offer valuable insights into the complex interactions between various factors and
cancer occurrence. The outcomes of this research can inform medical practitioners, researchers, and
policymakers to make informed decisions for cancer prevention and intervention strategies.
1.1 Data Description
The dataset consists of the following columns, each representing a different aspect of the
respondents' profiles and responses:
1 ID: A unique 5-digit identifier assigned to each respondent. These are randomly
generated and hold no real-world significance.
2 Gender: The gender of the respondent indicated as either 'Female' or 'Male'.
3 Age: The age of the respondent, ranging from 18 to 90 years.
4 Marital Status: Marital status of the respondent, categorized as 'Married', 'Single',
'Widowed', or 'Separated'.
5 Children: The number of children the respondent has, ranging from 0 to 5.
6 Smoker: Indicates whether the respondent smokes, with options 'Yes' or 'No'.
7 Employed: Employment status of the respondent, indicated as 'Yes' for employed
and 'No' for unemployed.
8 Years Worked: The total number of years the respondent has been employed,
ranging from 0 to 40 years. This value is set to 0 for those not employed.
9 Income Level: The self-assessed income level of the respondent, categorized as
'High', 'Medium', or 'Low'. Unemployed respondents automatically fall under the 'Low'
category.
10 Social Media: Indicates whether the respondent uses social media platforms, with
options 'Yes' or 'No'.
11 Online Gaming: Denotes whether the respondent engages in online gaming, with
options 'Yes' or 'No'.
12 Cancer: Indicates whether the respondent has been diagnosed with lung cancer,
with options 'Yes' or 'No'. This field is artificially manipulated based on a combination
of factors such as age, smoking status, employment duration, and lifestyle choices to
simulate potential risk factors for lung cancer.
The primary response variable in this dataset is "Cancer". It indicates whether an
individual has been diagnosed with cancer or not. Understanding the factors
contributing to the likelihood of cancer is crucial for preventive healthcare and
personalized interventions.
Predictors
Gender
Age
Marital Status
Children
Smoker
Employed
Years Worked
Income Level
Social Media
Online Gaming
These predictors were selected based on their potential relevance to the occurrence of
cancer and existing literature on similar studies.
1.2 Descriptive Analytics of Data
In preparation for the analysis, a comprehensive data cleaning process was undertaken
using the 'tidyverse' package in R. The dataset, denoted as 'Cancer_main,' was inspected
for any missing values. The summary revealed that there are no missing values in the
dataset.
Additionally, a visual exploration of the numerical variables 'Age,' 'Children,' and 'Years
Worked' was conducted using the 'GGally' package. The resulting pairs plot, Figure 1.1,
provided insights into the distribution of these variables and aided in the detection of
potential outliers.
Figure 1.1
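A minimal sketch of the missing-value check and the pairs plot follows (the column names are assumptions that should match the cleaned dataset):

# Confirm there are no missing values, then draw a pairs plot of the numeric variables
library(tidyverse)
library(GGally)
colSums(is.na(cancer_main))
ggpairs(cancer_main, columns = c("Age", "Children", "Years.Worked"))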
After visualizing the numerical variables using the pairs plot, an additional step was taken to
programmatically identify potential outliers. The Interquartile Range (IQR) method was
employed, where any data points falling outside a certain threshold (e.g., 1.5 times the IQR)
were considered potential outliers. No such outliers were found during this process.
The coding of this part is detailed in the Appendix.
The dataset contains 1,000 individuals: 776 diagnosed with cancer and 224 without the
condition, a ratio of 97:28 (roughly 3.5 to 1), reflecting the high prevalence of cancer in
this simulated cohort. The gender distribution is 515 females and 485 males. The average
age is approximately 54-55 years, with a maximum of 89; the median age of 54 is close to
the mean, indicating a nearly symmetric age distribution. The mean number of years worked
is 10, suggesting an average duration of employment or professional activity, and the
median number of children is 3. Overall, the dataset provides valuable insight into the
demographic and health characteristics of the population, emphasizing the prominence of
cancer and its potential impact on various aspects of life.
In-depth coding details and implementation can be referred to in the Appendix section.
2. Methodologies
In addressing the predictive modeling task for cancer presence, our analysis employs
a multifaceted approach encompassing diverse statistical techniques. This section
delineates the methodologies utilized, each tailored to extract valuable insights from
the dataset. Initially, we delve into logistic regression, a well-established method for
modeling binary outcomes. Logistic regression provides a foundation for
understanding the probabilistic relationship between predictor variables and the
likelihood of cancer presence. Subsequently, we explore alternative techniques, such
as K-Nearest Neighbors (KNN), Decision Tree and Random Forests, each
contributing distinct strengths to our analytical arsenal.
Before fitting any model, we partitioned the dataset into an 80:20 ratio, allocating one
subset for training each model (cancer) and the other for testing its performance
(cancer_hidden); a minimal sketch of this split appears below.
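The split can be performed with the rsample package; in this sketch the seed and object names are illustrative assumptions:

library(rsample)
set.seed(2023)                                    # illustrative seed
cancer_split  <- initial_split(cancer_main, prop = 0.8, strata = Cancer)
cancer        <- training(cancer_split)           # 80% training set
cancer_hidden <- testing(cancer_split)            # 20% testing set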
2.1 Logistic Regression Modelling:
Logistic Regression is a statistical method used for modeling the probability of a binary
outcome. It's particularly suitable for classification problems, where the response
variable is categorical with two levels (in this case, the presence or absence of
cancer). Logistic Regression models the log-odds of the probability as a linear
combination of predictor variables.
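Concretely, if p denotes the probability of a cancer diagnosis and x1, ..., xk are the
predictors, the model assumes

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

so each coefficient bj is the change in the log-odds of cancer for a one-unit change in
xj, holding the other predictors fixed.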
Data Preparation:
The analysis focuses on predicting the presence of cancer based on various features.
The dataset was split into two parts: cancer (for training) and cancer_hidden
(for prediction and evaluation).
Variables Selection:
The 'ID' column was removed from both datasets as it does not contribute to the
modeling. For the Logistic Regression model on the cancer dataset, we selected the
following variables as predictors:
Gender
Age
Marital Status
Children
Smoker
Employed
Years Worked
Income Level
Social Media
Online Gaming
The choice of these variables was based on their potential relevance to cancer
prediction. Age, marital status, smoking habits, and other factors are commonly
associated with health outcomes, making them suitable predictors for this analysis.
Data Cleaning:
No further cleaning was required, as the categorical fields were already consistently coded and the summary confirmed no missing values.
Libraries such as caret, tidyverse, and rsample were loaded for further analysis.
Response Variable Transformation: The response variable 'Cancer' was
transformed into a factor with labels 'No' and 'Yes'.
Model Training and Evaluation:
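The model was fit on the training split with R's glm() function; a minimal sketch of the call (object names assumed to match the appendix code) is:

# Logistic regression with all predictors; the binomial family gives the logit link
cancer_model <- glm(Cancer ~ ., data = cancer, family = binomial)
summary(cancer_model)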
Experimental Results:
Logistic Regression Model Summary:
The logistic regression model was built to predict the presence of cancer based on
various features. The following key coefficients were obtained:
Predictor | Coefficient
Gender (Male) | -0.15
Age | 0.03
Marital Status (Separated) | 0.31
Marital Status (Single) | 1.08
Marital Status (Widowed) | 0.03
Children | 0.32
Smoker (Yes) | 1.44
Employed (Yes) | 1.44
Years Worked | 0.03
Income Level (Low) | 0.96
Income Level (Medium) | -0.17
Social Media (Yes) | 1.26
Online Gaming (Yes) | -1.27
In the logistic regression model predicting the likelihood of having cancer, several key
predictors exhibit distinctive impacts.
Gender (Male): Men are less likely to have cancer compared to women, as
indicated by the negative coefficient.
Age: With each additional year, the log-odds of having cancer increase slightly
by 0.03, reflecting the age-related risk.
Marital Status (Separated, Single, Widowed): Being separated or single is
associated with an increased likelihood of having cancer, especially being single
which has a substantial impact. Widowed individuals also show a slightly
elevated risk.
Children: Having more children is associated with an increased log-odds of
having cancer, with the coefficient of 0.32 indicating the magnitude of this effect.
Smoker (Yes): Smoking significantly raises the likelihood of having cancer, as
evidenced by the high positive coefficient.
Employed (Yes): Being employed is associated with an increased likelihood of
having cancer compared to being unemployed.
Years Worked: Each additional year worked contributes slightly to the log-odds
of having cancer, with a coefficient of 0.03.
Income Level (Low, Medium): Having a low income is associated with an
increased likelihood of having cancer, while a medium income level is
associated with a decreased likelihood compared to other income levels.
Social Media (Yes): Using social media is linked to a significantly increased
likelihood of having cancer.
Online Gaming (Yes): Interestingly, engaging in online gaming is associated
with a decreased likelihood of having cancer.
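Because these coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. For example, exp(1.44) ≈ 4.2, so smokers have roughly four times the odds of a cancer diagnosis as non-smokers, holding the other predictors fixed. In R this is a one-liner:

# Convert all fitted coefficients to odds ratios
exp(coef(cancer_model))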
Prediction Performance on Testing Dataset:
The model's predictions on the hidden dataset resulted in a confusion matrix with the
following statistics:
Accuracy 79.2%
Sensitivity (Recall) 37.50%
Specificity 87.06%
Precision (Positive Predictive Value) 35.29%
F1 Score 0.3636
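For reference, these metrics derive from the confusion matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives) as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (Recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F1 = 2 * Precision * Recall / (Precision + Recall)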
Accuracy: The model correctly classified 79.2% of the cases in the testing dataset.
Sensitivity (Recall): A recall of 37.50% indicates that the model identified only
about a third of the actual positive cases.
Specificity: A specificity of 87.06% reflects the model's ability to identify true
negatives, showcasing the model's proficiency in distinguishing negative cases.
Precision: It measures the accuracy of positive predictions. In this case, the
model's positive predictions were accurate approximately 35.29% of the time. It
implies that when the model predicted a positive case, it was correct about
35.29% of the time.
F1 Score: The F1 score is the harmonic mean of precision and recall, providing
a balanced assessment: F1 = 2 * (0.3529 * 0.3750) / (0.3529 + 0.3750) ≈ 0.3636.
This value suggests a moderate balance between precision and recall, indicating
a trade-off between accurately identifying positive cases and avoiding false positives.
Conclusion:
The Logistic Regression model achieved reasonable overall accuracy on the cancer
dataset, although its low sensitivity means many positive cases were missed. The interpretability of
Logistic Regression is advantageous in understanding the impact of individual
predictors on the likelihood of cancer. Further investigation, including feature
importance analysis and comparison with other classification methods, will contribute
to a comprehensive understanding of the model's performance.
2.2 K-Nearest Neighbors (KNN) Modelling:
Variables Selection:
All available predictor variables were used in the KNN model to determine the
proximity of data points in the multidimensional space.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The caret library, loaded earlier, was used to train the model.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
Model Training:
The KNN model was trained using 10-fold cross-validation for robust performance
assessment; a minimal sketch of the training call appears after this list.
The model parameters, such as the number of neighbors (k) and the distance
metric, were optimized during the training process.
The model predicts 'Cancer' based on the features 'Gender,' 'Age,' 'Marital
Status,' 'Children,' 'Smoker,' 'Employed,' 'Years Worked,' 'Income Level,' 'Social
Media,' and 'Online Gaming.'
The summary of the KNN model was examined to understand the significance
of each predictor.
Predictions were made on the hidden dataset (cancer_hidden) using the trained
KNN model.
Predictions were converted to a factor, and model performance was evaluated
using confusion matrix metrics.
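A minimal sketch of the training call with caret (the seed and tuning range are illustrative assumptions):

# Tune the number of neighbors k with 10-fold cross-validation
set.seed(2023)
KNN_cancer_model <- train(
  Cancer ~ ., data = cancer,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10          # candidate values of k; k = 7 was ultimately selected
)
KNN_cancer_model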
Experimental Results:
k-Nearest Neighbors (KNN) Model Results
The k-Nearest Neighbors (KNN) model was trained and evaluated on the
cancer dataset. The key findings and performance metrics are summarized
below:
Optimal Model Selection:
The KNN model was tuned across different values of 'k' (number of neighbors),
and the optimal model was selected based on the highest accuracy.
The final selected model had 'k' set to 7, yielding an accuracy of approximately
74.08%.
Model Performance on Testing Data:
The KNN model was applied to the hidden dataset ('cancer_hidden') to make
predictions. The confusion matrix below summarizes the model's performance on the
testing data:
Accuracy 81.19%
Sensitivity (Recall) 37.50%
Specificity (True Negative Rate) 89.41%
Precision (Positive Predictive Value) 40.00%
F1 Score 0.3870
The KNN model, with optimal 'k' set to 7, exhibits promising performance on the
testing dataset.
2.3 Decision Tree Modelling:
Model Overview:
In this study, a predictive model was developed using decision trees to diagnose
cancer. The primary goal was to leverage machine learning techniques to accurately
predict the presence or absence of cancer based on various input variables.
Variables Selection:
The selection of variables was a crucial step in building an effective predictive model.
Relevant features were carefully chosen to ensure that the model captures essential
information for cancer diagnosis. The variables were preprocessed to maintain data
integrity and enhance the model's predictive capabilities.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The rpart, rsample, and rpart.plot libraries were loaded.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
Model Training:
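A minimal sketch of the fitting call (the exact arguments are assumptions; the resulting tree is shown in Figure 2.1):

# Fit a single classification tree with Cancer as the response and every other
# variable as a predictor, then plot it
tree_model <- rpart(Cancer ~ ., data = cancer, method = "class")
rpart.plot(tree_model)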
Experimental Results:
Decision Tree:
The decision tree generated for the cancer dataset, as illustrated in Figure 2.1,
exemplifies the adaptability of the algorithm to discern various ways of partitioning the
data into branch-like segments. This tree vividly demonstrates the algorithm's
capability to accommodate both continuous and categorical variables as objects of
analysis. The structure of the decision tree provides insights into the relationships and
decision boundaries within the cancer dataset, showcasing its ability to effectively
handle a diverse set of features for the classification task.
Figure 2.1 shows a decision tree that reasons about whether a person will develop
cancer, beginning with a split on whether their age is less than 62.
Model Performance on Testing Data:
The model was applied to the hidden dataset ('cancer_hidden') to make predictions.
The confusion matrix below summarizes the model's performance on the testing data:
Accuracy 80.02%
Sensitivity (Recall) 56.25%
Specificity (True Negative Rate) 84.71%
Precision (Positive Predictive Value) 40.91%
F1 Score 0.4875
Figure 2.1
2.4 Random Forest Modelling:
Random Forest Model Overview:
Variables Selection:
All available predictor variables were used in the Random Forest model to determine
the proximity of data points in the multidimensional space.
Data Cleaning:
No further cleaning was required, as the data were already consistently coded.
The random forest library, loaded earlier, was used to train the model.
The response variable 'Cancer' was already a factor with labels 'No' and 'Yes'.
The Random Forest model, employing 200 decision trees, was trained to
predict cancer outcomes based on diverse features within the dataset.
Utilizing 10-fold cross-validation for robust evaluation, the model demonstrated
notable strength in handling complex relationships and provided insights into
variable importance, as assessed by the "impurity" method.
The trained model was then used to make predictions on the test dataset
(cancer_hidden); a minimal sketch of the training call follows.
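This sketch assumes the ranger engine via caret, inferred from the split-rule and node-size tuning parameters reported below; the seed and object name are illustrative:

# Random forest with 200 trees, 10-fold CV, and impurity-based variable importance
set.seed(2023)
rf_model <- train(
  Cancer ~ ., data = cancer,
  method = "ranger",
  num.trees = 200,
  importance = "impurity",
  trControl = trainControl(method = "cv", number = 10)
)
rf_model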
Experimental Results:
Random Forest Model Summary:
In order to optimize the performance of the Random Forest model, various tuning
parameters were explored through a cross-validated approach. The table below
illustrates the comparative results across different combinations of mtry and split rule,
highlighting the key metrics, including accuracy and Kappa.
mtry | Split Rule | Accuracy (%) | Kappa (%)
2 | Gini | 78.31 | 13.94
2 | Extra Trees | 76.98 | 4.68
8 | Gini | 77.31 | 24.34
8 | Extra Trees | 76.42 | 23.99
14 | Gini | 76.53 | 22.20
14 | Extra Trees | 76.43 | 26.50
Note: in the context of Random Forest, mtry refers to the number of randomly
selected predictors considered at each split when building the individual trees in
the ensemble, and it plays a crucial role in the model's ability to capture diverse
patterns within the dataset. Although the dataset contains only 11 predictors,
caret expands the categorical predictors into dummy (indicator) variables before
training, so the design matrix has more than 11 columns; this helps explain why
mtry values as large as 14 appear in the tuning grid. Larger mtry values let each
split consider a wider subset of these columns, which can help capture complex
patterns or interactions that a smaller subset might miss. The grid itself can be
written out explicitly, as sketched below.
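A sketch of that grid (caret may also generate an equivalent grid by default):

# Candidate parameter combinations evaluated during cross-validation
rf_grid <- expand.grid(
  mtry          = c(2, 8, 14),
  splitrule     = c("gini", "extratrees"),
  min.node.size = 1
)
# Passed to train() via tuneGrid = rf_grid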
Based on the cross-validated performance, the optimal Random Forest model
for the cancer dataset is achieved with the following parameters: mtry = 2,
split rule = Gini, and minimum node size = 1. The model exhibits an
accuracy of 78.31% and a Kappa value of 13.94%.
Model Performance on Testing Data:
The model was applied to the hidden dataset ('cancer_hidden') to make predictions;
the confusion matrix yielded the following metrics:
Specificity (True Negative Rate) 95.29%
Precision (Positive Predictive Value) 42.86%
F1 Score 0.2660
In summary, the Random Forest model, with its ability to handle complex
relationships and provide insights into variable importance, proves to be a valuable
tool for predicting cancer outcomes. The achieved results, along with a thorough
understanding of the model's strengths and limitations, contribute to its applicability
and reliability in the context of cancer prediction.
3. Conclusion
The choice of the "best" model depends on the specific goals of the analysis.
If identifying positive cases is crucial, the Decision Tree might be preferred due to
its comparatively high recall (56.25%). If overall accuracy is the priority, the Random
Forest might be the better choice.
Based on the accuracy metric alone, the Random Forest model achieved the
highest accuracy among the models listed. Therefore, in terms of accuracy, the
Random Forest model is considered the best performer in this specific analysis.
4. Appendix
In the appendix, the provided R code showcases the data cleaning, descriptive analytics,
and modeling processes for cancer prediction. It includes steps for finding missing values,
visualizing distributions, detecting outliers, and implementing the logistic regression, KNN,
decision tree, and random forest models. The code facilitates a comprehensive
understanding of the dataset and the application of various machine learning techniques for
predictive modeling in cancer diagnosis. Lines beginning with # are comments explaining
what each part of the code does.
# Outlier detection with the IQR method (shown here for Age; the same steps were
# applied to the other numeric variables)
Q1 <- quantile(cancer_main$Age, 0.25)
Q3 <- quantile(cancer_main$Age, 0.75)
IQR <- Q3 - Q1
# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- cancer_main$Age[cancer_main$Age < lower_bound | cancer_main$Age > upper_bound]
# Print the identified outliers
print("Identified Outliers:")
print(outliers)
library(dplyr)
library(ggplot2)
# Group means by cancer status (assumed reconstruction; the original listing ended a
# ggplot call with theme_minimal() before printing the summary below)
means_by_cancer <- cancer_main %>% group_by(Cancer) %>% summarise(mean_age = mean(Age))
print(means_by_cancer)
#Cleaning
library(caret)
library(tidyverse)
library(rsample)
cancer$Cancer = factor(cancer$Cancer, labels = c("No", "Yes"))
cancer$Cancer
cancer_hidden$Cancer = factor(cancer_hidden$Cancer, labels = c("No", "Yes"))
# Fit the logistic regression model (assumed call; see Section 2.1)
cancer_model <- glm(Cancer ~ ., data = cancer, family = binomial)
# Make predictions on the hidden dataset and evaluate with a confusion matrix
cancer_prediction <- predict(cancer_model, newdata = cancer_hidden, type = "response")
cancer_prediction_factor <- as.factor(ifelse(cancer_prediction > 0.5, "Yes", "No"))
confusionMatrix(cancer_prediction_factor, cancer_hidden$Cancer)
# Train the KNN model with 10-fold cross-validation (assumed call; see Section 2.2)
KNN_cancer_model <- train(Cancer ~ ., data = cancer, method = "knn",
                          trControl = trainControl(method = "cv", number = 10))
# KNN predictions on the hidden dataset
KNN_cancer_prediction <- predict(KNN_cancer_model, newdata = cancer_hidden)
# Evaluate the performance by checking the confusion matrix
confusionMatrix(KNN_cancer_prediction, cancer_hidden$Cancer)
###### Classification
#### 3. Decision Tree
# Loading the libraries
library(rpart.plot)
library(rpart)
# Training a basic single decision tree model with response Cancer and every
# other variable included as a predictor (assumed call)
tree_model <- rpart(Cancer ~ ., data = cancer, method = "class")
summary(tree_model)
rpart.plot(tree_model)
TM_prediction = predict(tree_model, cancer_hidden)
TM_prediction
predicted_prob_yes <- TM_prediction[, "Yes"]
predicted_prob_yes
# Convert predicted probabilities to class labels
TM_predictions_factor <- factor(ifelse(predicted_prob_yes > 0.5, "Yes", "No"),
                                levels = levels(cancer_hidden$Cancer))
# Making sure that the levels are the same for predictions and actual values
levels(cancer_hidden$Cancer)
levels(TM_predictions_factor)
# Confusion Matrix
conf_matrix <- confusionMatrix(TM_predictions_factor, cancer_hidden$Cancer)
conf_matrix