
DataScience Lung Cancer Predection M2

This document discusses using machine learning models to predict lung cancer. It begins with an abstract describing how machine learning models can be trained on data from previously diagnosed patients, such as lung photos or medical test results, to help doctors diagnose lung cancer in new patients. The document then provides an introduction to machine learning and discusses its benefits and applications, including for predictive analysis, fraud detection, and product recommendations. It highlights that the objective is to predict lung cancer quickly and easily using a classification algorithm.

Uploaded by

Mizna Amousa
Copyright
© All Rights Reserved

Lung Cancer Prediction

Fatima Mohammed Alhajji
College of Computer Science and Information Technology, King Faisal University
Ahssa, Saudi Arabia
Email: [email protected]

Sarah AlGhowanim
College of Computer Science and Information Technology, King Faisal University
Alahsa, Saudi Arabia
[email protected]

Fatima Ali Almuways
College of Computer Science and Information Technology, King Faisal University
Alahsa, Saudi Arabia
[email protected]

Abstract— Machine learning-based lung cancer prediction models are provided to help doctors and medical centers in lung cancer diagnosis. These models use data collected from previously diagnosed patients, which may be lung images, vital signs, or the results of other medical tests performed to diagnose lung cancer. The benefit of using prediction models is to support decision-making, risk reduction, and damage limitation depending on each patient's case. In this report, we present and discuss the results of the prediction model built by the team members. We also highlight the techniques and algorithms used.

Keywords—Machine Learning, Artificial Intelligence, Prediction

I. INTRODUCTION
Machine learning is a branch of artificial intelligence and computer science that focuses on using data and algorithms to emulate the way humans learn, with the goal of steadily improving accuracy. Nowadays, machine learning models can be used to predict many diseases. Using machine learning algorithms and techniques can help save people's lives by predicting illness quickly. Cancer is a condition in which some cells in the body grow out of control and spread to other parts of the body. Cancer can begin almost anywhere in the cells that make up the human body. This report focuses on cancer of the lungs. Today, the rate of lung cancer is increasing, and it can affect both men and women. Lung cancer cannot be prevented once it has developed, although it can be detected early and the risks reduced. In this report, lung cancer is predicted with a classification algorithm. There are many classification algorithms, such as Naïve Bayes, Support Vector Machines, Decision Trees, and K-Nearest Neighbors. The objective of this report is to predict lung cancer quickly and easily using a specific algorithm.

II. EASE OF USE
A. What is machine learning
Machine learning is a specialty within artificial intelligence (AI) and computer science that focuses on using data and algorithms to emulate the way humans learn, gradually improving its accuracy. IBM has a rich history with machine learning. Over the last two decades, technological advances in storage and processing power have enabled innovative products based on machine learning, such as the Netflix recommendation engine and self-driving cars. Machine learning is a fundamental element of the growing field of data science. Using statistical methods, algorithms are trained to make classifications or predictions and to find key information in data mining projects. Machine learning algorithms are primarily built using frameworks that speed up solution development, such as TensorFlow and PyTorch[5].

B. Machine learning benefits and applications
• Data Mining: Machine learning can be used for big data mining, which is the process of extracting important information from very large data sets. With this information, a data scientist can find new consumers, anticipate trends, and improve business operations[6].
• Predictive Analysis: Machine learning and data science can forecast future events, trends, and buyer behavior to some degree. These predictions allow organizations to make better choices about where to devote resources and how to respond to changes in the marketplace[6].
• Fraud Detection: Machine learning can identify fraudulent activity in financial transactions. As the world moves towards more digital transactions, it is increasingly important to identify and prevent fraud and to protect vulnerable system data[6].
• Consumer Segmentation: Machine learning can produce consumer segments based on demographic information and purchasing habits. With this information, organizations can run targeted marketing campaigns and improve customer service through AI chatbots or voice recognition for consumer calls[6].
• Web Page Improvement: Machine learning can optimize web pages for ranking in search engines. It can also track interaction on the page and determine the most engaging content for users. This analysis of the data can improve the layout and design of web pages and increase traffic to the site[6].
• Product Suggestions: Machine learning can recommend products to consumers based on
their purchase history and preferences. By providing personalized suggestions, organizations can drive sales and consumer loyalty[6].
• Marketing: Machine learning can improve the accuracy of real-time marketing predictions, such as predicting which consumers are most likely to buy a product or respond to a campaign. It can also be used to detect consumer trends and preferences[6].
• Finance: Machine learning can improve financial forecasting and risk analysis. Gaining a better understanding of financial risk can help organizations make more informed choices about where to invest money and how to protect their assets. Recommendation engines can indicate which trends are most likely to materialize[6].
• Healthcare: Machine learning can detect cancer cells or predict heart complications. It can also be used for personalized medicine, which adapts treatments to the genetic makeup of the patient. Coupled with automation, this could help patients monitor their conditions 24/7 with medical apps[6].
• Education: Machine learning can improve educational outcomes by personalizing instruction for each student. It can also be used to identify cheating and plagiarism by students[6].
• Retail Merchandising: Machine learning can improve inventory management and pricing strategies. It can also be used to detect consumer preferences and recommend products[6].
• Transportation: Machine learning can be used for traffic forecasting, route optimization, and vehicle routing. Organizations can save time and money on transportation costs by predicting traffic patterns and optimizing routes[6].
• Operations: Machine learning can optimize supply chains, predict assembly failures, and manage inventory levels. Organizations can run more efficiently and save money on operations with this information[6].
• Human Resources: Machine learning can identify high-performing employees and forecast employee turnover rates. It can also help develop training programs for employees and identify likely candidates for hire[6].

III. CLASSIFICATION IN MACHINE LEARNING
Classification is the process of recognizing and grouping objects or ideas into predetermined categories. In data management, classification makes it possible to divide and organize data according to the requirements set for various business or individual purposes.
In machine learning (ML), classification is used in predictive modeling to assign input data a class label. For example, an email security program tasked with detecting spam could use natural language processing (NLP) to classify emails as "spam" or "not spam"[7].

A. Data preprocessing
Data preprocessing is a step in the data mining and analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only can it contain errors and inconsistencies, but it is often incomplete and does not have a regular, uniform design.
Machines like to process neat, tidy information: they read data as 1s and 0s. Calculating with structured data, such as whole numbers and percentages, is therefore simple. However, unstructured data, such as text and images, must first be cleaned and formatted before analysis[8].

B. Steps of data preprocessing
1. Data Cleaning
Data cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers[9].

2. Data Integration
a. Data integration is one of the data preprocessing steps. It merges the data present in various sources into a single larger data store, such as a data warehouse[9].
b. Data integration is essential, especially when the goal is to solve a real-world scenario such as detecting the presence of nodules in computed tomography images. The only option is to integrate the images from multiple medical nodes to form a larger database[9].

3. Data Transformation
a. Once data cleaning has been carried out, the quality data must be consolidated into alternate forms by changing the value, structure, or format of the data using the data transformation strategies listed below[9]:
• Generalization
• Normalization
• Attribute Selection
• Aggregation

4. Data Reduction
a. The size of the dataset in a data warehouse can be too large to be handled by data analysis and data mining algorithms[9].
b. One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results[9].

IV. ML PREDICTIVE MODELS FOR LUNG CANCER DETECTION
Machine learning models can be used to predict many diseases. Using machine learning algorithms and techniques can help save people's lives by predicting illness quickly. Cancer is a condition in which some cells in the body grow out of control and spread to other parts of the body. Cancer can begin almost anywhere in the cells that make up the human body. This report focuses on cancer of the lungs. Today, the rate of lung cancer is increasing, and it can affect both men and women. Lung cancer cannot be prevented once it has developed, although it can be detected early and the risks reduced. In this report, lung cancer is predicted with a classification algorithm. There are many classification algorithms, such as Naive Bayes, Support Vector Machines, Decision Trees, and K-Nearest Neighbors. The objective of this report is to predict lung cancer quickly and easily using a specific algorithm.

A. Dataset Description
We searched different platforms for a suitable lung cancer dataset. The first dataset we found was small and not helpful because it did not contain many variables, so we would not have been able to split it into training and test sets. We then found our dataset on the Kaggle website, which hosts many useful datasets. The chosen dataset supports lung cancer prediction: it can help detect cancer risk at a lower cost and support appropriate decisions in the early stages of cancer.

The dataset has 309 rows and 16 variables, all containing patient information related to lung cancer. The variables are gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic disease, fatigue, allergy, wheezing, alcohol consuming, coughing, shortness of breath, swallowing difficulty, chest pain, and lung cancer.

B. Literature Review
The prediction strategy relies on a thorough investigation of statistical data, symptoms, and other factors related to lung cancer. Some general indicators of cancer disorders are nonclinical signs and risk factors. Initially, pre-diagnostic characteristics are collected through interaction with pathological, clinical, and medical oncologists.

We read two research papers on lung cancer prediction. In 2021, Amira Bibo Sallow, Dakhaz Abdallah, and Adnan Mohsin Abdulazeez addressed the problem that lung cancer has a poor prognosis and a high mortality rate. They aimed to advance the medical field through data analysis. They tried three different algorithms: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural Network (CNN). Their experiments showed that SVM gave the best result, at 95.56%.

C. Methodology
Before building the predictive machine learning model, data preprocessing is needed. Preprocessing is an important step in building the model. Collecting a dataset is a hard task, and a dataset can contain many mistakes and errors that must be handled before building the model. The first step is to check the shape of the dataset and display its information to understand it. Next, we check for null and duplicate values. Null values mean that some fields have no information. Duplicate rows are useless because they repeat the same information. We found no null values. We did find duplicate rows and handled them by dropping them. Then, we encoded the categorical data as numerical values because most machine learning models cannot accept categorical data. After that, we checked whether the dataset was balanced. In our case, the dataset was imbalanced. Imbalanced classification refers to a classification problem in which the class distribution is unequal. We first decided to use under-sampling, which deletes some rows from the majority class to obtain an equal distribution. After applying it, we noticed that every time we ran the code we got different accuracy results for the model. We therefore switched to another technique, resample(), which up-samples the dataset so that the minority class matches the majority class. After that, the dataset was divided into features and labels, which helped in separating the data for training and testing. Next, two techniques, normalization and standardization, were used to bring the data onto the same scale. Finally, the scores of the models (SVM, logistic regression, KNN, NB, and decision tree) were compared and the most suitable model was chosen.

D. Exploratory Analysis and Visualization
Data visualization is an important part of data science. Visualization tools make complex concepts easier for humans to understand. There are many benefits to using visualization, such as exploring data, monitoring data, and providing explanations about data. Visualization also helps technical and non-technical people understand and picture the data. Technical people try to gain insight from the data, find relations between different variables, and detect outliers in the preprocessing phase. Non-technical people try to understand the data and how it will help solve their problems.

In this project, the data was visualized using a bar plot to examine age and smoking. We used a line plot, which draws a line with the possibility of several semantic groupings. A histogram plot was used to see the distribution of different variables. Additionally, we used a pie plot to show the proportions of the imbalanced and balanced data. Here we provide an overview of the visualizations we used.

Fig. 1 is a bar plot, a graph that represents categories of data with rectangular bars whose lengths or heights are proportional to the values they represent[1]. Bar plots can be drawn horizontally or vertically. A bar chart shows comparisons between discrete categories. We used this plot to compare our two categories based on the age and smoking variables.

Fig. 1. barplot chart

Fig. 2 is a line plot, which represents the relationship between two variables, X and Y, on different axes. It is used here to show the relationship between the wheezing and anxiety variables.

Fig. 2. lineplot chart

Fig. 3 represents the distribution of symptoms in lung cancer patients.

Fig. 4 is a pie chart, a circular statistical plot that can display only one series of data. The area of the chart is the total percentage of the given data[4]. We used it to check the class percentages, and it showed that the data is imbalanced, so we applied a resampling strategy to handle it. After that, we visualized the data again (Fig. 5) to make sure it is balanced.

Fig. 4. imbalanced data pie chart

Fig. 5. balanced data pie chart

E. Results and Discussion
Of all the methods used, the best result was produced by SVM. We used the confusion matrix from the sklearn library to make sure that the model works properly. The model was then assessed on four measures: precision, recall, F-score, and support. The model with the highest F-score, which was 98%, was considered the best model.
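
The workflow described in the Methodology and in the results above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the team's actual code: the data is synthetic, the column names (GENDER, AGE, SMOKING, WHEEZING, LUNG_CANCER) and the YES/NO and M/F encodings are hypothetical stand-ins, and the SVC settings are arbitrary. The sketch follows the order given in the Methodology, which resamples before splitting. Resampling only the training portion is a common alternative that keeps copies of the same row out of the test set.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic stand-in for the survey data (column names are hypothetical).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "GENDER": rng.choice(["M", "F"], size=n),
    "AGE": rng.integers(30, 80, size=n),
    "SMOKING": rng.integers(1, 3, size=n),
    "WHEEZING": rng.integers(1, 3, size=n),
})
# Make the label depend on the features so there is something to learn,
# with one class much more frequent than the other.
risk = 0.04 * (df["AGE"] - 30) + 0.8 * (df["SMOKING"] - 1) + 0.8 * (df["WHEEZING"] - 1)
df["LUNG_CANCER"] = np.where(risk + rng.normal(0, 0.5, size=n) > 0.9, "YES", "NO")
df = pd.concat([df, df.head(10)], ignore_index=True)  # inject duplicate rows

# 1. Inspect and clean: shape, null count, duplicate count.
print(df.shape, int(df.isnull().sum().sum()), int(df.duplicated().sum()))
df = df.drop_duplicates().reset_index(drop=True)

# 2. Encode categorical columns as numbers.
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})

# 3. Up-sample the minority class until it matches the majority class.
counts = df["LUNG_CANCER"].value_counts()
majority = df[df["LUNG_CANCER"] == counts.idxmax()]
minority = df[df["LUNG_CANCER"] == counts.idxmin()]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up], ignore_index=True)

# 4. Split into features and labels, then into training and test sets.
X = balanced.drop(columns="LUNG_CANCER")
y = balanced["LUNG_CANCER"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5. Standardize the features (the scaler is fitted on the training set only).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 6. Train an SVM and evaluate it on the held-out test set.
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

The report printed by classification_report lists precision, recall, f1-score, and support for each class, which are the four measures named above. Fixing random_state in resample and train_test_split makes the sampling repeatable, so repeated runs give the same scores.
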

Fig. 3. histplot charts

Every model architecture has parameters called hyperparameters. Changing the hyperparameters can yield better results. When you evaluate the model and try to improve its results, the hyperparameters must be set manually.
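
To illustrate how much a manually chosen hyperparameter can matter, the sketch below trains the same SVM with several values of C and compares cross-validated accuracy, then lets scikit-learn's GridSearchCV repeat that kind of loop over a small grid. It is only an illustration under assumptions: the data comes from make_classification rather than the lung cancer dataset, and the grid values are arbitrary choices, not the ones used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in, not the lung cancer dataset).
X, y = make_classification(n_samples=300, n_features=15, n_informative=6, random_state=0)

# Manual approach: train the same model with several values of C and compare.
c_values = [0.01, 0.1, 1, 10, 100]
manual_scores = {}
for c in c_values:
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=c, gamma="scale"))
    manual_scores[c] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"C={c}: mean CV accuracy = {manual_scores[c]:.3f}")

# Automated approach: GridSearchCV tries every combination in the grid
# with the same 5-fold cross-validation and keeps the best one.
param_grid = {"svc__C": c_values, "svc__gamma": ["scale", 0.01, 0.1, 1]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the grid contains every value tried in the manual loop, the best grid score can never be lower than the best manual score.
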
A hyperparameter is not something the model can learn by itself. There is no way to know in advance that a particular hyperparameter value is good, so the model designer should try different values and choose the best one. The sklearn library provides a method called GridSearchCV(), to which we pass different hyperparameter values. It tries all of them and returns the combination that gives the best results. The SVC hyperparameters considered here are the kernel, regularization (C), and gamma. The kernel transforms the dataset into another representation in order to find the best hyperplane that divides the classes. C controls the margin, which is the space between the support vectors of the two classes. If C is small, the margin is maximized. If C is large, the margin is reduced. Gamma controls how far the influence of a single training point reaches. If gamma is high, only nearby points influence the decision boundary. If gamma is low, more distant points are taken into account as well.

CONCLUSION
In conclusion, this paper introduced the important steps for building a lung cancer prediction model, starting with data preprocessing, feature engineering, and planning and building the model. Lung cancer prediction is a supervised classification problem. We used multiple classification algorithms: SVM, Logistic Regression, KNN, NB, and Decision Tree. After evaluating the different ML algorithms to find the one that best fits the problem, SVM gave the highest accuracy, 97%. This project can help the medical sector predict lung cancer early, reduce the risk, and help people live longer.

ACKNOWLEDGMENT
We would like to extend our gratitude and appreciation to King Faisal University, which allowed us to study data science and improve our knowledge of data and machine learning (ML). Furthermore, we would like to thank our instructor, Dr. Nora alkhaldi, who generously shares her knowledge with her students.

REFERENCES
[1] Bar plot in Matplotlib (2021) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/bar-plot-in-matplotlib/ (Accessed: February 3, 2023).
[2] Line chart in Matplotlib - Python (2020) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/line-chart-in-matplotlib-python/ (Accessed: February 3, 2023).
[3] Plotting histogram in python using Matplotlib (2022) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/ (Accessed: February 3, 2023).
[4] Plot a pie chart in python using Matplotlib (2021) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/plot-a-pie-chart-in-python-using-matplotlib/ (Accessed: February 3, 2023).
[5] What is machine learning? (no date) IBM. Available at: https://www.ibm.com/topics/machine-learning (Accessed: February 23, 2023).
[6] Machine learning: Types: Benefits (2022) Adservio. Adservio team. Available at: https://www.adservio.fr/post/machine-learning-types-benefits (Accessed: February 23, 2023).
[7] What is classification? - definition from Techopedia (no date) Techopedia.com. Available at: https://www.techopedia.com/definition/13779/classification (Accessed: February 23, 2023).
[8] What is data preprocessing & what are the steps involved? (2021) MonkeyLearn Blog. Available at: https://monkeylearn.com/blog/data-preprocessing/ (Accessed: February 23, 2023).
[9] Data preprocessing in machine learning [steps & techniques] (no date) Data Preprocessing in Machine Learning [Steps & Techniques]. Available at: https://www.v7labs.com/blog/data-preprocessing-guide#h3 (Accessed: February 23, 2023).
