Data Science Lung Cancer Prediction M2
Abstract— Machine learning-based lung cancer prediction models are provided to
help doctors and medical centers diagnose lung cancer. These models use data
collected from previously diagnosed patients, such as lung images, vital signs,
and the results of other medical tests used to diagnose lung cancer. Prediction
models support decision-making, risk reduction, and damage limitation depending
on each patient's case. In this report, we present and discuss the results of
the prediction model built by the team members, and we highlight the techniques
and algorithms used.

Keywords—Machine Learning, Artificial Intelligence, Prediction

I. INTRODUCTION
Machine learning is a branch of artificial intelligence and computer science
that focuses on using data and algorithms to emulate the way humans learn, with
the goal of steadily improving accuracy. Nowadays, machine learning models can
be used to predict many diseases. Predicting an illness early with machine
learning algorithms and techniques can help save patients' lives. Cancer is a
condition in which some cells in the body grow out of control and spread to
other parts of the body. Cancer can begin practically anywhere in the cells that
make up the human body. This report focuses on cancer of the lungs. The rate of
lung cancer has been increasing, and it can affect both men and women. Lung
cancer cannot be prevented once it has developed, although it can be detected
early and the risks reduced. In this report, lung cancer is predicted with a
classification algorithm. There are many classification algorithms, such as
Naïve Bayes, Support Vector Machines, decision trees, and K-nearest neighbors.
The objective of this report is to predict lung cancer quickly and easily, as
early as possible, using a specific algorithm.

II. EASE OF USE
A. What is machine learning
Machine learning is a specialty in the field of artificial intelligence (AI) and
computer science that focuses on using data and algorithms to emulate the way
humans learn, gradually honing their accuracy. IBM has a rich history with
machine learning. Over the last two decades, technological advances in storage
and processing power have enabled certain innovative products based on machine
learning, such as the Netflix recommendation engine and self-driving cars.
Machine learning is a fundamental element of the growing field of data science.
Using statistical procedures, algorithms are trained to make classifications or
predictions and to find key information in data mining projects. Machine
learning algorithms are primarily built using frameworks that speed up solution
development, such as TensorFlow and PyTorch[5].

B. Machine learning benefits and applications

Data Mining: Machine learning can be used for big data mining, which is the
process of extracting important information from very large data sets. With this
information, a data scientist can find new customers, anticipate trends, and
improve business operations[6].

Predictive Analysis: Machine learning and data science can forecast future
events, trends, and buyer behavior to some degree. These predictions allow
organizations to make better choices about where to devote resources and how to
respond to changes in the marketplace[6].

Fraud detection: Machine learning can identify fraudulent activity in financial
transactions. As the world moves towards more digital transactions, it is
increasingly important to identify and prevent fraud and to protect vulnerable
system data[6].

Consumer segmentation: Machine learning can produce consumer segments based on
demographic information and purchasing habits. With this information,
organizations can run targeted marketing campaigns and improve customer service,
for example through AI chatbots or voice recognition for customer calls[6].

Web Page Improvement: Machine learning can optimize web pages for ranking in
search engines. It can also track interaction on the page and identify the most
engaging content for users. This analysis of data can improve the visualization
and design of web pages and increase traffic to the site[6].

Product Suggestions: Machine learning can recommend products to consumers based on
Retail Merchandising: Machine learning can improve inventory management and
pricing tactics. It can also be used to detect consumer preferences and
recommend products[6].

Transportation: Machine learning can be used for traffic forecasting, route
optimization, and vehicle routing. Organizations can save time and money on
transportation costs by predicting traffic patterns and optimizing routes[6].

Operations: Machine learning can optimize supply chains, predict assembly
failures, and manage inventory levels. Organizations can run more efficiently
and save money on operations with this information[6].

Human resources: Machine learning can spot high-performing employees and
forecast employee turnover rates. It can also help develop training programs for
employees and identify likely candidates for hire[6].

III. CLASSIFICATION IN MACHINE LEARNING
Classification is the process of detecting and grouping objects or ideas into
predetermined categories. In data management, classification makes it possible
to divide and categorize data according to the requirements set for various
business or individual purposes.
In machine learning (ML), classification is used in predictive modeling to
assign input data a class

a. Data integration is one of the data pre-processing steps. It merges the data
present in various sources into a single larger data repository, such as a data
warehouse[9].
b. Data integration is essential, especially since our goal is to solve a
real-world scenario such as identifying the existence of nodules from computed
tomography images. The only option is to integrate the images from various
medical nodes to form a larger database[9].

3. Data Transformation
a. Once data cleansing has been carried out, we must consolidate the quality
data into alternative forms by changing the value, structure, or format of the
data using the data transformation strategies listed below[9]:
Generalization
Normalization
Attribute Selection
Aggregation

4. Data Reduction
a. The size of the data set in a data warehouse can be too large to be handled
by data mining and data analysis algorithms[9].
b. A viable solution is to obtain a reduced representation of the data set that
is much smaller in volume but produces the same quality of analytical
results[9].

IV. ML PREDICTIVE MODELS FOR LUNG CANCER DETECTION
Machine learning models can be used to predict many diseases. Predicting an
illness early with machine learning algorithms and techniques can help save
patients' lives. Cancer is a condition in which some cells in the body grow out
of control and spread to other parts of the body. Cancer can begin practically
anywhere in the cells that make up the human body. This report focuses on cancer
of the lungs. The rate of lung cancer has been increasing, and it can affect
both men and women. Lung cancer cannot be prevented once it has developed,
although it can be detected early and the risks reduced. In this report, lung
cancer is predicted with a classification algorithm. There are many
classification algorithms, such as Naive Bayes, Support Vector Machines,
decision trees, and K-nearest neighbors. The objective of this report is to
predict lung cancer quickly and easily, as early as possible, using a specific
algorithm.

A. Dataset Description
We searched different platforms for a suitable lung cancer dataset. The first
dataset we found was small and not helpful because it did not contain many
variables, so we would not have been able to split it into train and test sets.
We then found our dataset on the Kaggle website, which hosts many helpful
datasets. The chosen dataset supports lung cancer prediction: it helps in
detecting cancer risk at a lower cost and in taking appropriate decisions in the
early stages of cancer.

The data set has 309 rows and 16 variables describing patients. The variables
are gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic
disease, fatigue, allergy, wheezing, alcohol consuming, coughing, shortness of
breath, swallowing difficulty, chest pain, and lung cancer.

B. Literature Review
The prediction strategy relies on a thorough investigation of statistical data,
symptoms, and other factors related to lung cancer. Some general indicators of
cancer disorders are nonclinical signs and risk factors. Initially,
pre-diagnostic characteristics are collected through interaction with
pathological, clinical, and medical oncologists.

We read two research papers on lung cancer prediction. In 2021, Amira Bibo
Sallow, Dakhaz Abdallah, and Adnan Mohsin Abdulazeez discussed lung cancer,
which has a poor prognosis and a high mortality rate. They set out to support
the medical field through data analysis. They tried three different algorithms:
Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural
Network (CNN). Their experimental results show that SVM gives the best result,
with 95.56%.

C. Methodology
Before building the predictive machine learning model, data preprocessing is
needed. Preprocessing is an important step in building the model. Collecting a
dataset is a hard task, and a dataset can contain many mistakes and errors that
must be handled before building the model. The first step is to check the shape
of the dataset and display its information to understand it. Next, we check for
null and duplicate values. Null values mean that some fields do not have
information. Duplicate rows are useless because they repeat information that
other rows already hold. We found no null values. We did find duplicate rows,
and we handled them by dropping them. Then, we encoded the categorical data as
numerical values because most machine learning models cannot accept categorical
data. After that, we checked whether the dataset is balanced or imbalanced. In
our case, the dataset is imbalanced. Imbalanced classification refers to a
classification problem in which the class distribution is unequal. We first
decided to use under-sampling, which deletes some rows from the majority class
to get an equal distribution. After applying it, we noticed that every time we
ran the code, we got different accuracy results for the model. We then decided
to use another technique, resample(), which up-samples the dataset so that the
minority class matches the majority class. After that, the dataset was divided
into features and labels, and the data was separated for training and testing.
Next, two techniques, normalization and standardization, were used to bring the
data onto the same scale. Finally, scores were compared for each of the models,
such as SVM, logistic regression, KNN, NB, and decision trees, and the proper
model was chosen.

D. Exploratory Analysis and Visualization
Data visualization is an important part of data science. Visualization tools
make complex concepts easier for humans to understand. There are many benefits
to using visualization, such as exploring data, monitoring data, and providing
explanations about data.
Also, visualization helps technical and non-technical people to understand and
picture the data. Technical people try to get insight from the data, find
relations between different variables, and detect outliers in the preprocessing
phase. Non-technical people try to understand the data and how it will help in
solving their problems.

In this project, data was visualized using a bar plot to examine age and
smoking. We used a line plot, which can draw lines with several semantic
groupings. Also, a histogram plot was used to show the distribution of different
variables. Additionally, we used a pie plot to show the class proportions of the
imbalanced and balanced data. Here we provide an overview of the visualizations
we used.

Fig. 4 is a pie chart, a circular statistical plot that can display only one
series of data. The area of the chart is the total percentage of the given
data[4]. We used it to check the class percentages, and it showed that the data
is imbalanced. So, we applied a resampling strategy to handle it. After that, we
visualized the data again (Fig. 5) to make sure it is balanced.
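The balance check and up-sampling step behind Fig. 4 and Fig. 5 can be sketched
as follows. This is a minimal illustration using only the Python standard
library, not the code used in this project: the helper names
`class_percentages` and `upsample_minority`, the seed value, and the toy rows
are all assumptions made for the example. It computes the class percentages
that a pie chart would display, then randomly duplicates minority-class rows
until the classes match, with a fixed seed so that repeated runs produce the
same sample.

```python
import random
from collections import Counter


def class_percentages(labels):
    """Return each class's share of the rows, in percent (what a pie chart shows)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def upsample_minority(rows, label_of, seed=42):
    """Duplicate rows of the smaller classes at random until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed (assumed value) keeps runs reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy rows (hypothetical values, not the Kaggle data): (age, smoking, lung_cancer)
rows = [(63, 1, "YES")] * 8 + [(45, 0, "NO")] * 2

before = class_percentages([r[2] for r in rows])
balanced = upsample_minority(rows, label_of=lambda r: r[2])
after = class_percentages([r[2] for r in balanced])

print("before:", before)
print("after: ", after)
```

With scikit-learn, `sklearn.utils.resample` called with `replace=True`,
`n_samples` set to the majority-class size, and a fixed `random_state` performs
the same kind of up-sampling, and the fixed seed makes the sampled rows
reproducible across runs.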
CONCLUSION
In the end, this paper introduces all the important steps to build a lung cancer
prediction model. Starting with data

REFERENCES
[1] Bar plot in Matplotlib (2021) GeeksforGeeks. GeeksforGeeks.
[8] What is data preprocessing & what are the steps involved? (2021)
MonkeyLearn Blog. Available at: https://monkeylearn.com/blog/data-preprocessing/
(Accessed: February 23, 2023).