Cancer Prediction System
Cancer has been characterized as a heterogeneous disease consisting of many different subtypes.
Early diagnosis and prognosis of a cancer type have become a necessity in cancer research
because they support the clinical management of patients. Many research teams have studied the
application of machine learning (ML) and deep learning (DL) methods in biomedicine and
bioinformatics to classify cancer patients into high- or low-risk groups. These techniques have
therefore been used to model the progression and treatment of cancer. ML tools are also
important because they can detect key features in complex datasets. Many of these methods,
including artificial neural networks (ANNs), support vector machines (SVMs), and decision trees
(DTs), are widely used to develop predictive models for cancer. While ML methods can improve
our understanding of cancer progression, an adequate level of validation is needed before they
can be considered for everyday clinical practice. In this study, the ML and DL approaches used
in modeling cancer progression are reviewed. The predictive models discussed are mostly based
on specific supervised ML techniques and on different input features and data samples.
CHAPTER – 1
INTRODUCTION
Cancer is one of the most common diseases in people aged 41 to 60 and over 60, affecting about
10 percent of all people. In recent years the rate has kept growing, and data show that the
survival rate is 88 percent five years after diagnosis and 80 percent ten years after
diagnosis. Early prediction of cancer has improved considerably, and the cancer death rate has
fallen by 39 percent since 1989. Because the symptoms of cancer vary widely, patients are
frequently subjected to a barrage of assessments, including but not limited to mammography,
ultrasound, and surgical biopsy, to check their likelihood of being diagnosed with cancer.
Surgical biopsy is the most indicative of these procedures; it involves extracting sample cells
or tissue for examination. Numerical features, such as radius, texture, perimeter, and area,
can be measured from microscopic images. Data obtained from fine needle aspiration (FNA) are
then analysed together with imaging data to predict the probability that the patient has a
malignant tumour. An automated system would be highly beneficial in this situation. It could
speed up the process and improve the accuracy of the doctor's predictions. In addition, if it
is supported by a large dataset and performs reliably, it could remove the need for patients to
undergo many other tests, such as mammography, ultrasound, and MRI, which subject patients to
considerable discomfort and radiation. Overall, early prediction remains one of the vital
factors in the follow-up process. Data mining and classification techniques can help to reduce
the number of false positive and false negative diagnoses. As a result, knowledge discovery in
databases has become a preferred tool in medical practice. The death rate of
oral cancer is high (about 200,000 deaths annually worldwide and 46,000 annually in India).
Research has found that developing countries have a higher rate of oral cancer than developed
countries. Good oral care is important for maintaining healthy teeth, gums, and tongue. Oral
problems, including bad breath, temporomandibular disorder (TMD), tooth decay, and thrush, are
all treatable with proper treatment and care. Oral cancer can affect any area of the mouth,
including the lips, gum tissue, tongue, cheeks, and teeth. It accounts for a large number of
deaths globally. Deaths caused by cancer can be reduced by detecting it at an early stage and
giving proper treatment. The approach used here is based on data analysis. The dataset consists
of patients' habits, symptoms, and medical history. Data mining is a computational process for
discovering patterns in large datasets. It combines techniques from database systems,
statistics, artificial intelligence, and machine learning, and it helps to extract and
transform the data in a dataset. Data mining plays an important role in the prediction of
cancer, and data mining methods are mainly used to develop predictive models in medicine.
Several data mining methods are tested, and the model that gives the most accurate results is
chosen. The main evaluation metrics for the classification model are sensitivity, specificity,
accuracy, error rate, true positive rate (TPR), and false positive rate (FPR). Oral cancer
is located in the oral cavity. It can originate in any of the oral tissues or spread by
extension from a neighboring anatomical structure such as the nasal cavity. Oral cancer is a
global issue that poses challenges in both diagnosis and treatment. About 50% of cases are
detected at an advanced stage, leading to a low 5-year survival rate. Visual assessments are
conventionally employed in dental clinics for the detection of oral cancer, but their precision
depends heavily on the proficiency and knowledge of the practitioner, which can differ
significantly. Furthermore, the subjective character of these tests can lead to inconsistent
diagnoses, increasing the probability of either underdiagnosis or overdiagnosis.
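The evaluation metrics named in this chapter (sensitivity, specificity, accuracy, error rate, TPR, and FPR) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts.

    tp/fn: positive cases predicted correctly/incorrectly.
    tn/fp: negative cases predicted correctly/incorrectly.
    Assumes at least one positive and one negative case (no zero denominators).
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # identical to the true positive rate (TPR)
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),           # false positive rate = 1 - specificity
        "accuracy": (tp + tn) / total,   # correct predictions / all instances
        "error_rate": (fp + fn) / total, # 1 - accuracy
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters in screening settings, because a classifier can reach high accuracy on an imbalanced dataset while still missing most positive cases.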
AIM OF PROJECT
Machine learning platforms are now being introduced into modern oncological practice for the
classification and prediction of patient outcomes. To determine the current status of these
learning models as adjunctive decision-making tools in oral cavity cancer management, this
systematic review aims to summarize the accuracy of machine-learning-based models for disease
outcomes. Oral cancer is usually recognized at an advanced stage, either because of ignorance
or because of a lack of medical facilities. This is more pronounced in middle- and low-income
countries, where people are deprived of medical facilities. In such cases, mortality and
morbidity rates are high. To avoid that, early detection of oral cancer plays a very important
role.
The rich set of AI techniques and tools provides cost-effective methods for the detection of
oral cancer. These can serve doctors as an expert tool and are also helpful in further
investigations. Evolving computer algorithms are already providing strong solutions for
diagnosing other types of cancer and other diseases.
SCOPE
Classifier accuracy measures how well the classifier assigns cases to the correct category: it
is the number of correct predictions divided by the total number of instances. Accuracy alone
is not an optimal basis for comparing different classifiers, but it gives a useful overview of
performance. When oral cancer is found in its early stages and treated successfully, many lives
can be saved. Histological analysis of an oral cavity tissue sample is the accepted method in
medicine for identifying oral cancer. This method requires more time and is more invasive than
obtaining a brush sample and then performing a cytological analysis. For a better prognosis,
treatment plan, and chance of survival, early diagnosis is essential. Therefore, this paper
proposes deep learning techniques for the early detection of oral cancer, which can eventually
contribute to its prevention. Deep learning techniques enable early detection of disease and
support precision medicine. According to recent research reports, deep learning has
significantly advanced the extraction of data and the interpretation of crucial information in
medical imaging. It has the potential to identify oral cancer in a cost-efficient,
non-invasive, and effective way, with substantial clinical implications.
CHAPTER – 2
Kent (2007) used genetic programming, a technique for solving complex problems, to evolve a
program for detecting oral cancer from patient samples. The patient data consist of details of
habits, lifestyle, and medical history. Although the dataset contains few positive samples, the
method is able to provide accurate results and is competitive with earlier techniques.
Nahar (2011) discussed measures for preventing particular types of cancer. To identify the
relevant factors, a dataset was first collected and then analysed with three association rule
algorithms: (1) Apriori, (2) Predictive Apriori, and (3) Tertius. These algorithms help to
discover the significant factors associated with a specific type of cancer. The analysis found
that Apriori is the most efficient of the three association algorithms for deriving preventive
measures.
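To make the idea of association rule mining concrete, the following is a minimal Apriori-style sketch: it computes the support of itemsets, the confidence of a rule, and searches level by level for frequent itemsets. It is not the implementation used by Nahar; the function names, the risk-factor items, and the transactions are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built from frequent (k-1)-itemsets."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                   and support(c, transactions) >= min_support]
    return frequent


# Hypothetical patient records (sets of observed factors), for illustration only.
transactions = [
    frozenset({"tobacco", "alcohol", "lesion"}),
    frozenset({"tobacco", "lesion"}),
    frozenset({"alcohol"}),
    frozenset({"tobacco", "alcohol"}),
    frozenset({"tobacco", "lesion"}),
]

freq = frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
print("confidence(tobacco -> lesion):",
      round(confidence({"tobacco"}, {"lesion"}, transactions), 2))
```

The prune step is what distinguishes Apriori from brute-force enumeration: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without counting.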
Chuang et al. (2014) studied DNA repair genes. They analysed a single nucleotide polymorphism
(SNP) dataset of oral cancer patients (238 samples). A support vector machine was used to
conduct the experiments, and they found that holdout validation (a single train/test split)
performed better than tenfold cross-validation (rotation estimation). The accuracy was up to
64.2%. Gadewal and Zingde took this work forward by adding 132 genes to the 238 samples to
form a 374-gene database, with the aim of enabling fast retrieval of updated information.
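The difference between the two validation schemes compared by Chuang et al. can be sketched with index splits alone. In the sketch below, the sample count of 238 comes from the text, while the 30% test fraction, the random seed, and the function names are assumptions made for illustration.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Single train/test split: each sample is used for either training or testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]          # (train indices, test indices)


def k_fold_splits(n_samples, k=10, seed=0):
    """k-fold cross-validation: every sample is tested exactly once across k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


train, test = holdout_split(238)
print("holdout:", len(train), "train /", len(test), "test")

folds = list(k_fold_splits(238, k=10))
for fold_no, (tr, te) in enumerate(folds, start=1):
    print(f"fold {fold_no}: {len(tr)} train / {len(te)} test")
```

A holdout estimate depends on one particular split, so on a dataset of this size it can look better or worse than the tenfold average purely by chance; this is one reason the two estimates in such studies may disagree.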
Kaladhar et al. (2011) compared several classification algorithms (CART, Random Forest, LMT,
and Naïve Bayes). The classifiers were evaluated using rotation estimation (cross-validation)
and the training dataset. Among these algorithms, Random Forest was found to be the most
effective classification technique.
Sankaranarayanan (1995) studied aspects of oral cancer in India. Causes such as chewing
tobacco, including tobacco stems and leaves, are examined in detail. According to the author's
study, the peak age of occurrence is a decade earlier than that reported in the Western
literature. Only up to 10-15% of cases are confined to the site where the cancer originated.
Anuradha & Sankaranarayanan (2012) carried out a survey of the major methods adopted by
researchers to detect oral cancer at an initial stage. The authors compared the methods to
determine which provides the most accurate results.
Milovic (2015) used pattern recognition, which is also a data mining technique, to detect
cancer at an initial stage so that physicians can treat patients earlier.
Gadewal and Zingde (2013) enhanced the oral cancer gene database to include 374 genes by
adding 132 gene entries to the 238 samples given by Chuang, to enable fast retrieval of updated
information.
Gupta et al. (2012) used an artificial neural network (ANN). The accuracy of the ANN was 93.68%
on the training dataset and 55.5% on the validation dataset.
CHAPTER – 3
PROPOSED WORK
PROPOSED SYSTEM
The proposed system is a machine learning pipeline in which the cancer dataset is loaded,
features are extracted, and a classification model is trained and then used to predict whether
a tumour is malignant or benign. A benign tumour does not invade its surrounding tissue or
spread around the body, whereas a malignant tumour may invade its surrounding tissue or spread
around the body. This section gives a detailed description of the dataset and the machine
learning algorithms used in this study. To start the oral cancer stage prediction process, it
was necessary to learn more about the medical terms and procedures from dental doctors, so a
discussion was held with a few dentists to clarify the concepts of oral cancer.
Decision Tree
A decision tree is a flowchart-like structure used to reach a decision based on a sequence of
conditions. Decision trees are mostly used for classification problems in supervised machine
learning, and they cover both classification and regression trees, together known as CART
(Classification and Regression Trees). Each internal node represents a test on an attribute,
each branch represents an outcome of that test, and each leaf (terminal) node holds a class
label. The root node is the topmost node in the tree. Decision trees can also be used in
decision analysis to represent decisions and decision making; as the name indicates, they use a
tree-like model of decisions.
Pros of decision trees: they are easy to learn, create, interpret, and visualize. They
implicitly perform feature selection. They work with both numerical and categorical data and
can handle harder problems.
Cons of decision trees: overly complex trees do not generalize well, a problem known as
overfitting. Decision trees can be unstable, because small variations in the data may result in
a completely different tree being generated. They can consume more memory, and preparing the
data can be somewhat difficult.
Common splitting criteria used by decision tree algorithms are the Gini index, chi-square,
information gain, and reduction in variance.
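Two of the splitting criteria listed above, the Gini index and information gain, can be computed directly from the class labels on each side of a candidate split. The sketch below uses the standard definitions; the function names and the label lists are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity reduction from splitting parent into left and right children.

    With impurity=entropy this is the information gain; with impurity=gini
    it is the Gini gain used by CART-style trees.
    """
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with 5 malignant (M) and 5 benign (B) cases, for illustration only.
parent = ["M"] * 5 + ["B"] * 5
left = ["M"] * 4 + ["B"]        # cases that satisfy the candidate test
right = ["M"] + ["B"] * 4       # cases that do not

print("Gini of parent:", gini(parent))
print("Gini gain:", round(split_gain(parent, left, right, gini), 4))
print("Information gain:", round(split_gain(parent, left, right, entropy), 4))
```

A tree-building algorithm evaluates this gain for every candidate test at a node and keeps the test with the largest value, then repeats the process on each child.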
K-Nearest Neighbors (KNN)
KNN is one of the simplest classification algorithms in machine learning. It can be applied to
both small and large datasets, and despite its simplicity it can give accurate results on
fairly complex problems. KNN can be used for both classification and regression, although in
industry it is mostly used for classification. An algorithm is usually judged on three aspects:
ease of interpretation, calculation time, and predictive power; KNN is commonly chosen for its
ease of interpretation and low calculation time. KNN stores the labelled training objects and,
to classify a new object, calculates the distance between it and every training object. Various
distance measures can be used; the most common is Euclidean distance. The distances are sorted
in ascending order, the k nearest neighbors are selected, and the most frequent class among
them is returned as the prediction. The closer a new object lies to the members of a class, the
more likely it is to be assigned to that class. KNN is also used for regression tasks, where
the prediction is the average of the target values of the k nearest objects rather than their
most frequent class.
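The KNN procedure can be sketched in a few lines: measure the Euclidean distance from the query to every training object, keep the k nearest, and take a majority vote (classification) or an average (regression). The function names and the two-feature data points below are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def k_nearest(train_points, train_values, query, k):
    """Return the values attached to the k training points closest to query."""
    ranked = sorted(zip(train_points, train_values),
                    key=lambda pair: dist(pair[0], query))
    return [value for _, value in ranked[:k]]


def knn_predict(train_points, train_labels, query, k=3):
    """Classification: most frequent label among the k nearest neighbors."""
    return Counter(k_nearest(train_points, train_labels, query, k)).most_common(1)[0][0]


def knn_regress(train_points, train_targets, query, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    nearest = k_nearest(train_points, train_targets, query, k)
    return sum(nearest) / len(nearest)


# Hypothetical two-feature samples, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (5.0, 5.0), (5.2, 4.7), (4.8, 5.3)]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(points, labels, (1.5, 1.5), k=3))
print(knn_predict(points, labels, (5.5, 5.0), k=3))
```

Because every prediction scans the whole training set, the cost of a query grows with the number of stored samples; an odd k is often chosen for two-class problems to avoid tied votes.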
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms after linear
regression. In many ways, linear regression and logistic regression are similar. The biggest
difference lies in what they are used for: linear regression is used to predict or forecast
continuous values, whereas logistic regression is used for classification tasks. Logistic
regression is of three types: binary, multinomial, and ordinal. To predict the class of a data
point, a threshold is set, and the point is assigned to a class by comparing its estimated
probability with that threshold. The decision boundary can be linear or non-linear; to make the
decision boundary more complex, the polynomial order of the features is increased. Logistic
regression is a linear method, but its predictions are transformed using the logistic function.
It is a simple algorithm that can be used for binary and multiclass classification tasks.
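The prediction step of binary logistic regression, a linear score passed through the logistic function and then compared with a threshold, can be sketched as follows. The weights and bias are hypothetical placeholders, not coefficients fitted in this study, and the 0.5 threshold is a common default that is assumed here.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Estimated probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear score
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Assign class 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients, for illustration only.
weights, bias = [0.8, -0.5], -1.0

for sample in ([3.0, 1.0], [0.5, 2.0]):
    p = predict_proba(sample, weights, bias)
    print(sample, round(p, 3), predict_class(sample, weights, bias))
```

Lowering the threshold raises sensitivity at the cost of more false positives, which is often the preferred trade-off in a screening application.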
Support Vector Machine
Support Vector Machine (SVM) is a relatively simple machine learning algorithm that achieves
good accuracy with modest computational power. SVM can be used for both classification and
regression, but its main use is building classification models. It does this by finding a
hyperplane in the n-dimensional feature space (n being the number of features) that separates
the data points by class. Many hyperplanes may separate the data points, and data points
falling on either side of the chosen hyperplane are assigned to different classes. Hyperplanes
are therefore also called decision boundaries. Support vectors are the data points that lie
closest to the decision boundary, and they determine its position and orientation. The margin
between the classes is maximized using the support vectors; removing a support vector changes
the position of the hyperplane. These principles are used to build the SVM model.
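The role of the hyperplane and the margin can be illustrated for a linear SVM whose weights are already known. The sketch below does not train a model; the weight vector, bias, and points are hypothetical values chosen so that two points lie exactly on the margin.

```python
from math import sqrt


def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the decision boundary w.x + b = 0."""
    return abs(decision_value(x, w, b)) / sqrt(sum(wi * wi for wi in w))


def margin_width(w):
    """Width of the band between the hyperplanes w.x + b = -1 and w.x + b = +1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical learned parameters and points, for illustration only.
w, b = [1.0, 1.0], -3.0
points = [(1.0, 1.0), (2.0, 2.0), (0.0, 0.5), (4.0, 3.0)]

for p in points:
    value = decision_value(p, w, b)
    # Points with |w.x + b| == 1 sit on the margin; these are the support vectors.
    tag = "support vector" if abs(abs(value) - 1.0) < 1e-9 else ""
    print(p, predict(p, w, b), round(distance_to_hyperplane(p, w, b), 3), tag)
print("margin width:", round(margin_width(w), 3))
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training points stay on the correct side; only the support vectors influence the result.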
Gaussian Naive Bayes
Gaussian Naive Bayes is a variant of the Naive Bayes method for continuous attributes, in which
the features are assumed to follow a Gaussian (normal) distribution within each class. In the
Sklearn library, Gaussian Naive Bayes is a classification algorithm based on the Naive Bayes
algorithm that works on continuous, normally distributed features. A basic understanding of the
principles on which Gaussian Naive Bayes works is useful before applying it.
For example, suppose we have two classes that need to be separated efficiently. Classes can
have multiple features. Using only a single feature to classify them may result in some
overlap, as shown in the figure below. So, we keep increasing the number of features until the
classes can be properly separated.
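The principle behind Gaussian Naive Bayes can be sketched from scratch: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log-posterior under the assumption that features are independent within a class. This is a simplified illustration rather than the Sklearn implementation; the function names, the small variance floor, and the data are assumptions.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean, and per-feature variance of each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor (assumed value) keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density N(mean, var) evaluated at x."""
    return -0.5 * log(2 * pi * var) - (x - mean) ** 2 / (2 * var)


def predict_gaussian_nb(model, row):
    """Class with the largest log prior plus summed per-feature log likelihoods."""
    scores = {
        label: log(prior) + sum(log_gaussian(x, m, v)
                                for x, m, v in zip(row, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical two-feature training data, for illustration only.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [4.1, 5.1]))
```

Summing log densities over features is where the "naive" independence assumption enters: each additional feature contributes its own term, which is why adding informative features can separate classes that overlap on any single one.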
WORKFLOW
REQUIREMENT SPECIFICATION
The proposed software runs effectively on a computing system that meets the minimum
requirements. Not all of the hardware requirements need to be newly provisioned, since most of
them are already met by the client's existing networked machines. The main need is therefore to
install suitable hardware for the software where it is missing.
SOFTWARE REQUIREMENTS
1. Django: Django has been utilized to develop the Backend Part of our Web Application.
2. Sklearn: Used for implementing the Machine Learning Algorithms like Decision
5. Web Scrapers: Web scrapers are used to scrape the Web and collect news articles for
Machine Learning analysis and database storage.
CHAPTER – 4
This study explains and compares the findings of various machine learning and deep learning
techniques applied to cancer prognosis. Specifically, several trends have been identified in
the types of machine learning techniques used, the types of training data incorporated, the
kinds of endpoint predictions made, the types of cancer investigated, and the overall
performance of these methods in predicting cancer or its outcome. While the ML methods are
common, it is clear that a broader variety of alternative learning approaches is also being
used to predict at least three different cancer types. The integration of multidimensional
heterogeneous data, combined with the application of different analytical and classification
methods, can offer a promising tool for detecting cancer and predicting the course of the
disease. In future work, using the proposed framework, we would like to apply other
state-of-the-art machine learning algorithms and feature extraction methods to allow a more
intensive comparative analysis.