Cancer Prediction Using Data Mining Techniques

A Field Project Report submitted in partial fulfilment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING

VIGNAN’S FOUNDATION FOR SCIENCE, TECHNOLOGY AND RESEARCH
CERTIFICATE
This is to certify that the Field Project Report entitled “Cancer Prediction Using Data
Mining Techniques” that is being submitted by R. Preethi Priya (171FA04363), Y. Kavya
(171FA04377), G. Vagdevi (171FA04394) and K. Supriya (171FA04417) in partial fulfilment
for the award of the B.Tech degree in Computer Science and Engineering to the Vignan’s
Foundation for Science, Technology and Research, Deemed to be University, is a record of
bonafide work carried out by them under my supervision.

Ms. B. Suvarna                                        Dr. D. Venkatesulu
Assistant Professor                                   HOD, CSE

External Examiner
DECLARATION
We hereby declare that the project entitled “Cancer Prediction Using Data Mining
Techniques”, submitted to the Department of Computer Science and Engineering, is our
original work. The project has not formed the basis for the award of any degree,
associateship, fellowship or any other similar title, and no part of it has been
published or sent for publication at the time of submission.
By
Y.Kavya (171FA04377)
G.Vagdevi (171FA04394)
K.Supriya (171FA04417)
Date:22-07-2020
ACKNOWLEDGEMENT
We are very grateful to our beloved Chairman, Dr. Lavu Rathaiah, and Vice Chairman,
Mr. Lavu Krishna Devarayalu, for their love and care.
It is our pleasure to extend our sincere thanks to the Vice-Chancellor, Dr. M. Y. S. Prasad,
and the Dean of Engineering & Management, Dr. V. Madhusudhan Rao, for providing us an
opportunity to pursue our academics in VFSTR.
It is a great pleasure for us to express our sincere thanks to Dr. D. Venkatesulu, HOD, CSE,
VFSTR, for providing us an opportunity to do our Mini Project.
We feel it our responsibility to thank Dr. U. Sri Lakshmi, under whose valuable guidance
the project came out successfully at each stage.
We extend our wholehearted gratitude to all the faculty members of the Department of
Computer Science and Engineering who helped us in our academics throughout the course.
Finally, we wish to express thanks to our family members for their love, affection,
forbearance and cheerful dispositions, which were vital for sustaining the effort
required to complete this work.
With sincere regards,
By
Y.Kavya (171FA04377)
G.Vagdevi (171FA04394)
K.Supriya (171FA04417)
TABLE OF CONTENTS
Content

Introduction
KNN
Handling Dataset
Algorithm
Output
References
OUTCOMES AND OBSERVATIONS:
• KNN (k-nearest neighbour) gives the best accuracy in predicting cancer, whereas the
remaining algorithms give poorer results.
• An instance refers to a single observation of data items in the data set.
Cancer Disease Prediction System
Abstract
Cancer is one of the leading causes of death for both men and women. Early detection of
cancer can help in curing the disease completely, so the need for techniques that detect
cancer nodules at an early stage is increasing. Lung cancer in particular is a disease
that is commonly misdiagnosed. In this work, we measure the accuracy achieved on a
cancer dataset using the KNN algorithm.
Introduction
It may have happened many times that you or someone close to you needs a doctor’s help
immediately, but one is not available for some reason. The Cancer Disease Prediction
application is an end-user support and online consultation project. Here, we propose a web
application that allows users to get instant guidance on cancer through an intelligent
online system. The application is fed with various details and the cancer diseases
associated with those details. It allows users to share their health-related issues for
cancer prediction, then processes these user-specific details to check for the various
illnesses that could be associated with them. We use data mining techniques to guess the
most likely illness associated with a patient’s details. Based on the result, the system
automatically suggests doctors specific to that result for further treatment, and allows
the user to view the doctors’ details. The system can be used in case of emergency.
Classification
KNN Algorithm
KNN is a simple algorithm that stores all the available cases and classifies new data or
cases based on a similarity measure.
Industrial use cases of the KNN algorithm:
1. Recommender systems
2. Concept search
3. Advanced applications:
   i. Image recognition
   ii. Video recognition
K-nearest Neighbour
K-nearest neighbour is used for classification as well as regression. It is a supervised
learning algorithm. It is an instance-based learning method: we take the training examples
but don't process them; instead we store them, and classification is done only when a new
instance needs to be classified. This is why it is also known as a lazy algorithm.
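As a quick point of comparison with the from-scratch program later in this report, here is
a minimal KNN sketch using scikit-learn (an assumption: scikit-learn is installed, and its
bundled breast-cancer dataset stands in for our own cancer.data file):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The bundled breast-cancer dataset is a stand-in for the report's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# fit() merely stores the training data (lazy learning);
# the real work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print('Accuracy:', knn.score(X_test, y_test))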
Phases of KNN:
(a) Training phase: save the training set.
(b) Prediction: given a test instance x, find the K training examples
{(x1, y1), (x2, y2), ..., (xk, yk)} that are closest to x.
(c) Classification: predict the majority class among (y1, y2, ..., yk).
(d) Regression: predict the average of (y1, y2, ..., yk).
Euclidean distance is the square root of the sum of squared differences. Using several
neighbours (K > 1) and voting or averaging helps when there is noise in the attributes,
noise in the class labels, or when classes overlap.
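A small sketch of the two prediction rules, assuming the labels (or numeric targets) of
the K nearest neighbours have already been collected (the values below are hypothetical):

from collections import Counter

# Hypothetical labels of the K = 3 nearest neighbours of a test instance.
neighbor_labels = ['benign', 'malignant', 'benign']

# Classification: majority vote over the K labels.
majority_class = Counter(neighbor_labels).most_common(1)[0][0]
print(majority_class)  # benign

# Regression: average of the K neighbours' numeric targets.
neighbor_targets = [2.0, 3.5, 3.0]
print(sum(neighbor_targets) / len(neighbor_targets))  # 2.8333...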
How to Choose the Factor K?
K is usually chosen as a small odd number (to avoid tied votes), and a common practice is
to try several values and keep the one that gives the best validation accuracy.
The KNN algorithm is based on feature similarity, measured with a distance metric such as:
1. Euclidean distance
2. Manhattan distance
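For concreteness, a small sketch computing both metrics for two hypothetical feature
vectors:

import math

# Two hypothetical feature vectors.
a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

# Euclidean distance: square root of the sum of squared differences.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of absolute differences.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

print(euclidean, manhattan)  # 3.6055..., 5.0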
Random Forest
A decision tree is basically a set of decisions used to classify a dataset. It is also
known as a classification and regression tree. A random forest is a collection of decision
trees. A random forest iteratively asks a series of questions and, based on the answers,
asks another set of questions to classify the data.
Prediction using a trained random forest:
(a) Build decision trees on randomly drawn subsets of the training data.
(b) For a test instance, take the decision of each tree and choose the highest-voted class
as the final decision (majority vote).
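The report does not include its own random forest implementation, so here is a minimal
library sketch (an assumption: scikit-learn and its bundled breast-cancer dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# 100 trees, each grown on a random (bootstrap) sample of the training
# data; the prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
print('Accuracy:', forest.score(X_test, y_test))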
Decision Tree
Decision trees come under the category of supervised learning. Regression and
classification problems can be solved by a decision tree. It represents a problem in the
form of a tree in which internal nodes serve as attributes and leaf nodes serve as class
labels [3]. Splitting is done in this model to divide a node into two or more sub-nodes.
Below are some assumptions made when building a decision tree:
At first, the whole training set is considered as the root.
Secondly, values are preferred to be categorical.
Records are distributed recursively on the basis of attribute values.
Statistical methods are used for ordering attributes as internal nodes or the root.
There are two types of decision trees:
(a) Categorical variable decision tree: a decision tree with a categorical target
variable; in simple words, where the values are Yes or No.
(b) Continuous variable decision tree: a decision tree with a continuous target variable.
The sketch below makes the categorical case clearer.
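A minimal sketch of a categorical-target decision tree, again assuming scikit-learn and
its bundled breast-cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Internal nodes test attribute values; each leaf carries a class label.
tree = DecisionTreeClassifier(max_depth=4, random_state=1)
tree.fit(X_train, y_train)
print('Accuracy:', tree.score(X_test, y_test))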
Data Preprocessing Using Weka
Data cleaning: fill in missing values (e.g., occupation = “”), smooth noisy data
(e.g., Salary = “-10”), identify or remove outliers, and resolve inconsistencies.
Data integration.
Data transformation.
Data reduction.
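A rough Python sketch of the cleaning step (an assumption: the report itself uses the
Weka GUI; pandas is used here only to make the idea concrete, on a hypothetical table):

import numpy as np
import pandas as pd

# A hypothetical table with a missing occupation and a noisy (negative) salary.
df = pd.DataFrame({'occupation': ['nurse', '', 'clerk'],
                   'Salary': [30000.0, -10.0, 45000.0]})

# Treat empty strings and impossible negative salaries as missing values.
df['occupation'] = df['occupation'].replace('', np.nan)
df.loc[df['Salary'] < 0, 'Salary'] = np.nan

# Fill missing values: mode for categorical, median for numeric columns.
df['occupation'] = df['occupation'].fillna(df['occupation'].mode()[0])
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df)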
Algorithm
Step 1: Data cleaning.
Step 2: Handle the dataset (split it into training and testing sets).
Step 3: Calculate the distance between the test instance and every training instance.
Step 4: Find the k nearest neighbours.
Step 5: Predict the class by majority vote among the neighbours.
Step 6: Check the accuracy of the predictions.
Handling Dataset
Now you need to split the data into a training dataset (for making the prediction) and
a testing dataset (for evaluating the accuracy of the model).
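A small sketch of such a split, mirroring the loadDataset function in the program below
(the 0.67 split ratio is the one used there):

import random

# Send each record to the training set with probability `split`,
# otherwise to the test set.
def split_dataset(dataset, split=0.67):
    training_set, test_set = [], []
    for row in dataset:
        if random.random() < split:
            training_set.append(row)
        else:
            test_set.append(row)
    return training_set, test_set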
Calculate Distance
In order to make any predictions, you have to calculate the distance between the
new point and the existing points, since you will need the k closest points.
In this case, for calculating the distance, we will use the Euclidean distance.
This is defined as the square root of the sum of the squared differences between the
two arrays of numbers.
Find K nearest neighbor
Now that you have calculated the distance from each point, we can use it to collect the k
most similar points/instances for the given test data/instance.
Calculate the distance with respect to all the instances and select the subset having the
smallest Euclidean distances.
You can then predict by allowing each neighbour to vote for its class attribute and taking
the majority vote as the prediction.
Check Accuracy
Now that we have all of the pieces of the kNN algorithm in place, let’s check how
accurate our predictions are!
An easy way to evaluate the accuracy of the model is to calculate the ratio of correct
predictions to all predictions made.
Program

import csv
import random
import math
import operator

# Load the CSV file and split it into training and test sets.
def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # Assign each record to the training set with probability `split`.
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# Euclidean distance: square root of the sum of squared differences.
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow(instance1[x] - instance2[x], 2)
    return math.sqrt(distance)

# Collect the k training instances closest to the test instance.
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1  # exclude the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

# Majority vote over the neighbours' class labels.
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(),
                         key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Accuracy: ratio of correct predictions to all predictions made.
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(r'C:\Users\acer\Desktop\cancer.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
    print('Accuracy: ' + repr(getAccuracy(testSet, predictions)) + '%')

main()
Output
Conclusion
From this work, it is concluded that KNN (k-nearest neighbour) gives the best accuracy
in predicting cancer, whereas the remaining algorithms give poorer results.
References
1. GeeksforGeeks, https://www.geeksforgeeks.org/
2. GitHub, https://github.com/
Team Members:
171FA04363 171FA04377
171FA04394 171FA04417