
A FIELD PROJECT REPORT ON

“Cancer Prediction using Data Mining Techniques”


Submitted
In partial fulfilment of the requirements

for

The award of the degree of

BACHELOR OF TECHNOLOGY
In

COMPUTER SCIENCE & ENGINEERING


By
R. Preethi Priya (171FA04363)
Y. Kavya (171FA04377)
G. Vagdevi (171FA04394)
K. Supriya (171FA04417)

Under the esteemed guidance of


Ms. B. Suvarna, Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


VIGNAN'S FOUNDATION FOR SCIENCE, TECHNOLOGY AND RESEARCH
(Accredited by NAAC “A” grade)
Vadlamudi, Guntur.

VIGNAN’S FOUNDATION FOR SCIENCE, TECHNOLOGY AND RESEARCH

(Accredited by NAAC “A” grade)

CERTIFICATE

This is to certify that the Field Project Report entitled “Cancer Prediction Using Data
Mining Techniques”, being submitted by R. Preethi Priya (171FA04363), Y. Kavya
(171FA04377), G. Vagdevi (171FA04394) and K. Supriya (171FA04417) in partial fulfilment
of the requirements for the award of the B.Tech degree in Computer Science and Engineering
at Vignan’s Foundation for Science, Technology and Research (Deemed to be University), is
a record of bonafide work carried out by them under my supervision.

Ms. B. Suvarna                External Examiner                Dr. D. Venkatesulu
Assistant Professor                                            HOD, CSE

DECLARATION

We hereby declare that the project entitled “Cancer Prediction Using Data Mining
Techniques”, submitted to the Department of Computer Science and Engineering, is our
original work. The project has not formed the basis for the award of any degree,
associateship, fellowship or any other similar title, and no part of it has been published
or sent for publication at the time of submission.

By

R. Preethi Priya (171FA04363)

Y. Kavya (171FA04377)

G. Vagdevi (171FA04394)

K. Supriya (171FA04417)

Date:22-07-2020

ACKNOWLEDGEMENT

We are very grateful to our beloved Chairman, Dr. Lavu Rathaiah, and Vice Chairman,
Mr. Lavu Krishna Devarayalu, for their love and care.

It is our pleasure to extend our sincere thanks to the Vice-Chancellor, Dr. M. Y. S. Prasad,
and the Dean of Engineering & Management, Dr. V. Madhusudhan Rao, for providing us the
opportunity to pursue our academics at VFSTR.

It is a great pleasure to express our sincere thanks to Dr. D. Venkatesulu, HOD, CSE,
VFSTR, for providing us the opportunity to do this mini project.

We feel it our responsibility to thank Dr. U. Sri Lakshmi, under whose valuable guidance
the project came out successfully at each stage.

We extend our wholehearted gratitude to all the faculty members of the Department of
Computer Science and Engineering who helped us in our academics throughout the course.

Finally, we wish to thank our family members for the love, affection, forbearance and
cheerful dispositions that were vital for sustaining the effort required to complete this
work.

With sincere regards,

By

R. Preethi Priya (171FA04363)

Y. Kavya (171FA04377)

G. Vagdevi (171FA04394)

K. Supriya (171FA04417)

TABLE OF CONTENTS

Outcomes and Observations

Abstract

Introduction

KNN

How does KNN work

How to choose the K factor and work on KNN using a dataset

Algorithm

Output

References

OUTCOMES:

• Handling of incomplete data.
• Ensuring the efficiency and scalability of data mining algorithms.
• Mining of large databases.
• Handling of relational and complex data types.

OBSERVATIONS:

• KNN (k-nearest neighbour) gives the best accuracy in predicting cancer, whereas the
remaining techniques give noticeably lower accuracy.
• An instance refers to an observation of a data item in a data set.

Cancer Disease Prediction System

Abstract

Cancer is a leading cause of death for both men and women. Early detection of cancer can
help in curing the disease completely, so the need for techniques that detect cancer
nodules at an early stage is growing. Lung cancer in particular is commonly misdiagnosed.
In this report we measure the accuracy of classification on a cancer dataset using the KNN
algorithm.

Introduction

It may have happened many times that you or someone close to you needs a doctor's help
immediately, but one is not available for some reason. The Cancer Disease Prediction
application is an end-user support and online consultation project. Here, we propose a web
application that allows users to get instant guidance on cancer through an intelligent
online system. The application is fed with various details and the cancer diseases
associated with those details. It allows users to share their health-related issues for
cancer prediction, then processes the user-specific details to check for the various
illnesses that could be associated with them. We use data mining techniques to predict the
most likely illness associated with a patient's details. Based on the result, the system
automatically suggests doctors for further treatment and allows the user to view the
doctors' details. The system can be used in case of emergency.

Classification

Classification is the process of dividing a dataset into different categories or groups by
adding labels. It is all about:
1. Taking data
2. Analyzing it
3. Dividing it into different classes on the basis of some condition

Why do we classify? We classify in order to perform predictive analysis on the data.

Techniques
1. Random Forest
2. Decision Tree
3. Naive Bayes
4. KNN

KNN Algorithm
KNN is a simple algorithm that stores all the available cases and classifies new data or
cases based on a similarity measure.

Industrial use cases of the KNN algorithm:
1. Recommender systems
2. Concept search
3. Advanced applications
   i. Image recognition
   ii. Video recognition

K-nearest Neighbour

K nearest neighbour is used for classification as well as regression. It is a supervised
learning algorithm. It is an instance-based learner: we take the training examples but do
not process them; instead, we store them and only do the work when an instance needs to be
classified. That is why it is also known as a lazy algorithm.
Phases of KNN:
(a) Training phase: save the training set.
(b) Prediction phase: given a test instance x, find the K training examples
{(x1, y1), (x2, y2), ..., (xk, yk)} that are closest to x, and predict the output y from
their labels y1, ..., yk.
(c) Classification: predict the majority class among (y1, y2, ..., yk).
(d) Regression: predict the average of (y1, y2, ..., yk). Euclidean distance is the square
root of the sum of squared differences. Averaging over K neighbours, rather than relying
on a single nearest neighbour, helps when there is noise in the attributes, noise in the
class labels, or when classes overlap.

How does the KNN algorithm work?

How to choose the factor K?
The KNN algorithm is based on feature similarity. Choosing the right value of K is called
parameter tuning, and it is important for good accuracy.
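One common way to tune K, sketched below rather than taken from the report's program, is to try several candidate values and keep the one with the best accuracy on a held-out validation set. Odd values of K help avoid tied votes with two classes.

```python
import math
from collections import Counter

def knn_classify(train_set, query, k):
    # Minimal KNN classifier used only to demonstrate K selection.
    nearest = sorted(train_set, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k(train_set, validation_set, candidates=(1, 3, 5)):
    # Score each candidate K by held-out accuracy and keep the best.
    def accuracy(k):
        hits = sum(knn_classify(train_set, f, k) == y for f, y in validation_set)
        return hits / len(validation_set)
    return max(candidates, key=accuracy)

train = [((0.0,), "a"), ((1.0,), "a"), ((10.0,), "b")]
validation = [((0.5,), "a"), ((9.0,), "b")]
print(best_k(train, validation))  # -> 1: k=1 classifies both held-out points correctly
```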

How are things predicted using KNN?

The KNN algorithm uses a least-distance measure to find the nearest neighbours. There are
several distance measures to choose from:

1. Euclidean distance

2. Manhattan distance
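The two measures can be sketched side by side (a minimal illustration, not part of the report's program):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # -> 5.0
print(manhattan((0, 0), (3, 4)))  # -> 7
```

Euclidean distance is the usual default for KNN on continuous attributes; Manhattan distance is sometimes preferred in high-dimensional data.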

Random Forest
A decision tree is basically a set of decisions used to classify a dataset; it is also
known as a classification and regression tree (CART). A random forest is a collection of
decision trees. A random forest iteratively asks a series of questions and, based on the
answers, asks further questions to classify the data.
Prediction using a trained random forest:
(a) Take the test data set; the forest's decision trees are built on random subsets of the
training data.
(b) Find the decision of each decision tree, then take a majority vote: choose the
highest-voted class as the final decision.
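The report's program implements only KNN, so as an illustrative sketch the majority-vote step (b) is shown here with toy "trees" that are simple random threshold rules standing in for real decision trees:

```python
import random
from collections import Counter

def forest_predict(trees, sample):
    # Each tree votes for a class; the class with the most votes wins,
    # which is the majority-vote step described above.
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy "trees": each one thresholds a randomly chosen feature.
# Real random forests grow full decision trees on bootstrap samples.
random.seed(0)
trees = [
    (lambda s, i=random.randrange(2), t=random.uniform(2, 6):
        "malignant" if s[i] > t else "benign")
    for _ in range(5)
]
print(forest_predict(trees, (7.0, 7.0)))  # every threshold is below 7 -> "malignant"
```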

Decision Tree
It comes under the category of supervised learning. Both regression and classification
problems can be solved by a decision tree. It represents a problem in the form of a tree in
which the internal nodes serve as attributes and the leaf nodes serve as class labels [3].
Splitting is done in this model to divide a node into two or more sub-nodes. Below are some
assumptions made when building a decision tree:
 At first, the whole training set is considered the root.
 Secondly, values are preferred to be categorical.
 Records are distributed recursively on the basis of attribute values; statistical methods
are used for ordering attributes as internal nodes or the root.
There are two types of decision trees:
(a) Categorical Variable Decision Tree: a decision tree with a categorical target variable;
in simple words, one whose outcomes are values such as Yes or No.
(b) Continuous Variable Decision Tree: a decision tree with a continuous target variable.
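A tiny hand-written tree makes the node/leaf structure concrete. The attribute names and thresholds below are made up for illustration only:

```python
def tiny_tree(tumor_size_cm, irregular_border):
    # Internal nodes test attributes; leaves carry class labels.
    if tumor_size_cm > 3.0:            # root: split on a continuous attribute
        return "malignant"
    if irregular_border == "yes":      # internal node: split on a categorical attribute
        return "malignant"
    return "benign"                    # leaf: class label

print(tiny_tree(4.2, "no"))   # -> "malignant" (root test fires)
print(tiny_tree(1.1, "no"))   # -> "benign"   (both tests fail)
```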

Data preprocessing using Weka

 Data in the real world is dirty:

 incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data (e.g., occupation=“”)

 noisy: containing errors or outliers (e.g., Salary=“-10”)

 inconsistent: containing discrepancies in codes or names (e.g., Age=“42” with
Birthday=“03/07/1997”; a rating that was “1, 2, 3” is now “A, B, C”; discrepancies between
duplicate records)

Major tasks in data preprocessing

 Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies.

 Data integration.

 Data transformation.

 Data reduction.
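The "fill in missing values" part of data cleaning can be sketched in a few lines, assuming for illustration that missing numeric entries are stored as None:

```python
def fill_missing_with_mean(column):
    # Replace each missing entry with the mean of the known values,
    # a simple form of the "fill in missing values" cleaning task.
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

print(fill_missing_with_mean([1.0, None, 3.0]))  # -> [1.0, 2.0, 3.0]
```

Weka offers this same imputation through its ReplaceMissingValues filter; the sketch just shows the idea.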

Algorithm

Step 1: Data cleaning.
Step 2: Handling the data.
Step 3: Calculate the distance.
Step 4: Find the k nearest points.
Step 5: Predict the class.
Step 6: Check the accuracy.

Handling Dataset
 Now you need to split the data into a training dataset (for making the prediction) and
a testing dataset (for evaluating the accuracy of the model).

 Generally, a standard ratio of 67/33 is used for the train/test split.

Calculate Distance
 In order to make any predictions, you have to calculate the distance between the
new point and the existing points, as you will be needing k closest points.

 In this case for calculating the distance, we will use the Euclidean distance.

 This is defined as the square root of the sum of the squared differences between the
two arrays of numbers.

Find K Nearest Neighbours
 Now that you have calculated the distance from each point, you can use it to collect the
k most similar points/instances for the given test data/instance.

 This is a straightforward process: calculate the distance with respect to all the
instances and select the subset having the smallest Euclidean distances.

Predict the Response

 Now that you have the k nearest points/neighbours for the given test instance, the next
task is to predict the response based on those neighbours.

 You can do this by allowing each neighbour to vote for its class attribute, and taking
the majority vote as the prediction.

Check Accuracy
 Now that we have all of the pieces of the kNN algorithm in place, let's check how
accurate our predictions are!

 An easy way to evaluate the accuracy of the model is to calculate the ratio of correct
predictions to the total number of predictions made.

Program

import csv
import math
import operator
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    # Read the CSV file and randomly assign each row to the training
    # or test set according to the split ratio.
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(4):
                # The first four columns are numeric attributes.
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    # Square root of the sum of squared differences over the attributes.
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # Rank all training instances by distance and return the k closest.
    distances = []
    length = len(testInstance) - 1  # last column is the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # Majority vote over the neighbours' class labels.
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    # Ratio of correct predictions to all predictions, as a percentage.
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(r'C:\Users\acer\Desktop\cancer.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()

Output

(The original report showed a screenshot of the program's console output here: the train
and test set sizes, the per-instance predicted and actual labels, and the final accuracy.)
Conclusion

From this work it is concluded that KNN (k-nearest neighbour) gives the best accuracy in
predicting cancer, whereas the remaining techniques give noticeably lower accuracy.

References
1. GeeksforGeeks (geeksforgeeks.org)
2. GitHub (github.com)

Team Members:

171FA04363 171FA04377

171FA04394 171FA04417
