
A FIELD PROJECT REPORT ON

“Cancer Prediction using Data Mining Techniques”


Submitted
In partial fulfilment of the requirements

for

The award of the degree of

BACHELOR OF TECHNOLOGY
In

COMPUTER SCIENCE & ENGINEERING


By
R. Preethi Priya (171FA04363)
Y. Kavya (171FA04377)
G. Vagdevi (171FA04394)
K. Supriya (171FA04417)

Under the esteemed guidance of


Ms. B. Suvarna, Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


VIGNAN'S FOUNDATION FOR SCIENCE, TECHNOLOGY AND RESEARCH
(Accredited by NAAC “A” grade)
Vadlamudi, Guntur.

VIGNAN’S FOUNDATION FOR SCIENCE, TECHNOLOGY AND RESEARCH

(Accredited by NAAC “A” grade)

CERTIFICATE

This is to certify that the Field Project Report entitled “Cancer Prediction Using Data
Mining Techniques”, being submitted by R. Preethi Priya (171FA04363), Y. Kavya
(171FA04377), G. Vagdevi (171FA04394) and K. Supriya (171FA04417) in partial fulfilment
of the requirements for the award of the B.Tech degree in Computer Science and Engineering
at Vignan’s Foundation for Science, Technology and Research (Deemed to be University), is
a record of bonafide work carried out by them under my supervision.

Ms. B. Suvarna                External Examiner                Dr. D. Venkatesulu
Assistant Professor                                            HOD, CSE

DECLARATION

We hereby declare that the project entitled “Cancer Prediction Using Data Mining
Techniques”, submitted to the Department of Computer Science and Engineering, is our
original work. The project has not formed the basis for the award of any degree,
associateship, fellowship or any other similar title, and no part of it has been published
or sent for publication at the time of submission.

By

R. Preethi Priya (171FA04363)

Y. Kavya (171FA04377)

G. Vagdevi (171FA04394)

K. Supriya (171FA04417)

Date:22-07-2020

ACKNOWLEDGEMENT

We are very grateful to our beloved Chairman, Dr. Lavu Rathaiah, and Vice Chairman,
Mr. Lavu Krishna Devarayalu, for their love and care.

It is our pleasure to extend our sincere thanks to the Vice-Chancellor, Dr. M. Y. S. Prasad,
and the Dean of Engineering & Management, Dr. V. Madhusudhan Rao, for providing us the
opportunity to pursue our academics at VFSTR.

It is a great pleasure to express our sincere thanks to Dr. D. Venkatesulu, HOD, CSE,
VFSTR, for providing us the opportunity to do this mini project.

We feel it our responsibility to thank Dr. U. Sri Lakshmi, under whose valuable guidance
the project came out successfully at each stage.

We extend our wholehearted gratitude to all the faculty members of the Department of
Computer Science and Engineering who helped us in our academics throughout the course.

Finally, we wish to thank our family members for the love, affection, forbearance and
cheerful dispositions that were vital for sustaining the effort required to complete this
work.

With sincere regards,

By

R. Preethi Priya (171FA04363)

Y. Kavya (171FA04377)

G. Vagdevi (171FA04394)

K. Supriya (171FA04417)

TABLE OF CONTENTS

Outcomes and Observations

Abstract

Introduction

KNN

How does KNN work

How to choose the K factor and work on KNN using a dataset

Algorithm

Output

References

OUTCOMES:

• Handling of incomplete data.
• Ensuring the efficiency and scalability of data mining algorithms.
• Mining of large databases.
• Handling of relational and complex data types.

OBSERVATIONS:

• KNN (k-nearest neighbour) gives the best accuracy in predicting cancer, whereas the
remaining techniques give noticeably lower accuracy.
• An instance refers to an observation of a data item in a data set.

Cancer Disease Prediction System

Abstract

Cancer is a leading cause of death for both men and women. Early detection of cancer can
help in curing the disease completely, so the need for techniques that detect cancer
nodules at an early stage is growing. Lung cancer in particular is commonly misdiagnosed.
In this report we measure the accuracy of classification on a cancer dataset using the KNN
algorithm.

Introduction

It may have happened many times that you or someone close to you needs a doctor's help
immediately, but one is not available for some reason. The Cancer Disease Prediction
application is an end-user support and online consultation project. Here, we propose a web
application that allows users to get instant guidance on cancer through an intelligent
online system. The application is fed with various details and the cancer diseases
associated with those details. It allows users to share their health-related issues for
cancer prediction, then processes the user-specific details to check for the various
illnesses that could be associated with them. We use data mining techniques to predict the
most likely illness associated with a patient's details. Based on the result, the system
automatically suggests doctors for further treatment and allows the user to view the
doctors' details. The system can be used in case of emergency.

Classification

Classification is the process of dividing a dataset into different categories or groups by
adding labels. It is all about:
1. Taking data
2. Analyzing it
3. Dividing it into different classes on the basis of some condition

Why do we classify? We classify in order to perform predictive analysis on the data.

Techniques
1. Random Forest
2. Decision Tree
3. Naive Bayes
4. KNN

KNN Algorithm
KNN is a simple algorithm that stores all the available cases and classifies new data or
cases based on a similarity measure.

Industrial use cases of the KNN algorithm:
1. Recommender systems
2. Concept search
3. Advanced applications
   i. Image recognition
   ii. Video recognition

K-nearest Neighbour

K nearest neighbour is used for classification as well as regression. It is a supervised
learning algorithm. It is an instance-based learner: we take the training examples but do
not process them; instead, we store them and only do the work when an instance needs to be
classified. That is why it is also known as a lazy algorithm.
Phases of KNN:
(a) Training phase: save the training set.
(b) Prediction phase: given a test instance x, find the K training examples
{(x1, y1), (x2, y2), ..., (xk, yk)} that are closest to x, and predict the output y from
their labels y1, ..., yk.
(c) Classification: predict the majority class among (y1, y2, ..., yk).
(d) Regression: predict the average of (y1, y2, ..., yk). Euclidean distance is the square
root of the sum of squared differences. Averaging over K neighbours, rather than relying
on a single nearest neighbour, helps when there is noise in the attributes, noise in the
class labels, or when classes overlap.

How does the KNN algorithm work?

How to choose the factor K?
The KNN algorithm is based on feature similarity. Choosing the right value of K is called
parameter tuning, and it is important for good accuracy.
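One common way to tune K, sketched below rather than taken from the report's program, is to try several candidate values and keep the one with the best accuracy on a held-out validation set. Odd values of K help avoid tied votes with two classes.

```python
import math
from collections import Counter

def knn_classify(train_set, query, k):
    # Minimal KNN classifier used only to demonstrate K selection.
    nearest = sorted(train_set, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k(train_set, validation_set, candidates=(1, 3, 5)):
    # Score each candidate K by held-out accuracy and keep the best.
    def accuracy(k):
        hits = sum(knn_classify(train_set, f, k) == y for f, y in validation_set)
        return hits / len(validation_set)
    return max(candidates, key=accuracy)

train = [((0.0,), "a"), ((1.0,), "a"), ((10.0,), "b")]
validation = [((0.5,), "a"), ((9.0,), "b")]
print(best_k(train, validation))  # -> 1: k=1 classifies both held-out points correctly
```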

How are things predicted using KNN?

The KNN algorithm uses a least-distance measure to find the nearest neighbours. There are
several distance measures to choose from:

1. Euclidean distance

2. Manhattan distance
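The two measures can be sketched side by side (a minimal illustration, not part of the report's program):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # -> 5.0
print(manhattan((0, 0), (3, 4)))  # -> 7
```

Euclidean distance is the usual default for KNN on continuous attributes; Manhattan distance is sometimes preferred in high-dimensional data.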

Random Forest
A decision tree is basically a set of decisions used to classify a dataset; it is also
known as a classification and regression tree (CART). A random forest is a collection of
decision trees. A random forest iteratively asks a series of questions and, based on the
answers, asks further questions to classify the data.
Prediction using a trained random forest:
(a) Take the test data set; the forest's decision trees are built on random subsets of the
training data.
(b) Find the decision of each decision tree, then take a majority vote: choose the
highest-voted class as the final decision.
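The report's program implements only KNN, so as an illustrative sketch the majority-vote step (b) is shown here with toy "trees" that are simple random threshold rules standing in for real decision trees:

```python
import random
from collections import Counter

def forest_predict(trees, sample):
    # Each tree votes for a class; the class with the most votes wins,
    # which is the majority-vote step described above.
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy "trees": each one thresholds a randomly chosen feature.
# Real random forests grow full decision trees on bootstrap samples.
random.seed(0)
trees = [
    (lambda s, i=random.randrange(2), t=random.uniform(2, 6):
        "malignant" if s[i] > t else "benign")
    for _ in range(5)
]
print(forest_predict(trees, (7.0, 7.0)))  # every threshold is below 7 -> "malignant"
```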

Decision Tree
It comes under the category of supervised learning. Both regression and classification
problems can be solved by a decision tree. It represents a problem in the form of a tree in
which the internal nodes serve as attributes and the leaf nodes serve as class labels [3].
Splitting is done in this model to divide a node into two or more sub-nodes. Below are some
assumptions made when building a decision tree:
 At first, the whole training set is considered the root.
 Secondly, values are preferred to be categorical.
 Records are distributed recursively on the basis of attribute values; statistical methods
are used for ordering attributes as internal nodes or the root.
There are two types of decision trees:
(a) Categorical Variable Decision Tree: a decision tree with a categorical target variable;
in simple words, one whose outcomes are values such as Yes or No.
(b) Continuous Variable Decision Tree: a decision tree with a continuous target variable.
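A tiny hand-written tree makes the node/leaf structure concrete. The attribute names and thresholds below are made up for illustration only:

```python
def tiny_tree(tumor_size_cm, irregular_border):
    # Internal nodes test attributes; leaves carry class labels.
    if tumor_size_cm > 3.0:            # root: split on a continuous attribute
        return "malignant"
    if irregular_border == "yes":      # internal node: split on a categorical attribute
        return "malignant"
    return "benign"                    # leaf: class label

print(tiny_tree(4.2, "no"))   # -> "malignant" (root test fires)
print(tiny_tree(1.1, "no"))   # -> "benign"   (both tests fail)
```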

Data preprocessing using Weka

 Data in the real world is dirty:

 incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data (e.g., occupation=“”)

 noisy: containing errors or outliers (e.g., Salary=“-10”)

 inconsistent: containing discrepancies in codes or names (e.g., Age=“42” with
Birthday=“03/07/1997”; a rating that was “1, 2, 3” is now “A, B, C”; discrepancies between
duplicate records)

Major tasks in data preprocessing

 Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies.

 Data integration.

 Data transformation.

 Data reduction.
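The "fill in missing values" part of data cleaning can be sketched in a few lines, assuming for illustration that missing numeric entries are stored as None:

```python
def fill_missing_with_mean(column):
    # Replace each missing entry with the mean of the known values,
    # a simple form of the "fill in missing values" cleaning task.
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

print(fill_missing_with_mean([1.0, None, 3.0]))  # -> [1.0, 2.0, 3.0]
```

Weka offers this same imputation through its ReplaceMissingValues filter; the sketch just shows the idea.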

Algorithm

Step 1: Data cleaning.
Step 2: Handling the data.
Step 3: Calculate the distance.
Step 4: Find the k nearest points.
Step 5: Predict the class.
Step 6: Check the accuracy.

Handling Dataset
 Now you need to split the data into a training dataset (for making the prediction) and
a testing dataset (for evaluating the accuracy of the model).

 Generally, a standard ratio of 67/33 is used for the train/test split.

Calculate Distance
 In order to make any predictions, you have to calculate the distance between the
new point and the existing points, as you will be needing k closest points.

 In this case for calculating the distance, we will use the Euclidean distance.

 This is defined as the square root of the sum of the squared differences between the
two arrays of numbers.

Find K Nearest Neighbours
 Now that you have calculated the distance from each point, you can use it to collect the
k most similar points/instances for the given test data/instance.

 This is a straightforward process: calculate the distance with respect to all the
instances and select the subset having the smallest Euclidean distances.

Predict the Response

 Now that you have the k nearest points/neighbours for the given test instance, the next
task is to predict the response based on those neighbours.

 You can do this by allowing each neighbour to vote for its class attribute, and taking
the majority vote as the prediction.

Check Accuracy
 Now that we have all of the pieces of the kNN algorithm in place, let's check how
accurate our predictions are!

 An easy way to evaluate the accuracy of the model is to calculate the ratio of correct
predictions to the total number of predictions made.

Program

import csv
import math
import operator
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    # Read the CSV file and randomly assign each row to the training
    # or test set according to the split ratio.
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(4):
                # The first four columns are numeric attributes.
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    # Square root of the sum of squared differences over the attributes.
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # Rank all training instances by distance and return the k closest.
    distances = []
    length = len(testInstance) - 1  # last column is the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # Majority vote over the neighbours' class labels.
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    # Ratio of correct predictions to all predictions, as a percentage.
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(r'C:\Users\acer\Desktop\cancer.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()

Output

(The original report showed a screenshot of the program's console output here: the train
and test set sizes, the per-instance predicted and actual labels, and the final accuracy.)
Conclusion

From this work it is concluded that KNN (k-nearest neighbour) gives the best accuracy in
predicting cancer, whereas the remaining techniques give noticeably lower accuracy.

References
1. GeeksforGeeks (geeksforgeeks.org)
2. GitHub (github.com)

Team Members:

171FA04363 171FA04377

171FA04394 171FA04417
