
APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Data Mining and Warehousing (22CSH-380)


Faculty: Dr. Preeti Khera (E16576)

Lecture – 3.1.5 & 3.1.6


Classification Methods: K-Nearest Neighbor Classifiers, Genetic Algorithm

Data Mining and Warehousing: Course Objectives

COURSE OBJECTIVES
The Course aims to:

1. Develop an understanding of the key concepts of data mining and obtain knowledge of how to extract useful characteristics from data using data pre-processing techniques.
2. Demonstrate methods to apply and analyze relevant attributes, perform statistical measures to look for meaningful variation in data, and mine association rules for transactional datasets.
3. Teach the use and application of data mining techniques such as classification, decision trees, neural networks, back-propagation, and many more, in various applications.

COURSE OUTCOMES
On completion of this course, the students shall be able to:

CO1: Understand the concept of data mining and the usage of various tools for data warehousing and data mining.

CO2: Demonstrate the strengths and weaknesses of different methods of meaningful data mining.

CO3: Apply association rule, classification, and clustering algorithms for large data sets.

CO4: Evaluate and employ correct data mining techniques depending on the characteristics of the dataset.

CO5: Verify and formulate the performance of various data mining techniques according to the dataset.

Unit-3 Syllabus

Unit-3
What is Classification & Prediction, Issues regarding Classification and Prediction, Decision Tree, Bayesian Classification, Classification by Back-propagation, Multilayer Feed-forward Neural Network, Back-propagation Algorithm, Classification Methods: K-nearest Neighbor Classifiers, Genetic Algorithm.
Cluster Analysis: Data Types in Cluster Analysis, Categories of Clustering Methods, Partitioning Methods. Hierarchical Clustering: CURE and Chameleon. Density-Based Methods: DBSCAN, OPTICS. Grid-Based Methods: STING, CLIQUE. Model-Based Methods: Statistical Approach, Neural Network Approach. Outlier Analysis.

Table of Contents
• k-Nearest Neighbor Algorithm
• Genetic Algorithm
Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference



The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space.


• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: positive (+) and negative (-) training examples around the query point xq; the decision surface induced by 1-NN forms a Voronoi diagram.]
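To make the majority-vote rule concrete, here is a minimal Python sketch of a k-NN classifier (an illustrative example only; the toy dataset and function names are assumptions, not part of the lecture).

```python
# Minimal k-NN classifier sketch: label the query with the most common class
# among its k nearest training examples under Euclidean distance.
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, x_q, k=3):
    """Return the majority class among the k training points nearest to x_q."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], x_q))
    k_labels = [label for _, label in neighbors[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy usage: two 2-D classes separated along both axes.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["neg", "neg", "pos", "pos"]
print(knn_classify(X, y, (4.8, 5.1), k=3))   # expected: "pos"
```

For practical work, a library implementation such as scikit-learn's KNeighborsClassifier is usually preferable to hand-rolled code.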
Discussion on the k-NN Algorithm
• The k-NN algorithm for continuous-valued target functions
• Calculate the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
• Weight the contribution of each of the k neighbors according
to their distance to the query point xq
• giving greater weight to closer neighbors: w ≡ 1/d(xq, xi)^2
• Similarly, for real-valued target functions
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be
dominated by irrelevant attributes.
• To overcome it, stretch the axes or eliminate the least relevant attributes.
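As a sketch of the distance-weighted variant, the example below predicts a real-valued target as a weighted mean of the k nearest neighbors using the w ≡ 1/d(xq, xi)^2 weighting from the slide (the dataset and helper names are illustrative assumptions).

```python
# Distance-weighted k-NN for a real-valued target: closer neighbors count more.
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress_weighted(train_X, train_y, x_q, k=3, eps=1e-12):
    """Weighted mean of the k nearest targets, with weights w = 1/d^2."""
    ranked = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], x_q))[:k]
    num = den = 0.0
    for x_i, y_i in ranked:
        d = euclidean(x_i, x_q)
        if d < eps:                  # query coincides with a training point
            return y_i
        w = 1.0 / (d * d)            # w = 1 / d(x_q, x_i)^2
        num += w * y_i
        den += w
    return num / den

# Toy usage: 1-D inputs with a roughly linear target.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.1, 0.9, 2.1, 3.0]
print(knn_regress_weighted(X, y, (1.4,), k=3))   # about 1.2, between 0.9 and 2.1
```

scikit-learn offers a similar option via weights='distance' on its neighbor estimators, though that weighting is 1/d rather than 1/d^2.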
Case-Based Reasoning

• Also uses lazy evaluation and analysis of similar instances


• Difference: Instances are not “points in a Euclidean space”
• Example: Water faucet problem in CADET (Sycara et al., 1992)
• Methodology
• Instances represented by rich symbolic descriptions (e.g.,
function graphs)
• Multiple retrieved cases may be combined
• Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
• Research issues
• Indexing based on syntactic similarity measures and, when this fails, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning

• Instance-based learning: lazy evaluation


• Decision-tree and Bayesian classification: eager evaluation
• Key differences
• Lazy methods may consider the query instance xq when deciding how to generalize beyond the training data D
• Eager methods cannot, since they have already chosen a global approximation before seeing the query
• Efficiency: Lazy - less time training but more time predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many
local linear functions to form its implicit global approximation to the target
function
• Eager: must commit to a single hypothesis that covers the entire instance
space
Genetic Algorithms

• GA: based on an analogy to biological evolution


• Each rule is represented by a string of bits
• An initial population is created consisting of randomly generated rules
• e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
• Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation

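The following toy Python sketch illustrates the ideas on this slide: rules encoded as bit strings, fitness measured on training examples, and offspring produced by crossover and mutation. The dataset, fitness definition, and parameter values are illustrative assumptions, not taken from the lecture.

```python
# Toy genetic algorithm for evolving a classification rule encoded as 3 bits:
# (A1 value, A2 value, class bit), so "100" means IF A1 AND NOT A2 THEN C2.
import random

# Training data: (A1, A2) attribute bits and a class bit (0 = C2, 1 = C1).
DATA = [((1, 0), 0), ((1, 1), 1), ((0, 0), 1), ((0, 1), 1),
        ((1, 0), 0), ((1, 1), 1)]

def fitness(rule):
    """Accuracy of 'IF A1=b1 AND A2=b2 THEN class=b3' on the rows it covers."""
    b1, b2, b3 = rule
    covered = [(x, c) for x, c in DATA if x == (b1, b2)]
    if not covered:
        return 0.0
    return sum(1 for _, c in covered if c == b3) / len(covered)

def crossover(p, q):
    """Single-point crossover of two parent bit strings."""
    point = random.randint(1, len(p) - 1)
    return p[:point] + q[point:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with the given probability."""
    return tuple(b ^ 1 if random.random() < rate else b for b in rule)

def evolve(pop_size=20, generations=30):
    pop = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # survival of the fittest
        parents = pop[: pop_size // 2]
        offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(pop_size - len(parents))]
        pop = parents + offspring
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))   # e.g. (1, 0, 0), i.e. "100": IF A1 AND NOT A2 THEN C2
```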


Rough Set Approach

• Rough sets are used to approximately, or “roughly”, define equivalence classes
• A rough set for a given class C is approximated by two sets: a
lower approximation (certain to be in C) and an upper
approximation (cannot be described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes (for
feature reduction) is NP-hard but a discernibility matrix is
used to reduce the computation intensity

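A small sketch of how the lower and upper approximations can be computed from the equivalence classes induced by the conditional attributes (the objects and attribute values below are made-up assumptions):

```python
# Lower/upper approximation of a class C in rough set terms.
from collections import defaultdict

# Objects described by their conditional attribute values, and a target class C.
objects = {"o1": ("high", "yes"), "o2": ("high", "yes"),
           "o3": ("low", "no"), "o4": ("high", "no")}
C = {"o1", "o3"}                       # the class we want to approximate

# Equivalence classes: objects that are indiscernible on the attributes.
blocks = defaultdict(set)
for name, attrs in objects.items():
    blocks[attrs].add(name)

lower, upper = set(), set()
for block in blocks.values():
    if block <= C:                     # entirely inside C: certainly in C
        lower |= block
    if block & C:                      # overlaps C: cannot be ruled out of C
        upper |= block

print("lower approximation:", lower)   # {'o3'}
print("upper approximation:", upper)   # {'o1', 'o2', 'o3'}
```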


Fuzzy Set Approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent
the degree of membership (such as using fuzzy membership
graph)
• Attribute values are converted to fuzzy values
• e.g., income is mapped into the discrete categories {low,
medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
• Each applicable rule contributes a vote for membership in the
categories
• Typically, the truth values for each predicted category are
summed
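As an illustration, the sketch below maps a crisp income value into fuzzy memberships for {low, medium, high} using triangular membership functions; the breakpoints are illustrative assumptions, not values from the lecture.

```python
# Fuzzy membership sketch: a crisp income gets a degree of membership in each set.
def triangular(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def income_memberships(income):
    """Degrees of membership of a crisp income in the fuzzy sets low/medium/high."""
    return {
        "low":    triangular(income, 0, 20_000, 45_000),
        "medium": triangular(income, 30_000, 55_000, 80_000),
        "high":   triangular(income, 65_000, 90_000, 200_000),
    }

# A single income can belong to more than one fuzzy set at once; each applicable
# rule then contributes a vote, and the votes per category are summed.
print(income_memberships(40_000))   # {'low': 0.2, 'medium': 0.4, 'high': 0.0}
```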
Summary
• Instance-based Methods
• k-Nearest Neighbor Algorithm
• Genetic Algorithm

Assignment
• Examine the key features of the k-nearest neighbor classifier.
• Determine the process of selecting the value of k in kNN.
• Explain in detail the motivation of using Genetic Algorithm.

References
TEXT BOOKS
T1: Tan, Steinbach and Vipin Kumar. Introduction to Data Mining, Pearson Education, 2016.
T2: Zaki MJ, Meira Jr W, Meira W. Data mining and machine learning: Fundamental concepts and algorithms.
Cambridge University Press; 2020 Jan 30.
T3: King RS. Cluster analysis and data mining: An introduction. Mercury Learning and Information; 2015 May
12.

REFERENCE BOOKS
R1: Pei, Han and Kamber. Data Mining: Concepts and Techniques, Elsevier, 2011.
R2: Halgamuge SK, Wang L, editors. Classification and clustering for knowledge discovery. Springer Science
& Business Media; 2005 Sep 2.
R3: Bhatia P. Data mining and data warehousing: principles and practical techniques. Cambridge University
Press; 2019 Jun 27.

JOURNALS
• https://www.igi-global.com/journal/international-journal-data-warehousing-mining/1085
• https://www.springer.com/journal/41060
• https://link.springer.com/journal/10618
References
RESEARCH PAPER
• Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences. 2017 Sep;12(16):4102-7.
• Freitas AA. A survey of evolutionary algorithms for data mining and knowledge discovery. In: Advances in Evolutionary Computing: Theory and Applications. 2003 Jan 1 (pp. 819-845). Berlin, Heidelberg: Springer Berlin Heidelberg.
• Kumbhare TA, Chobe SV. An overview of association rule mining algorithms. International Journal of Computer Science and Information Technologies. 2014 Feb;5(1):927-30.
• Srivastava S. Weka: a tool for data preprocessing, classification, ensemble, clustering and association rule mining. International Journal of Computer Applications. 2014 Jan 1;88(10).
• Dol SM, Jawandhiya PM. Classification technique and its combination with clustering and association rule mining in educational data mining—A survey. Engineering Applications of Artificial Intelligence. 2023 Jun 1;122:106071.

• WEB LINK
https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
https://www.javatpoint.com/genetic-algorithm-in-machine-learning

• VIDEO LINK
https://youtu.be/Z_8MpZeMdD4
THANK YOU

For queries
Email: [email protected]
