Data Mining Notes
Guide
1. Classification
Classification is a supervised learning technique that assigns items in a dataset to predefined
categories or classes. Think of it as sorting emails into “spam” or “not spam” based on
their characteristics.
Data Generalization
Data generalization involves reducing the complexity of data while maintaining its essential
patterns. This process helps in:
- Converting raw data into meaningful concepts (like age ranges instead of exact ages)
- Creating concept hierarchies (e.g., city → state → country)
- Reducing noise and handling missing values
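The two generalization steps named above (exact ages to age ranges, and climbing a city → state → country hierarchy) can be sketched roughly as follows. The lookup tables, the 10-year bin width, and the function names are illustrative assumptions, not part of the notes.

```python
# Hypothetical concept hierarchy: city -> state -> country (made-up sample data).
CITY_TO_STATE = {"Austin": "Texas", "Dallas": "Texas", "Miami": "Florida"}
STATE_TO_COUNTRY = {"Texas": "USA", "Florida": "USA"}


def age_range(age, width=10):
    """Replace an exact age with a fixed-width range label, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_city(city, level):
    """Climb the hierarchy: level 0 = city, 1 = state, 2 = country."""
    if level == 0:
        return city
    state = CITY_TO_STATE[city]
    if level == 1:
        return state
    return STATE_TO_COUNTRY[state]


records = [(34, "Austin"), (37, "Dallas"), (52, "Miami")]
generalized = [(age_range(age), generalize_city(city, 1)) for age, city in records]
print(generalized)
# [('30-39', 'Texas'), ('30-39', 'Texas'), ('50-59', 'Florida')]
```

After generalization the first two records become identical, which is the point: fewer distinct values, same broad pattern.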
Analytical Characterization
This involves analyzing data to understand its key characteristics:
- Data distribution and central tendencies
- Data quality assessment
- Feature correlation analysis
- Pattern identification in different classes
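As a small illustration of two of these checks, the sketch below computes a per-class mean (a central tendency) and a Pearson correlation between two features. The toy rows and the helper names are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (made up): (feature_x, feature_y, class_label)
rows = [(1.0, 2.0, "a"), (2.0, 4.1, "a"), (3.0, 6.2, "b"), (4.0, 7.9, "b")]


def class_means(rows):
    """Mean of feature_x within each class."""
    groups = defaultdict(list)
    for x, _, label in rows:
        groups[label].append(x)
    return {label: mean(xs) for label, xs in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither feature is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


print(class_means(rows))
print(pearson([r[0] for r in rows], [r[1] for r in rows]))
```

A correlation close to 1 or -1 suggests the two features carry largely redundant information.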
Statistical-Based Algorithms
These algorithms use probability theory and statistical inference:
- Naive Bayes Classifier
- Bayesian Networks
- Maximum Likelihood Estimation
- Statistical hypothesis testing
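To make the first item concrete, here is a minimal sketch of a Naive Bayes classifier for the spam example, assuming bag-of-words features and add-one (Laplace) smoothing. The training messages and the function names are invented for illustration; this is one common variant, not the only formulation.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (list_of_words, label). Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, words):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "not spam"),
    (["lunch", "tomorrow", "now"], "not spam"),
]
model = train(docs)
print(predict(model, ["free", "money"]))        # spam
print(predict(model, ["meeting", "tomorrow"]))  # not spam
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log likelihoods can simply be added.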
Distance-Based Algorithms
These algorithms use distance metrics to classify items:
- k-Nearest Neighbors (kNN)
- Distance-weighted classification
- Metric learning approaches
Common distance measures include Euclidean distance, Manhattan distance, and cosine
distance (one minus cosine similarity).
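The sketch below shows the three distance measures and a plain majority-vote kNN built on top of them. The training points, the default k=3, and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_predict(train, (2.0, 2.0)))                      # A
print(knn_predict(train, (6.0, 7.0), distance=manhattan))  # B
```

Passing the distance function as a parameter makes it easy to compare how the choice of metric changes the neighbors that get a vote.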
3. Clustering
Introduction to Clustering
Clustering is an unsupervised learning technique that groups similar items together. Unlike
classification, it doesn’t require pre-labeled data.
Hierarchical Clustering
Chameleon
Density-Based Methods
DBSCAN
OPTICS
- Extension of DBSCAN
- Creates a reachability plot
- Handles clusters of varying density
Grid-Based Methods
STING (Statistical Information Grid)
CLIQUE
Model-Based Methods
Statistical approaches include:
- Expectation-Maximization (EM) algorithm
- Gaussian Mixture Models
- Hidden Markov Models
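The first two items fit together: EM is the usual way to fit a Gaussian mixture. Below is a minimal sketch for one-dimensional data, assuming the number of components and the starting means are supplied by the caller. The data, the initial means, the fixed iteration count, and the variance floor are all illustrative choices.

```python
import math


def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(mus)
    mus = list(mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
    return weights, mus, variances


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, mus, variances = em_gmm_1d(data, mus=[0.0, 6.0])
print(weights, mus, variances)
```

Each point is then assigned to the component with the highest responsibility, which gives a soft form of clustering.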
4. Association Rules
Introduction
Association rule mining finds interesting relationships in large datasets, like “customers
who buy bread often buy butter.”
Large Itemsets
- Frequent itemset mining
- Support and confidence metrics
- Minimum support thresholds
- Closure properties
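Support and confidence can be computed directly from their usual definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule A → B is support(A ∪ B) divided by support(A). The sketch below uses a made-up set of transactions built around the bread-and-butter example.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"bread", "butter"}, transactions))                   # 0.6
print(round(confidence({"bread"}, {"butter"}, transactions), 2))    # 0.75
```

An itemset counts as "large" (frequent) when its support meets the chosen minimum support threshold. Adding an item can never raise support, so every subset of a frequent itemset is itself frequent.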
Basic Algorithms
- Apriori algorithm
- FP-growth algorithm
- Eclat algorithm
- Performance considerations
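As a rough illustration of the first algorithm, the sketch below implements the level-wise idea behind Apriori: build size-k candidates from frequent (k-1)-itemsets, prune any candidate with an infrequent subset, then count support. The transactions, the 0.6 threshold, and the function names are assumptions, and the code favors readability over speed.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sup(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent


transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
frequent = apriori(transactions, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

This version rescans all transactions for each candidate, which is the main cost of the level-wise approach. FP-growth and Eclat reduce that cost with a prefix tree and a vertical (item to transaction ID) layout, respectively.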