Data Mining Overview
Definition:
Data mining is the process of discovering patterns, relationships, and useful information from large
datasets.
Motivation:
- The growth of data from diverse sources (IoT, social media, business, science, etc.).
- The need to make informed decisions based on patterns and trends in data.
- Competitive advantage in various fields such as healthcare, finance, marketing, and science.
Key Functionalities:
1. Association Analysis: Discovering rules that reveal relationships between variables (e.g., 'If X,
then Y').
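A rule of the form 'If X, then Y' is usually judged by its support (how often X and Y occur together) and its confidence (how often Y occurs when X does). The sketch below computes both on a small set of transactions. The items and baskets are invented for illustration, and the function names are hypothetical, not taken from any library.

```python
# Toy transactions; the items and baskets are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(X and Y) / support(X)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Rule under test: 'If bread, then milk'.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 2 of 4 baskets, and bread appears in 3, so the rule has support 0.5 and confidence 2/3.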
Data Cleaning:
- Handling Missing Values: Replace, remove, or predict missing entries.
- Handling Noisy Data: Use techniques like binning, regression, or clustering to smooth data.
validation.
Data Integration:
Data Transformation:
Data Reduction:
- Data Cube Aggregation: Summarizing data at higher levels (e.g., regional vs. store-level sales).
- Data Compression: Use lossless or lossy compression methods to store data compactly.
- Concept Hierarchy Generation: Organize data into multiple levels (e.g., 'city' -> 'state' -> 'country').
1. Binning: Smooth data by grouping it into bins (e.g., bin means, medians).
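Smoothing by bin means can be sketched as follows. This is a minimal version that assumes equal-frequency bins of a fixed size; the function name and the sample prices are illustrative only.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into equal-frequency bins, and replace
    every value with the mean of its bin.

    The result is returned in sorted order, not in the input order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

With a bin size of 3, the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so the values are replaced by 9, 22, and 29 respectively. Swapping `mean` for `statistics.median` would give smoothing by bin medians instead.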
- Inconsistent Data: Occurs due to duplicate records, schema differences, or incorrect entries.
Resolve using:
1. Rule-based corrections.
1. Data Cube Aggregation: Create summaries by aggregating data across dimensions (e.g., sales
by region, time).
2. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) reduce the
number of attributes (dimensions).
3. Data Compression: Compress data storage using algorithms like Huffman coding or wavelet
transforms.
4. Numerosity Reduction: Replace original data with models (parametric like regression or
non-parametric like histograms, clustering, or sampling).
5. Discretization: Group numeric values into intervals (e.g., age: 0-18, 19-35, etc.).
6. Concept Hierarchy Generation: Summarize data at higher abstraction levels (e.g., product types
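As an illustration of discretization, the sketch below maps ages to the intervals 0-18 and 19-35 from the example above, then counts records per interval as a simple summary. The third interval, "36+", is an assumption added to make the example complete, and the names used are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the closed intervals 0-18 and 19-35.
# Ages above 35 fall into "36+", which is an assumed catch-all interval.
UPPER_BOUNDS = [18, 35]
LABELS = ["0-18", "19-35", "36+"]


def discretize_age(age):
    """Return the label of the interval that contains the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


ages = [5, 18, 19, 35, 36, 70]
age_groups = [discretize_age(a) for a in ages]

# Summarize the data at the interval level instead of per raw value.
group_counts = Counter(age_groups)
print(age_groups)
print(group_counts)
```

`bisect_left` keeps boundary values in the lower interval, so 18 maps to "0-18" and 35 maps to "19-35".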
Definition:
A decision tree is a predictive model that splits data into branches based on conditions at nodes,
2. Select the best attribute for splitting using metrics like Information Gain or Gini Index.
4. Repeat the process recursively until stopping criteria are met (e.g., maximum depth, no significant
gain).
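The attribute selection in step 2 can be sketched as follows. This is a minimal illustration, assuming categorical attributes stored as dictionaries; the toy rows, labels, and function names are invented for the example and do not come from any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class shares."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

chosen = best_attribute(rows, labels, ["outlook", "windy"])
print(chosen, information_gain(rows, labels, "outlook"), gini(labels))
```

On this data, splitting on 'outlook' yields a gain of 1 bit while 'windy' yields 0, so 'outlook' is chosen. A recursive tree builder would call `best_attribute` on each resulting subset until a stopping criterion such as maximum depth or negligible gain is reached.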
Advantages: