1 - Lect 1 & 2 Data Mining
Content
Raw data alone is of little use. It's like having a treasure
chest without a key.
Early Beginnings:
Statistics and Mathematics: The foundation of data mining
lies in statistical methods like regression analysis,
correlation, and probability theory, which have been used for
centuries to analyze data and draw inferences.
1. Data Cleaning
1. Handling missing values: Imputation, deletion, or estimation.
2. Noise removal: Identifying and correcting errors or outliers.
3. Data consistency: Ensuring data uniformity and integrity.
2. Data Integration
1. Combining data from multiple sources: Merging data from
different databases/files.
2. Entity identification: Resolving inconsistencies in naming
conventions.
3. Data redundancy: Eliminating duplicate data.
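The cleaning and integration steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (`name`, `age`), and the helper functions `impute_mean`, `normalize_name`, and `integrate` are all hypothetical, and mean imputation is just one of the strategies listed for missing values.

```python
from statistics import mean

# Hypothetical records from two sources; None marks a missing value.
source_a = [
    {"name": "Alice Smith", "age": 30},
    {"name": "bob  jones", "age": None},
]
source_b = [
    {"name": "ALICE SMITH", "age": 30},
    {"name": "Carol White", "age": 50},
]

def normalize_name(name):
    """Entity identification: lowercase and collapse whitespace so that
    differently formatted names map to the same key."""
    return " ".join(name.lower().split())

def integrate(*sources):
    """Data integration: merge the sources and drop redundant records,
    keeping the first record seen for each normalized name."""
    merged = {}
    for records in sources:
        for record in records:
            key = normalize_name(record["name"])
            merged.setdefault(key, {**record, "name": key})
    return list(merged.values())

def impute_mean(records, field):
    """Data cleaning: fill missing values of `field` with the mean of the
    observed values (assumes at least one value is observed)."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

combined = integrate(source_a, source_b)
cleaned = impute_mean(combined, "age")
```

The sketch removes duplicates before imputing so that a repeated record does not count twice in the mean. Real pipelines usually need fuzzier entity matching than simple case and whitespace normalization.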
KDD Process
3. Data Transformation
1. Normalization: Scaling data to a common range.
2. Aggregation: Combining data into summary representations.
3. Generalization: Creating higher-level concepts from data.
4. Data Reduction
1. Dimensionality reduction: Reducing the number of attributes.
2. Numerosity reduction: Replacing the original data with a
smaller representation.
3. Data compression: Reducing the data size without losing
essential information.
5. Data Mining
1. Pattern discovery: Applying algorithms to extract patterns like
association rules, classification, clustering, regression, etc.
2. Model building: Creating mathematical representations of the
discovered patterns.
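As a rough illustration of transformation followed by mining, the sketch below applies min-max normalization and then a very small one-dimensional k-means (Lloyd's algorithm) as the pattern discovery step. The income values, the starting centroids, and the function names `min_max` and `kmeans_1d` are assumptions made for this example. Clustering is only one of the pattern types listed above.

```python
def min_max(values):
    """Normalization: scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def kmeans_1d(points, centroids, iterations=10):
    """Pattern discovery: Lloyd's k-means on 1-D points.
    Returns the final centroids (the 'model') and the clusters."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

incomes = [20, 22, 25, 80, 85, 90]  # hypothetical values, in thousands
scaled = min_max(incomes)
centroids, clusters = kmeans_1d(scaled, centroids=[0.0, 1.0])
```

Here the two centroids play the role of the "mathematical representation" of the discovered pattern: each new value can be assigned to the group whose centroid is nearest.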
6. Pattern Evaluation
1. Assessing the discovered patterns: Determining the usefulness and
reliability of patterns.
2. Visualization: Creating visual representations of patterns for better
understanding.
7. Knowledge Discovery
1. Interpreting patterns: Translating patterns into actionable insights.
2. Knowledge representation: Presenting insights in a human-
understandable format.
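One common way to assess the usefulness and reliability of a pattern, shown here for an association rule, is to compute its support, confidence, and lift. The sketch below is a minimal example under stated assumptions: the transactions are made up, and the helper names `support` and `evaluate_rule` are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def evaluate_rule(antecedent, consequent, transactions):
    """Score the rule antecedent -> consequent.
    Assumes both sides occur in at least one transaction."""
    s_both = support(antecedent | consequent, transactions)
    s_ante = support(antecedent, transactions)
    s_cons = support(consequent, transactions)
    confidence = s_both / s_ante   # estimate of P(consequent | antecedent)
    lift = confidence / s_cons     # > 1 suggests a positive association
    return {"support": s_both, "confidence": confidence, "lift": lift}

metrics = evaluate_rule({"bread"}, {"milk"}, transactions)
```

For this toy data the rule bread -> milk has a lift below 1, so buying bread does not make milk more likely than its baseline frequency. A rule like this would typically be filtered out during evaluation.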
Research Challenges in KDD
1. Data-Related Challenges
1. Data Quality: Handling missing, inconsistent, and noisy data remains
a significant hurdle.
2. Data Volume and Velocity: Efficiently processing and extracting
knowledge from massive and rapidly changing datasets is challenging.
3. Data Variety: Dealing with diverse data formats (structured,
unstructured, semi-structured) and integrating them for analysis.
4. Data Privacy and Security: Protecting sensitive information while
enabling valuable insights.
2. Algorithmic Challenges
1. Interpretability: Understanding the rationale behind model decisions,
especially for complex models like deep learning.
2. Scalability: Developing algorithms that can handle large-scale
datasets efficiently.
3. Efficiency: Improving the computational efficiency of existing
algorithms.
4. Novelty: Discovering truly novel patterns and insights rather than
reproducing known knowledge.
3. Knowledge Discovery Challenges
1. Knowledge Representation: Effectively capturing and representing
discovered knowledge.
2. Knowledge Integration: Combining knowledge from multiple sources
and perspectives.
3. Knowledge Utilization: Transforming discovered knowledge into
actionable insights.
4. Human-in-the-Loop: Integrating human expertise to guide the
discovery process and validate results.
4. Application-Specific Challenges
1. Domain Expertise: Bridging the gap between data scientists and
domain experts to ensure relevant knowledge discovery.
2. Real-time Analytics: Developing techniques for timely insights from
streaming data.
3. Incidental Knowledge: Discovering unexpected and potentially
valuable patterns.
4. Ethical Considerations: Addressing biases and ensuring fairness in
data mining algorithms.
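To make the volume, velocity, and real-time analytics challenges more concrete, the sketch below keeps a running mean and variance over a stream without storing the data, using Welford's online algorithm. The class name `RunningStats` and the sample readings are assumptions for illustration. It shows one simple single-pass technique and is not meant as a full streaming system.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's algorithm).
    Uses constant memory regardless of how many values arrive."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(reading)
```

Because each update touches only three numbers, the summary is available at any point in the stream, which is the property that timely insights from streaming data depend on.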