Data Mining
•Presents data in a visually appealing and understandable format, such as charts and
graphs, making it easier to interpret insights.
In data mining, data processing (or preprocessing) involves transforming raw data into a usable format by cleaning, integrating, reducing, and transforming it, which ensures data quality and suitability for analysis and model building.
1. Data Cleaning:
• Handling Missing Values: Identifying and addressing missing data points, either by imputation (filling in with estimates) or removal (see the sketch after this list).
• Removing Outliers: Detecting and dealing with extreme values that can skew analysis, either by removing them or transforming them.
• Correcting Inconsistencies: Addressing errors, duplicates, and inconsistencies in the data to ensure accuracy.
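A minimal sketch of these first two cleaning steps, assuming pandas; the data and the column name "age" are made up purely for illustration:

```python
import pandas as pd

# Hypothetical example data; the column name "age" is illustrative only.
df = pd.DataFrame({"age": [23, 25, None, 31, 400, 29]})

# Handling missing values: impute with the column median (alternative: df.dropna()).
df["age"] = df["age"].fillna(df["age"].median())

# Removing outliers: keep only values inside the 1.5 * IQR fences.
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```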
2. Data Integration:
• Combining data from multiple sources into a single, consistent dataset.
3. Data Reduction:
• Dimensionality Reduction: Reducing the number of variables (features) while preserving relevant information (see the sketch below).
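As one hedged illustration, dimensionality reduction could be done with PCA from scikit-learn; the feature matrix below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic feature matrix: 100 samples, 10 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the 3 directions that retain the most variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```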
Data cleaning, also known as data cleansing or data scrubbing, is the process of
identifying and correcting or removing errors, inconsistencies, inaccuracies, and
corrupt records from a dataset. It ensures that data is accurate, consistent, and
usable, which is fundamental for building reliable and effective artificial
intelligence (AI) and machine learning (ML) models.
Common Data Issues Addressed
• Missing Values: Incomplete records that can skew analysis.
• Duplicate Entries: Redundant data that inflates datasets (see the sketch after this list).
• Inconsistent Formatting: Variations in data presentation (e.g., date formats).
• Outliers: Anomalous data points that may distort results.
• Typographical Errors: Mistakes in data entry that lead to inaccuracies.
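A small sketch, assuming pandas, of fixing two of these issues: duplicate entries and inconsistent date formats. The records and column names are hypothetical:

```python
import pandas as pd

# Hypothetical records with a duplicate row and mixed date formats.
df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Ben"],
    "signup":   ["2024-01-05", "2024-01-05", "05/01/2024"],
})

# Duplicate entries: drop exact duplicate rows.
df = df.drop_duplicates()

# Inconsistent formatting: parse mixed date strings into one datetime dtype.
# format="mixed" requires pandas >= 2.0.
df["signup"] = pd.to_datetime(df["signup"], format="mixed", dayfirst=True)
```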
Best Practices for Effective Data Cleaning
2. Noisy Data
Noisy data contains random errors or variance that distort the dataset.
Sources:
•Data entry errors
•Faulty sensors
•Communication errors
•Outliers
Handling Techniques:
1. Smoothing Techniques (see the sketch after this list):
   • Moving average: Replace data with the average of neighboring values.
   • Bin smoothing: Group data into bins and replace values with the bin mean/median.
2. Clustering:
   • Group data into clusters (e.g., using K-Means) and remove or smooth data points that don't fit well.
3. Regression or Model-based methods:
   • Fit a model and use residuals to detect unusual patterns or noise.
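A brief sketch of the two smoothing techniques, assuming pandas; the noisy readings are synthetic and purely illustrative:

```python
import pandas as pd

# Synthetic noisy readings (illustrative only).
s = pd.Series([10, 12, 11, 95, 13, 12, 14, 11, 13, 12])

# Moving average: each value becomes the mean of a sliding 3-value window.
moving_avg = s.rolling(window=3, center=True, min_periods=1).mean()

# Bin smoothing: group values into equal-width bins, then replace each value
# with the mean of its bin.
bins = pd.cut(s, bins=3)
bin_smoothed = s.groupby(bins, observed=True).transform("mean")

print(pd.DataFrame({"raw": s, "moving_avg": moving_avg, "bin_mean": bin_smoothed}))
```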
Data Compression
This refers to encoding data using fewer bits. It can be lossless (no
info lost) or lossy (some data sacrificed for better compression).
Examples:
•ZIP files – Lossless compression of general data
•JPEG, MP3 – Lossy compression for images/audio
•Autoencoders – Neural networks trained to compress and
reconstruct data (learn an efficient encoding)
Techniques:
•Huffman Coding
•Run-Length Encoding
•LZW (Lempel–Ziv–Welch)
•Autoencoders (again — yes, they can be used for compression too!)
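As a simple lossless example, here is a minimal sketch of run-length encoding, one of the techniques listed above; the function names are my own, not from any library:

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (character, count) pairs."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Rebuild the original string from (character, count) pairs."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("aaaabbbcca")
print(encoded)                              # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(encoded) == "aaaabbbcca")  # True: lossless round trip
```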
Numerosity Reduction
Numerosity reduction refers to techniques that reduce the volume of
data by replacing the original data with a smaller representation
that is more compact but still maintains the essential properties and
patterns of the data.
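One common non-parametric route is sampling: keep a random subset that preserves the overall distribution. A minimal sketch, assuming pandas; the dataset below is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic dataset of 100,000 rows (illustrative only).
rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=100_000)})

# Numerosity reduction by simple random sampling: keep 1% of the rows.
sample = df.sample(frac=0.01, random_state=42)

print(len(sample))                                  # 1000
print(df["value"].mean(), sample["value"].mean())   # means stay close
```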
Why Discretize?
Discretization converts a continuous attribute into a small number of discrete intervals (bins).
• Simplifies models and reduces noise
• Helps algorithms that work better with categorical data (like decision trees)
• Enables generation of concept hierarchies (see the sketch after this list)
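A short sketch of discretization with pandas, turning a continuous attribute into labeled bins; the ages, cut points, and labels are illustrative:

```python
import pandas as pd

# Illustrative continuous attribute.
ages = pd.Series([5, 17, 23, 34, 41, 58, 62, 79])

# Discretize into labeled intervals with fixed cut points.
groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                labels=["child", "young_adult", "adult", "senior"])
print(groups.value_counts())
```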
Concept Hierarchy Generation
Concept hierarchy generation involves organizing data from low-level concepts (raw data) up to higher-level concepts, i.e., increasingly coarse levels of granularity.
Example Hierarchy:
For the attribute Location:
City → State → Country → Continent
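A minimal sketch of climbing such a hierarchy in code, assuming a small hand-written lookup table; the places and figures are illustrative only:

```python
import pandas as pd

# Hypothetical city-level records.
df = pd.DataFrame({"city": ["Pune", "Mumbai", "Lyon", "Paris"],
                   "sales": [120, 340, 90, 210]})

# Hand-written concept hierarchy step: City -> Country.
city_to_country = {"Pune": "India", "Mumbai": "India",
                   "Lyon": "France", "Paris": "France"}

# Roll the data up one level by mapping cities to countries and aggregating.
df["country"] = df["city"].map(city_to_country)
print(df.groupby("country")["sales"].sum())
```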