DWDM Unit 3
Data mining is the process of discovering patterns, correlations, trends, or useful information
from large datasets using statistical, machine learning, and database techniques. It is an essential
step in the knowledge discovery in databases (KDD) process, aiming to transform raw data into
meaningful insights.
Need for Data Mining:
Explosion of Data: The rapid growth of data from business, science, and online sources
necessitates tools to extract useful knowledge.
Decision Support: Helps organizations make informed decisions by uncovering hidden
patterns and trends.
Competitive Advantage: Businesses use data mining to gain insights into customer
behavior, optimize operations, and develop targeted marketing strategies.
Technological Advancement: Improved computing power and storage make large-scale
data analysis feasible.
"The non-trivial extraction of implicit, previously unknown, and potentially useful information
from data." It involves analyzing data from different perspectives and summarizing it into
actionable insights. It's often referred to as the science of extracting useful knowledge from large
volumes of data.
Data mining offers a wide range of functionalities, generally categorized into two types:
a. Descriptive Functions: characterize the general properties of the data (e.g., clustering,
association rule mining, summarization).
b. Predictive Functions: perform inference on the current data in order to make predictions
(e.g., classification, regression).
Data Processing refers to the series of operations performed on raw data to convert it into
meaningful information. It is a vital step in data mining and analytics, ensuring that the data is
clean, structured, and ready for analysis.
1. Data Collection
o Gathering raw data from various sources such as databases, files, sensors, or
online sources.
2. Data Cleaning
o Removing inaccuracies, inconsistencies, and missing values.
o Ensures the data is accurate and complete.
3. Data Integration
o Merging data from multiple sources into a coherent dataset.
o Handles data redundancy and conflicts.
4. Data Transformation
o Converting data into appropriate formats.
o Includes normalization, encoding, and scaling.
5. Data Reduction
o Reducing data volume while maintaining relevant information.
o Techniques include dimensionality reduction and data compression.
6. Data Storage and Retrieval
o Storing processed data in a structured format (e.g., databases).
o Enables easy access for analysis and mining.
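The six stages above can be tied together in a small pandas sketch; the file names and column
names (sales_raw.csv, regions.csv, amount, region) are hypothetical and only illustrate the flow.

```python
import pandas as pd

# 1. Data collection: load raw records (hypothetical CSV file).
raw = pd.read_csv("sales_raw.csv")    # columns assumed: customer_id, region, amount, date

# 2. Data cleaning: drop duplicates and fill missing amounts with the column mean.
clean = raw.drop_duplicates()
clean["amount"] = clean["amount"].fillna(clean["amount"].mean())

# 3. Data integration: merge with a second source on a shared key.
regions = pd.read_csv("regions.csv")  # columns assumed: region, country
merged = clean.merge(regions, on="region", how="left")

# 4. Data transformation: min-max normalization of the numeric column.
amin, amax = merged["amount"].min(), merged["amount"].max()
merged["amount_norm"] = (merged["amount"] - amin) / (amax - amin)

# 5. Data reduction: keep only the columns needed for analysis.
reduced = merged[["customer_id", "country", "amount_norm"]]

# 6. Data storage: persist the processed data for later retrieval.
reduced.to_csv("sales_processed.csv", index=False)
```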
Data Pre-processing is the step in data mining where raw data is prepared for analysis. It
improves the quality, consistency, and structure of the data, making it suitable for mining tasks.
Data Cleaning is a critical part of data pre-processing, and handling missing values is one of its
most important tasks. Missing values can reduce the quality of analysis and may lead to biased or
inaccurate results if not properly addressed.
If not properly handled, missing values can cause:
Incomplete analysis.
Errors in statistical models.
Inaccurate predictions.
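A minimal sketch of the two most common ways of dealing with missing values (dropping
incomplete rows, or imputing with the column mean), assuming pandas and NumPy are available;
the toy table is illustrative.

```python
import pandas as pd
import numpy as np

# Toy dataset with missing entries (NaN).
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "income": [32000, 41000, np.nan, 58000, 39000],
})

# Option 1: drop rows that contain any missing value.
dropped = df.dropna()

# Option 2: impute missing values with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```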
Noisy Data:
Noisy data refers to data that contains errors, random variations, or irrelevant information. It can
arise from faulty sensors, human errors, or inconsistencies in data entry and transmission. Noisy
data can significantly impact the accuracy and reliability of data mining results.
Techniques for Handling Noisy Data:
1. Binning
Definition: A smoothing technique that groups data into "bins" and smooths it by
replacing values in a bin with a representative value (like the mean or boundary).
Types:
o Smoothing by bin means: Replaces values in each bin with the mean.
o Smoothing by bin medians: Uses the median value.
o Smoothing by bin boundaries: Replaces values with bin edges (min or max).
Use: Effective for reducing random noise in numeric data (a sketch follows this list).
2. Clustering
Definition: Groups similar data objects into clusters; noisy data often appears as outliers
that do not belong to any cluster.
Approach: After clustering, data points far from their cluster centers can be considered
noise and either removed or adjusted.
Use: Suitable for large datasets where noise forms small, distinct groups or outliers.
3. Regression
Definition: Uses regression analysis to fit data to a function (linear or nonlinear) and
identifies points that deviate significantly from the fitted model as noise.
Example: Linear regression can smooth a time-series dataset by fitting a trend line.
Use: Ideal when there's a known relationship between variables.
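A minimal sketch of two of the smoothing techniques above, assuming NumPy: smoothing by bin
means (each value is replaced by the mean of its bin) and regression smoothing (a fitted straight
line stands in for the noisy values). The sample values are illustrative.

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# --- Smoothing by bin means: split the sorted data into equal-size bins
# and replace every value in a bin with that bin's mean.
bin_size = 4
smoothed = values.copy()
for start in range(0, len(values), bin_size):
    end = start + bin_size
    smoothed[start:end] = values[start:end].mean()
print(smoothed)   # e.g. first bin [4, 8, 9, 15] -> 9.0

# --- Regression smoothing: fit a straight line (degree-1 polynomial)
# and use the fitted trend values in place of the noisy observations.
x = np.arange(len(values))
slope, intercept = np.polyfit(x, values, deg=1)
trend = slope * x + intercept
print(trend)
```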
Inconsistent Data:
Inconsistent data refers to data that shows contradictions, discrepancies, or does not follow a
uniform format across the dataset. It often arises when data is collected from multiple sources,
entered manually, or not properly validated.
Examples: the same date stored as "12-05-2024" in one record and "2024-05-12" in another, or a
country recorded as "USA", "U.S.A.", and "United States" across different sources.
Techniques for Handling Inconsistent Data:
1. Standardization
o Converting data to a common format (e.g., using ISO date formats).
o Ensures uniformity across the dataset.
2. Validation Rules
o Enforcing predefined rules at the time of data entry (e.g., dropdowns, input masks).
o Helps prevent inconsistencies from entering the dataset.
3. Data Cleaning Tools
o Use software or scripts to identify and correct inconsistencies (e.g., OpenRefine,
Python scripts).
4. Manual Review
o Human inspection for complex inconsistencies where automated correction may
not be possible.
5. Reference Matching
o Matching data against a trusted reference database or master list to ensure
correctness (e.g., postal codes, country names).
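A small sketch of standardization and reference matching as described above, assuming pandas;
the date strings and the country master list are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "country":    ["USA", "U.S.A.", "United States", "India"],
    "order_date": ["2024/05/12", "2024-05-12", "May 12, 2024", "01 June 2024"],
})

# Standardization: convert every date representation to a single ISO format.
df["order_date"] = df["order_date"].apply(lambda s: pd.to_datetime(s).strftime("%Y-%m-%d"))

# Reference matching: map known variants onto a trusted master list of country names.
country_master = {"USA": "United States", "U.S.A.": "United States",
                  "United States": "United States", "India": "India"}
df["country"] = df["country"].map(country_master)

print(df)
```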
These are key steps in the data pre-processing phase of data mining, used to combine and
convert data into a consistent and usable format for analysis.
1. Data Integration
Definition:
Data Integration is the process of combining data from multiple sources (databases, files,
systems) into a unified and coherent view.
Goals:
Provide a single, unified, and consistent view of data spread across heterogeneous sources.
Reduce redundancy and resolve conflicts between sources.
Challenges:
Schema mismatch (e.g., different column names for the same data).
Data value conflicts (e.g., different formats or units).
Duplicate data from different sources.
Techniques:
Schema integration: matching equivalent attributes across sources.
Entity identification: recognizing that records from different sources refer to the same
real-world entity.
Redundancy and conflict detection: e.g., using correlation analysis and unit conversion.
Example:
Combining customer data from the sales department (CRM system) and support department
(ticketing system) into a single customer profile.
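The customer-profile example above might look like the following pandas sketch; the table and
column names (customer_id, cust_id, open_tickets) are assumptions for illustration.

```python
import pandas as pd

# Customer records from the CRM (sales) system.
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name":        ["Asha", "Ravi", "Meena"],
    "total_sales": [1200, 450, 980],
})

# Customer records from the support ticketing system.
support = pd.DataFrame({
    "cust_id":      [101, 103, 104],
    "open_tickets": [2, 0, 1],
})

# Resolve the schema mismatch (customer_id vs cust_id) and merge the
# two sources into a single customer profile.
profile = crm.merge(support.rename(columns={"cust_id": "customer_id"}),
                    on="customer_id", how="outer")
print(profile)
```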
2. Data Transformation
Definition:
Data Transformation involves converting data into a suitable format or structure for analysis and
mining.
Common Transformation Methods:
Normalization
o Scaling data values to a small, standard range (e.g., 0 to 1).
o Useful in algorithms like K-means or neural networks.
Aggregation
o Summarizing data (e.g., daily sales → monthly sales).
Generalization
o Replacing detailed data with higher-level concepts (e.g., "age 23" → "Young
Adult").
Encoding
o Converting categorical data into numeric form (e.g., one-hot encoding).
Smoothing
o Removing noise from data using techniques like binning or moving averages.
Discretization
o Converting continuous data into discrete intervals or categories.
Importance:
Transformation puts the data into a form that mining algorithms can consume directly, which
improves both the accuracy of the results and the efficiency of the mining process.
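A brief sketch of three of the transformations listed above (min-max normalization, one-hot
encoding, and generalization of age into groups), assuming pandas; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 45, 67, 31],
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "income": [30000, 52000, 61000, 40000],
})

# Normalization: rescale income to the range [0, 1] (min-max normalization).
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Encoding: convert the categorical city column into one-hot indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Generalization / discretization: replace exact ages with broader groups.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["Young Adult", "Middle-aged", "Senior"])
print(df)
```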
Dimensionality Reduction
Definition:
Dimensionality reduction is the process of reducing the number of input variables (features) in a
dataset while preserving the essential information. This is particularly useful in situations where
there are many variables, making the data difficult to analyze, visualize, or compute.
Techniques of Dimensionality Reduction:
1. Principal Component Analysis (PCA)
Definition: PCA is a statistical technique that transforms a set of correlated variables into
a smaller set of uncorrelated variables called principal components. These components
capture the maximum variance in the data.
How it works:
o PCA identifies the "directions" (principal components) in the data where the
variation is the highest.
o It projects the data into these directions, thereby reducing the number of features
while retaining the essential information.
Use: PCA is widely used in image processing, face recognition, and reducing features in
machine learning (see the sketch after this list).
4. Autoencoders
Definition: Autoencoders are neural networks designed to learn an efficient
representation (encoding) of input data. They consist of an encoder, which reduces
dimensionality, and a decoder, which reconstructs the original data from the encoding.
How it works: The autoencoder network is trained to minimize the difference between
the input and the reconstructed output, forcing it to learn an efficient low-dimensional
representation.
Use: Commonly used in image and speech data, and for feature extraction in deep
learning.
5. Feature Selection
Definition: Feature selection is the process of selecting a subset of the most relevant
features (variables) in a dataset. Unlike dimensionality reduction techniques that
transform features, feature selection simply removes irrelevant or redundant features.
Methods:
o Filter methods: Statistical techniques that evaluate each feature's relevance
independently of any machine learning model (e.g., correlation coefficient).
o Wrapper methods: Use a machine learning model to evaluate feature subsets by
training the model with different feature combinations.
o Embedded methods: Techniques like LASSO that perform feature selection
during model training.
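A minimal sketch of PCA and a filter-style feature selection, assuming scikit-learn and NumPy
are available; the random matrix stands in for a real feature set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels depend on the first two features

# PCA: project the 10 features onto the 3 directions of highest variance.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                       # (100, 3)
print(pca.explained_variance_ratio_)

# Filter-based feature selection: keep the 2 features most related to the
# class label according to an ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_sel = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```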
Challenges of Dimensionality Reduction:
Information Loss: Some methods may lose useful information when reducing dimensions.
Interpretability: Reduced features may be difficult to interpret, especially in techniques
like PCA.
Non-linearity: Some non-linear relationships may be difficult to capture using linear
methods like PCA.
Data Compression
Definition:
Data compression is the process of reducing the size of a dataset or file, enabling efficient
storage, transmission, and processing. Compression works by eliminating redundancies or
encoding data in a more compact format.
1. Lossless Compression
o Definition: Lossless compression reduces file size without losing any
information. The original data can be perfectly reconstructed from the compressed
data.
o How It Works: It identifies and encodes repetitive patterns, reducing redundancy
without sacrificing data integrity.
o Examples:
ZIP (for files).
PNG (for images).
FLAC (for audio).
o Use Case: Used when the exact original data is needed, such as in text files,
program code, or medical data.
2. Lossy Compression
o Definition: Lossy compression reduces file size by discarding some data, usually
data deemed unnecessary or imperceptible to humans.
o How It Works: It sacrifices precision and removes less critical information (e.g.,
minor details in images or audio).
o Examples:
JPEG (for images).
MP3 (for audio).
MPEG (for video).
o Use Case: Used in applications where a slight loss in quality is acceptable in
exchange for reduced file size, such as for streaming media or web images.
Compression Techniques:
1. Huffman Coding
o Definition: A lossless compression algorithm that assigns variable-length codes
to input characters, with shorter codes assigned to more frequent characters.
o How It Works: It builds a binary tree where the most frequent symbols are at the
top and assigned shorter codes, while less frequent symbols have longer codes.
o Use Case: Often used in file compression formats like ZIP or in image
compression like JPEG (a Python sketch of Huffman coding and RLE follows this list).
2. Run-Length Encoding (RLE)
o Definition: A simple lossless compression technique where consecutive identical
elements are replaced with a single value and its count.
o Example: The sequence "AAAABBB" becomes "4A3B".
o Use Case: Used in image formats like BMP or TIFF where large areas of identical
pixels occur.
3. Dictionary-based Compression (LZ77, LZ78)
o Definition: These are lossless compression algorithms that replace repetitive
strings of data with references to a dictionary of previously seen strings.
o How It Works: The algorithm builds a dictionary of frequently occurring
sequences and encodes them as references.
o Examples:
LZ77: Used in formats like ZIP and GZIP.
LZ78: Used in LZW compression, like in GIF images.
o Use Case: Common in general-purpose file compression and web graphics.
4. Transform Coding (for Images and Audio)
o Definition: A lossy compression technique where data is transformed (e.g., using
a Fourier or wavelet transform) and then quantized to remove less important
information.
o How It Works: In image compression, this might involve breaking the image into
frequency components and discarding high-frequency components (which the
human eye can't easily detect).
o Examples: Used in JPEG (image) and MP3 (audio).
o Use Case: Used for multimedia formats where file size reduction is critical and
minor loss of quality is acceptable.
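The two simplest lossless techniques above, run-length encoding and Huffman coding, can be
sketched in plain Python (using heapq to build the code tree). This is an illustrative
implementation, not the exact encoder used by ZIP or JPEG.

```python
import heapq
from collections import Counter

def run_length_encode(text):
    """Replace runs of identical characters with <count><char> pairs."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

def huffman_codes(text):
    """Build a Huffman code table: frequent characters get shorter codes."""
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, [(char, code), ...]).
    heap = [(f, i, [(ch, "")]) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix '0' to codes in the left subtree and '1' to the right subtree.
        merged = [(ch, "0" + code) for ch, code in left] + \
                 [(ch, "1" + code) for ch, code in right]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2]) if heap else {}

print(run_length_encode("AAAABBB"))      # -> 4A3B
codes = huffman_codes("ABRACADABRA")
print(codes)                             # e.g. 'A' (most frequent) gets the shortest code
encoded = "".join(codes[ch] for ch in "ABRACADABRA")
print(encoded, len(encoded), "bits")
```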
Benefits of Data Compression:
Reduced Storage Requirements: Compressed data occupies less space, which is crucial
in data storage and transmission.
Faster Data Transfer: Smaller file sizes mean quicker download/upload times and less
bandwidth consumption.
Cost Efficiency: Reduces storage costs and network usage, especially with large datasets.
Numerosity Reduction
Definition:
Numerosity reduction refers to the process of reducing the number of data points in a dataset
while preserving essential information. The goal is to simplify the dataset, making it more
manageable for analysis and modeling. This process helps to reduce the computational
complexity and can enhance performance, particularly in high-dimensional data.
Techniques of Numerosity Reduction:
2. Sampling
Definition: Sampling involves selecting a representative subset of the entire dataset. The
goal is to preserve the characteristics of the original data while working with a smaller
portion of it.
Types of Sampling:
o Simple Random Sampling: Randomly selects a subset of data.
o Stratified Sampling: Ensures that each class or group in the data is represented
proportionally in the sample.
o Systematic Sampling: Selects data points at regular intervals from a list.
Use: Useful when working with large datasets, where processing the entire dataset would
be too costly or time-consuming (see the sketch after this list).
3. Clustering
Definition: Clustering is the process of grouping similar data points together into
clusters. Once the data is grouped, each cluster can be represented by a centroid or
representative point, reducing the overall number of points needed to represent the data.
How It Works: Instead of working with every individual data point, the centroid of each
cluster is used as a summary of all points in that cluster.
Use: Clustering is commonly used in data mining and machine learning to identify
patterns and structure within the data (e.g., customer segmentation).
4. Aggregation
Definition: Aggregation combines groups of data points into summary values (e.g., daily
sales records rolled up into monthly totals), reducing the number of records while
preserving the overall trends in the data.
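A sketch of the sampling and clustering approaches described above, assuming pandas and
scikit-learn; the synthetic customer table is purely illustrative.

```python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "segment": rng.choice(["retail", "corporate"], size=1000, p=[0.8, 0.2]),
    "spend":   rng.normal(loc=500, scale=120, size=1000),
    "visits":  rng.poisson(lam=4, size=1000),
})

# Simple random sampling: keep 10% of the rows.
random_sample = df.sample(frac=0.10, random_state=1)

# Stratified sampling: keep 10% of each segment so class proportions are preserved.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.10, random_state=1)

# Clustering-based reduction: represent all 1000 points by 5 cluster centroids.
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(df[["spend", "visits"]])
centroids = pd.DataFrame(km.cluster_centers_, columns=["spend", "visits"])

print(len(random_sample), len(stratified), len(centroids))
```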
Challenges of Numerosity Reduction:
Information Loss: While reducing data, some information may be lost, which could
affect the accuracy of the analysis or model.
Choosing the Right Technique: The appropriate numerosity reduction technique
depends on the nature of the data and the specific objectives of the analysis.
Bias in Sampling: If not done properly, sampling may introduce bias that can affect the
representativeness of the dataset.
Discretization
Definition:
Discretization is the process of converting continuous data into discrete intervals or categories.
This is often done in data preprocessing, especially when the dataset contains continuous
numerical attributes and the analysis requires discrete values (e.g., for classification algorithms).
Why Discretization Is Needed:
Machine Learning Algorithms: Some algorithms (like decision trees) perform better
with discrete data.
Simplifying Data: Converting continuous attributes into intervals helps simplify analysis
and decision-making.
Data Interpretation: Discretized data is easier to interpret and can provide clearer
insights for analysis or reporting.
Methods of Discretization:
Equal-width (distance) binning: divides the attribute's range into intervals of equal size.
Equal-frequency (equal-depth) binning: divides the values so that each interval contains
roughly the same number of records.
Clustering-based discretization: groups values into clusters and treats each cluster as one
interval.
Entropy-based (supervised) discretization: chooses interval boundaries that best separate
the class labels.
(A short sketch of the first two methods appears at the end of this section.)
Benefits of Discretization:
Simplifies Data: Reduces the complexity of continuous data and makes it easier for
certain algorithms to process.
Improves Performance: Certain machine learning algorithms perform better with
discrete data (e.g., decision trees).
Interpretability: Makes data easier to understand and analyze, especially for non-
technical users.
Challenges of Discretization:
Loss of Information: Converting continuous data to discrete values may lead to loss of
precision.
Choice of Method: Selecting the right discretization method depends on the distribution
of the data and the specific task.
Handling Outliers: Some discretization methods may be sensitive to outliers, affecting
the quality of the intervals.
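A short sketch of equal-width and equal-frequency discretization, assuming pandas; the age
values are illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 35, 41, 48, 52, 60, 67, 71, 79])

# Equal-width binning: the age range is split into 3 intervals of equal size.
equal_width = pd.cut(ages, bins=3, labels=["low", "medium", "high"])

# Equal-frequency binning: each interval holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```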
Concept Hierarchy Generation
Definition:
Concept hierarchy generation is the process of creating a hierarchy of concepts that abstracts data
to higher levels of generality. This is often used in data mining to summarize or categorize data
into different levels of abstraction, from more specific to more general.
1. Taxonomy-based Generation
o Definition: Taxonomy-based generation uses predefined hierarchical structures
(like categories or classifications) to group data into different levels of
abstraction.
o How It Works: Categories like "Animal" → "Mammals" → "Dogs" form a
hierarchy where each level represents a more specific concept.
o Use Case: Used in structured data such as product catalogs or biological
classifications.
2. Attribute-oriented Induction (AOI)
o Definition: AOI involves generalizing attributes by replacing specific values with
more general ones, typically using a predefined hierarchy.
o How It Works: For example, the age attribute might be generalized as "Young,"
"Middle-aged," and "Senior" based on age ranges.
o Use Case: Often used in rule-based mining or data summarization tasks (see the
sketch after this list).
3. Clustering-based Hierarchy Generation
o Definition: Clustering-based hierarchy generation uses clustering algorithms to
group data into clusters, which are then organized into hierarchical structures.
o How It Works: Data points that share similarities are grouped together in lower
levels, and these groups are then further generalized into broader categories.
o Use Case: Common in unsupervised learning, where the goal is to discover
hidden patterns in data.
4. Domain-specific Hierarchies
o Definition: These are hierarchies based on domain knowledge, where concepts
are grouped based on their relationships and characteristics in a specific domain
(e.g., business, geography).
o How It Works: A business might define a hierarchy where “Product” →
“Electronics” → “Smartphones” is a specific category, based on its domain
knowledge.
o Use Case: Useful in fields like business analytics or natural language processing.
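The age-generalization example used for attribute-oriented induction above can be sketched as a
simple mapping; the age boundaries chosen here are assumptions.

```python
def generalize_age(age):
    """Replace a specific age with a more general concept (AOI-style generalization)."""
    if age < 30:
        return "Young"
    elif age < 60:
        return "Middle-aged"
    return "Senior"

ages = [23, 37, 45, 61, 72]
print([generalize_age(a) for a in ages])
# -> ['Young', 'Middle-aged', 'Middle-aged', 'Senior', 'Senior']
```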
Benefits of Concept Hierarchy Generation:
Improved Data Summarization: Reduces the complexity of data and provides higher-
level insights.
Better Decision Making: Simplifies decision-making processes by enabling analysis at
different abstraction levels.
Enhanced Interpretability: Makes data easier to understand for different stakeholders
by presenting it in an organized hierarchical structure.
Decision Tree:
Definition:
A Decision Tree is a supervised machine learning algorithm that is used for classification and
regression tasks. It builds a model in the form of a tree structure, where each internal node
represents a decision based on a feature, each branch represents the outcome of that decision, and
each leaf node represents a class label or value. Decision trees are popular for their simplicity
and interpretability.
Components of a Decision Tree:
Root Node: The topmost node in the tree that represents the entire dataset. It is split into
branches based on some condition.
Internal Nodes: These are nodes that test an attribute and represent a decision or
condition. Each internal node splits the data into two or more branches.
Leaf Nodes: These are the terminal nodes that provide the output, such as the predicted
class or value.
Branches: The edges between nodes represent the decision outcomes or conditions that
lead to the next node.
How a Decision Tree Works:
1. Start at the Root: The algorithm begins with the entire dataset at the root node.
2. Feature Selection: At each node, a feature (attribute) is selected to split the data. The
goal is to choose the feature that best separates the data into distinct classes (in
classification tasks) or values (in regression tasks).
3. Recursive Splitting: The dataset is split into subsets based on the chosen feature. This
process repeats recursively for each subset until one of the stopping criteria is met.
4. Stopping Criteria: Stopping criteria can include:
o A pre-defined depth of the tree.
o A minimum number of samples required to split a node.
o The data in a node cannot be further split (pure node).
o The node reaches a threshold of data purity (all samples belong to the same class).
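A minimal decision-tree sketch, assuming scikit-learn and its built-in iris dataset; the max_depth
and min_samples_split settings correspond to the stopping criteria described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Stopping criteria: limit the tree depth and require a minimum number of
# samples before a node may be split further.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
tree.fit(X_train, y_train)

print("accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=load_iris().feature_names))
```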