Data Transformation
3. Decimal Scaling: This method normalizes the data by moving the decimal point of each
value: v' = v / 10^j, where j is the smallest integer such that the maximum absolute value
of the transformed data is less than 1.
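A minimal sketch of decimal scaling, assuming the usual formulation v' = v / 10^j with j the smallest non-negative integer that brings every absolute value below 1. The function name `decimal_scale` and the sample values are illustrative, not taken from the slides.

```python
def decimal_scale(values):
    """Return (scaled_values, j) where scaled = v / 10**j and max(|scaled|) < 1."""
    max_abs = max((abs(v) for v in values), default=0)
    j = 0
    # Increase j until the largest absolute value drops below 1.
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986, 917, 45])
print(j)       # 3
print(scaled)  # [-0.986, 0.917, 0.045]
```

The largest absolute value here is 986, so the decimal point moves three places (j = 3). A value of exactly 1000 would need j = 4, because the scaled maximum must be strictly less than 1.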
Discretization by Binning
4. Cluster-based Binning:
• This method groups data into clusters using clustering algorithms like K-means.
Each cluster represents a bin, and values within each cluster are assigned to the
same bin.
• Cluster-based binning is useful when data naturally forms groups or clusters.
Example:
If the data points [2, 5, 7, 9, 12, 15, 18, 21] naturally group into the clusters [2, 5, 7],
[9, 12, 15], and [18, 21], each cluster forms one bin.
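The idea can be sketched with a small one-dimensional K-means loop. This is an illustrative implementation rather than a library routine: the function name `kmeans_1d` and the deterministic starting centroids (evenly spaced points of the sorted data) are assumptions made for this example.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values with a basic k-means loop; return one bin (list) per cluster."""
    data = sorted(values)
    n = len(data)
    # Assumption: start from k evenly spaced points of the sorted data (deterministic).
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(max_iter):
        bins = [[] for _ in range(k)]
        for v in data:
            # Assign each value to its nearest centroid (ties go to the lower index).
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            bins[nearest].append(v)
        # Recompute each centroid as the mean of its bin; keep the old one if the bin is empty.
        new_centroids = [sum(b) / len(b) if b else centroids[i] for i, b in enumerate(bins)]
        if new_centroids == centroids:
            break  # assignments are stable, so the clustering has converged
        centroids = new_centroids
    return bins


bins = kmeans_1d([2, 5, 7, 9, 12, 15, 18, 21], k=3)
print(bins)  # [[2, 5, 7], [9, 12, 15], [18, 21]]
```

With k = 3 the loop settles on the three groups from the example. In practice the result depends on k and on the starting centroids, so a different initialization may produce different bins.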
Advantages of Binning:
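• Noise Reduction: Replacing values with their bin smooths out small fluctuations.
• Improved Model Performance: Smoothed, discretized features can improve the accuracy of
some models.
• Interpretability: A few labeled ranges are simpler to read than raw values.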
Disadvantages of Binning:
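• Loss of Information: Values in the same bin are generalized and can no longer be told
apart.
• Choice of Number of Bins: The number of bins is often chosen arbitrarily, and it affects
the result.
• Sensitive to Outliers: Extreme values can stretch the bin boundaries and distort how the
remaining values are grouped.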
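The outlier problem is easy to see with a small sketch. It assumes equal-width bins, chosen here only because they make the distortion obvious; the helper name `equal_width_counts` and the sample values are hypothetical.

```python
def equal_width_counts(values, k):
    """Count how many values fall into each of k equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value would land at index k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


print(equal_width_counts([1, 2, 3, 4, 5], k=4))    # [1, 1, 1, 2]
print(equal_width_counts([1, 2, 3, 4, 100], k=4))  # [4, 0, 0, 1]
```

A single extreme value (100) widens every bin, so the four ordinary values collapse into the first bin and two bins stay empty.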
Concept Hierarchy Generation for Nominal Data
What is Nominal Data?
Nominal data is a type of categorical data whose values are distinct categories with no
inherent order or ranking.
The values are labels or names that identify categories, and there is no meaningful way to
sort them.
Here are some key features of nominal data:
• Categories with no order: The values are simply different from each other; there's
no ranking or order between them, so you can't say one category is "greater" or "less"
than another.
• No arithmetic: You can't add, subtract, or average nominal values. You can only check
whether two values belong to the same category and count how often each category occurs.
• Labels: The values serve only as names that assign items to categories.
Examples of Nominal Data:
Generalized Animal
-------------------
Mammals
Mammals
Wild Animals
Wild Animals
Why is Concept Hierarchy Important?
• Simplifies the Data: It helps in reducing complexity by grouping detailed categories into
broader, more generalized concepts.
• Improves Understanding: It makes it easier to understand patterns or trends in the data,
because you can analyze the data at a higher level (e.g., analyzing "Mammals" rather than
individual animals like "Dog" and "Cat").
• Data Mining Efficiency: Generalizing the data lets algorithms work more efficiently because
they handle fewer distinct categories instead of every small category separately.
• Better Insights: It helps to see relationships that might not be obvious when working with
individual categories.
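A minimal sketch of generalizing nominal values through a one-level concept hierarchy. The "Dog" and "Cat" to "Mammals" mapping follows the example in the text; the "Lion" and "Tiger" entries, the sample list, and the names `ANIMAL_HIERARCHY` and `generalize` are hypothetical and added only for illustration.

```python
from collections import Counter

# Hypothetical hierarchy: "Dog" and "Cat" generalize to "Mammals" as in the text;
# the "Lion" and "Tiger" entries are invented for illustration.
ANIMAL_HIERARCHY = {
    "Dog": "Mammals",
    "Cat": "Mammals",
    "Lion": "Wild Animals",
    "Tiger": "Wild Animals",
}


def generalize(values, hierarchy):
    """Replace each detailed category with its parent; unknown values are kept unchanged."""
    return [hierarchy.get(v, v) for v in values]


animals = ["Dog", "Cat", "Lion", "Dog", "Tiger"]
generalized = generalize(animals, ANIMAL_HIERARCHY)
print(generalized)
print(Counter(generalized))
```

After generalization, four distinct animal labels reduce to two higher-level categories, so frequency counts and later analysis run on the broader concepts.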