Unit 1
Overview
In today’s rapidly changing corporate environment, organizations are turning to cloud-based technologies
for convenient data collection, reporting, and analysis. This is where Data Warehousing comes in as a
core component of business intelligence that enables businesses to enhance their performance. It is
important to understand what a data warehouse is and why it continues to evolve in the global marketplace.
The concept of data warehouses first came into use in the 1980s when IBM researchers Paul Murphy and
Barry Devlin developed the business data warehouse. American computer scientist Bill Inmon is
considered the “father” of the data warehouse due to his authorship of several works, such as the
Corporate Information Factory, on the building, usage, and maintenance of the data warehouse.
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data
helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes every day because of the transactions that take
place. Suppose a business executive wants to analyze previous feedback on a product, a supplier, or any
consumer data; the executive will have no historical data available to analyze, because the previous data
has been overwritten by later transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with
this generalized and consolidated view of data, a data warehouse also provides Online Analytical
Processing (OLAP) tools. These tools help us analyze data interactively and effectively in a
multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can be integrated with
OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is
why the data warehouse has become an important platform for data analysis and online analytical
processing.
A data warehouse is an information system that contains historical and cumulative data from single or
multiple sources. It simplifies the organization's reporting and analysis processes. It also provides a single
version of truth for the company's decision making and forecasting.
Data warehouses serve as a central repository for storing and analyzing information to support
better-informed decisions. An organization's data warehouse receives data, typically on a regular basis,
from a variety of sources, including transactional systems, relational databases, and other systems.
A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-interface applications; as well as external partner
systems. This data is then made available for decision-makers to access and analyze. So what is a data
warehouse? For a start, it is a comprehensive repository of current and historical information that is
designed to enhance an organization’s performance.
A data warehouse is a database, which is kept separate from the organization's operational database.
It possesses consolidated historical data, which helps the organization to analyze its business.
A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
1. Database
Analytics database- These databases help manage and sustain the analytics of the stored data.
Cloud-based database- Here, the databases are hosted and retrieved on the cloud, so
that you do not have to acquire hardware to set up a data warehouse.
Typical relational databases- These are row-oriented databases used on a routine basis.
2. ETL (Extraction, Transformation, Loading) Tools
ETL tools are central components of a data warehouse: they extract data from various
sources, transform it into a suitable arrangement, and load it into the data warehouse. They
allow you to extract data, fill in missing data, highlight data distribution from the central
repository to Business Intelligence (BI) applications, and more. Simply put, data is pulled from
sources and reshaped for fast analytical consumption. This is carried out with a variety of data
integration strategies, including ETL, ELT, bulk-load processing, real-time data replication,
data quality checks, data transformation, and more.
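To make the ETL flow concrete, here is a minimal, self-contained Python sketch: it extracts rows from an in-memory CSV source, transforms them (dropping rows with missing amounts, casting types, standardizing case), and loads them into a SQLite table standing in for the warehouse. The column names (order_id, amount, region) and values are assumptions made only for this illustration, not part of any particular product's workflow.

import csv, sqlite3, io

# Extract: read rows from a (here, in-memory) CSV source
source = io.StringIO("order_id,amount,region\n1,100.5,North\n2,,South\n3,80.0,North\n")
rows = list(csv.DictReader(source))

# Transform: drop rows with missing amounts, cast types, standardize region case
clean = [
    (int(r["order_id"]), float(r["amount"]), r["region"].upper())
    for r in rows if r["amount"]
]

# Load: insert the transformed rows into a SQLite table acting as the warehouse
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (order_id INTEGER, amount REAL, region TEXT)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean)
print(con.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region").fetchall())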
3. Metadata
Metadata is often described as ‘data about your data.’ It is one of the major components of a data
warehouse. Metadata tells you everything about the usage, values, source, and other features
of the data sets in the warehouse. Technical metadata describes how to access the data, where it
resides, and how it is structured, while business metadata adds business context and gives
users a simple understanding of the information in the warehouse.
Metadata offers users interactive and easy access, which helps them understand the content
and find the data they need. Metadata management is carried out through a repository and
accompanying software. The software runs on a workstation, maps the source data to the
target database, generates code, and controls the movement of operational data into the
warehouse.
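As a concrete illustration, the technical and business metadata for a single warehouse table can be kept as a simple structure. The Python sketch below uses purely hypothetical field names (source_system, load_frequency, business_name) to show the kind of information a metadata repository records:

# A sketch of metadata for one warehouse table; all field names and
# values are illustrative assumptions, not a standard metadata format.
sales_fact_metadata = {
    "table": "sales_fact",
    "source_system": "orders_oltp",           # where the data comes from
    "load_frequency": "daily",                 # how often it is refreshed
    "last_loaded": "2024-01-31T02:00:00",
    "columns": {
        "rs_sold":    {"type": "DECIMAL(12,2)", "business_name": "Revenue (Rs)"},
        "units_sold": {"type": "INTEGER",       "business_name": "Units Sold"},
    },
}

# Business users can look up the friendly name of a technical column
print(sales_fact_metadata["columns"]["rs_sold"]["business_name"])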
4. Access Tools
Data warehouses use a group of databases as their primary base. However, end users in a data
warehouse organization cannot work with those databases directly unless a database
administrator is available. To cope with changing conditions, it therefore becomes
necessary to use access tools as major components of the data warehouse, including
data mining tools, application development tools, OLAP tools, query/reporting tools, and more.
OLAP tools-
These tools aid in building a multidimensional data warehouse while allowing business data
analysis from various viewpoints.
Application Development tools-
They help develop customized reports.
5. Additional components
A few data warehouses include some additional components. They
are-
Logical data marts-
A logical data mart is a filtered view of the main data warehouse, but it does not exist
physically as an independent data store.
Operational data store-
An operational data store is an integrated database of operational data whose sources include
legacy systems. It holds current and near-term information.
Dependent data marts-
A dependent data mart is a physical database that fetches all of its information from the data
warehouse.
Steps to Build a Data Warehouse
1. To extract the (transactional) data from different data sources: To build a data
warehouse, data is extracted from various data sources and stored in a central
storage area. For data extraction, Microsoft provides an excellent tool, which is
available at no extra cost when you purchase Microsoft SQL Server.
2. To transform the transactional data: Companies store their data in various DBMSs,
such as MS Access, MS SQL Server, Oracle, Sybase, etc. They also keep data in
spreadsheets, flat files, mail systems, and so on. Relating data from all of these sources
is done while building the data warehouse.
3. To load the (transformed) data into the dimensional database: After the dimensional
model is built, the data is loaded into the dimensional database. This process may
combine several columns into one or split one field into several columns (a small
sketch of both transformations follows this list). Transformation of the data can be
performed at two stages: while loading the data into the dimensional model, or while
extracting the data from its origins.
4. To purchase a front-end reporting tool: Top-notch analytical tools from several major
vendors are available in the market. Microsoft has also released its own cost-effective
tool, Data Analyzer.
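As promised in step 3, here is a minimal Python sketch of the two transformations mentioned there: combining several columns into one, and splitting one field into several. The field names are hypothetical.

# Combine and split transformations on a single record; field names are invented.
record = {"first_name": "Asha", "last_name": "Patel", "order_date": "2024-03-15"}

# Combine: several columns merged into one
record["full_name"] = f"{record['first_name']} {record['last_name']}"

# Split: one field broken into several columns for a date dimension
year, month, day = record["order_date"].split("-")
record.update({"year": int(year), "month": int(month), "day": int(day)})

print(record)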
Data for the warehouse is acquired from multiple, heterogeneous sources, for example operational
databases. Because consistency is required, the extracted data must be brought into a common format
within the warehouse.
Warehouse Databases
Relational database
Relational databases are generally made up of a set of tables containing data that fits into predefined
categories. Each table has at least one data category in a column, and each row holds a specific data
instance for the categories defined in the columns, thereby forming a matrix between data and categories.
The standard user and application-program interface for a relational database is Structured Query
Language (SQL). Relational databases are easily extended, and a new data category can be added after
the original database is created without requiring much modification.
Distributed database
A distributed database is a database in which some parts of the database are stored in multiple physical
locations and in which processing is dispersed or replicated among different points in a network.
Distributed databases are categorized into two forms: homogeneous and heterogeneous. All the physical
locations in a homogeneous distributed database system have the same underlying hardware and run the
same operating systems and database applications. In a heterogeneous distributed database, the hardware,
operating systems, or database applications may differ at each of the locations.
Cloud database
A cloud database is built for a virtual environment; it can be in a hybrid cloud, public cloud or private
cloud. Cloud databases provide benefits such as the ability to pay for storage capacity and bandwidth on a
per-user basis and they provide scalability on demand, along with high availability.
A cloud database also gives enterprises the opportunity to support business applications in a software-as-
a-service deployment.
NoSQL database
NoSQL databases should be used when there is a large set of distributed data. NoSQL databases are
effective for big data performance issues that relational databases aren’t built to solve. They are most
effective when an organization analyzes large chunks of unstructured data or data that are stored on
multiple virtual servers in the cloud.
Object-oriented database
An object-oriented database is organized around objects rather than actions and data rather than logic. For
example, a multimedia record in a relational database can be a definable data object, as opposed to an
alphanumeric value.
Graph Database
This type of database uses graph theory to store, map and query data. Graph databases are collections of
nodes and edges where each node represents an entity and each edge represents a connection between
nodes.
Mapping the Data Warehouse to a Multiprocessor Architecture
Mapping a data warehouse to a multiprocessor architecture is essential for leveraging parallel processing
capabilities to handle large volumes of data efficiently. Multiprocessor architectures can significantly
enhance the performance of data warehousing operations, such as data loading, querying, and processing.
This involves distributing the workload across multiple processors to achieve scalability, high
availability, and faster response times.
Multiprocessor Architectures
Multiprocessor systems can be categorized into two main types:
1. Symmetric Multiprocessing (SMP):
All processors share a single, unified memory space and are controlled by a single operating system.
Processors communicate through shared memory, making it easier to balance the load dynamically.
Common in many commercial database systems due to its simplicity and ease of management.
2. Massively Parallel Processing (MPP):
Each processor has its own memory and operates independently, with processors connected by a high-
speed interconnect.
Designed to handle very large data sets and complex queries by distributing the data and workload across
multiple nodes.
Suitable for large-scale data warehouses where scalability and performance are critical.
Mapping Strategies
Mapping a data warehouse to a multiprocessor architecture involves several strategies to distribute data
and processing tasks efficiently:
1. Data Partitioning:
Horizontal Partitioning: Divide tables into smaller subsets (partitions) based on rows. Each partition can be
processed by a different processor, improving parallel query execution.
Range Partitioning: Distribute data based on a range of values (e.g., date ranges).
Hash Partitioning: Distribute data based on a hash function applied to a key column, ensuring even
distribution.
Vertical Partitioning: Split tables into subsets of columns. Each subset can be processed independently,
useful for wide tables with many columns.
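The following short Python sketch illustrates range and hash partitioning of rows. The records, partition key, and partition count are invented purely for the example; a real data warehouse would perform this routing inside the DBMS.

from datetime import date

orders = [
    {"order_id": 1, "customer_id": 17, "order_date": date(2023, 2, 10)},
    {"order_id": 2, "customer_id": 42, "order_date": date(2023, 7, 3)},
    {"order_id": 3, "customer_id": 99, "order_date": date(2024, 1, 21)},
]

# Range partitioning: route each row to a partition by date range.
def range_partition(row):
    return "p_2023" if row["order_date"].year == 2023 else "p_2024"

# Hash partitioning: route each row by a hash of the key column, which
# tends to spread rows evenly across a fixed number of partitions.
NUM_PARTITIONS = 4
def hash_partition(row):
    return hash(row["customer_id"]) % NUM_PARTITIONS

for row in orders:
    print(row["order_id"], range_partition(row), hash_partition(row))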
2. Parallel Query Execution:
Intra-Query Parallelism: Break down a single query into smaller tasks that can be executed concurrently
across multiple processors. This involves parallel scans, joins, aggregations, and sorts.
Inter-Query Parallelism: Execute multiple queries simultaneously across different processors, improving
overall throughput.
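A minimal Python sketch of intra-query parallelism: a single aggregation is broken into partial sums computed concurrently over partitions and then combined in a final step. The data here is synthetic, and a parallel DBMS would perform this decomposition internally.

from concurrent.futures import ProcessPoolExecutor

def partial_sum(partition):
    # Each worker scans and aggregates its own partition of the fact data.
    return sum(partition)

if __name__ == "__main__":
    fact_amounts = list(range(1_000_000))          # stand-in for a fact column
    n = 4                                          # degree of parallelism
    chunk = len(fact_amounts) // n
    partitions = [fact_amounts[i * chunk:(i + 1) * chunk] for i in range(n)]

    with ProcessPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(partial_sum, partitions))

    print("total =", sum(partials))                # final combine step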
3. ETL Parallelization:
Distribute ETL (Extract, Transform, Load) processes across multiple processors to speed up data loading and
transformation.
Use parallel ETL tools and frameworks that support distributed processing (e.g., Apache Spark, Talend).
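As one hedged illustration, a small PySpark job distributes the extract, transform, and load steps across worker cores or nodes. The input path, column names, and output layout below are assumptions made for the sketch, not a prescribed pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parallel_etl").getOrCreate()

# Extract: Spark reads the file and partitions it across worker cores/nodes.
raw = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transform: cleaning and type casting run in parallel on each partition.
cleaned = (raw
           .dropna(subset=["amount"])
           .withColumn("amount", F.col("amount").cast("double")))

# Load: write partitioned Parquet files, one directory per region value.
cleaned.write.mode("overwrite").partitionBy("region").parquet("warehouse/sales")

spark.stop()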
4. Load Balancing and Resource Management:
Ensure even distribution of workloads across processors to prevent bottlenecks and maximize resource
utilization.
Implement dynamic load balancing techniques to adjust workloads based on processor availability and
performance.
5. Replication and Redundancy:
Use data replication techniques to ensure high availability and fault tolerance. Data can be replicated across
multiple processors or nodes to prevent data loss and maintain availability during failures.
Implement failover mechanisms to switch to backup processors or nodes in case of hardware or software
failures.
6. Index and Materialized View Optimization:
Create indexes and materialized views that are optimized for parallel processing. Ensure that indexes are
distributed across processors to improve query performance.
Use partitioned indexes and materialized views to enhance parallel query execution.
Implementation Considerations
1. Hardware Configuration:
Choose appropriate hardware that supports the chosen multiprocessor architecture. Consider factors such as the
number of processors, memory capacity, storage systems, and interconnect speed.
Ensure that the hardware supports the required level of parallelism and scalability.
2. Database Management System (DBMS):
Use a DBMS that supports multiprocessor architectures and parallel processing. Many modern DBMSs, such
as Oracle, Microsoft SQL Server, and IBM Db2, have built-in support for parallel processing and partitioning.
Configure the DBMS to take advantage of the multiprocessor architecture, including setting parameters for
parallel query execution and partitioning.
3. Data Distribution Strategy:
Carefully design the data distribution strategy to ensure even data distribution and minimize data movement
across processors.
Consider the nature of the queries and workload patterns when designing the partitioning and distribution
strategy.
4. Monitoring and Optimization:
Continuously monitor the performance of the data warehouse to identify bottlenecks and optimize parallel
processing.
Use performance monitoring tools to track query performance, resource utilization, and system health.
5. Scalability and Maintenance:
Design the data warehouse architecture to scale horizontally by adding more processors or nodes as data
volume and workload increase.
Implement maintenance procedures to ensure data consistency, optimize performance, and handle hardware
upgrades or failures.
Mapping a data warehouse to a multiprocessor architecture involves careful planning and implementation of
data partitioning, parallel query execution, ETL parallelization, and load balancing strategies. By leveraging
the capabilities of multiprocessor systems, organizations can achieve significant improvements in data
processing speed, query performance, and overall scalability. Proper hardware configuration, DBMS support,
and continuous monitoring are essential to maximize the benefits of a multiprocessor architecture for data
warehousing.
Data Cubes
When data is grouped or combined into multidimensional matrices, the result is called a data cube. The
data cube method has a few alternative names or variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are
frequently queried.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be
measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as
dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of
things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimension table, which describes
the dimension. For example, a dimension table for items may contain the attributes item_name,
brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database.
If a query contains constants at even lower levels than those provided in a data cube, it is not
clear how to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional
data model is organized around a central theme, like sales and transactions. A fact table
represents this theme. Facts are numerical measures. Thus, the fact table contains measure (such
as Rs_sold) and keys to each of the related dimensional tables.
Dimensions are the perspectives or entities with respect to which a data cube is defined. Facts are
generally quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold
per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
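A small sketch of such a 2-D view, built as a pandas pivot table of dollars sold (in thousands) by item and quarter for the Vancouver location. The items and figures below are illustrative, not actual All Electronics data.

import pandas as pd

# Illustrative fact rows: location, item, quarter, dollars sold (in thousands)
sales = pd.DataFrame({
    "location":     ["Vancouver"] * 4,
    "item":         ["home entertainment", "home entertainment", "computer", "computer"],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [605, 680, 825, 952],
})

# The 2-D "slice" of the cube for Vancouver: items as rows, quarters as columns
cube_2d = pd.pivot_table(sales, values="dollars_sold",
                         index="item", columns="quarter", aggfunc="sum")
print(cube_2d)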
Stars
A star schema is a type of data modeling technique used in data warehousing to represent data
in a structured and intuitive way. In a star schema, data is organized into a central fact table
that contains the measures of interest, surrounded by dimension tables that describe the
attributes of the measures.
The fact table in a star schema contains the measures or metrics that are of interest to the user
or organization. For example, in a sales data warehouse, the fact table might contain sales
revenue, units sold, and profit margins. Each record in the fact table represents a specific event
or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes of the measures in the
fact table. These attributes are used to slice and dice the data in the fact table, allowing users to
analyze the data from different perspectives. For example, in a sales data warehouse, the
dimension tables might include product, customer, time, and location.
In a star schema, each dimension table is joined to the fact table through a foreign key
relationship. This allows users to query the data in the fact table using attributes from the
dimension tables. For example, a user might want to see sales revenue by product category, or
by region and time period.
The star schema is a popular data modeling technique in data warehousing because it is easy to
understand and query. The simple structure of the star schema allows for fast query response
times and efficient use of database resources. Additionally, the star schema can be easily
extended by adding new dimension tables or measures to the fact table, making it a scalable
and flexible solution for data warehousing.
The star schema is the simplest and most fundamental of the data mart schemas. This
schema is widely used to develop or build a data warehouse and dimensional data marts. It
includes one or more fact tables indexing any number of dimension tables. The star schema also
forms the basis of the snowflake schema. It is efficient for handling basic queries.
It is called a star schema because its physical model resembles a star shape, with a fact table at its
center and the dimension tables at its periphery representing the star’s points. Below is an
example to demonstrate the star schema:
In this demonstration, SALES is a fact table having the attributes (Product ID, Order ID,
Customer ID, Employee ID, Total, Quantity, Discount), which reference the dimension
tables. The Employee dimension table contains the attributes: Emp ID, Emp Name, Title,
Department, and Region. The Product dimension table contains the attributes: Product ID, Product
Name, Product Category, and Unit Price. The Customer dimension table contains the attributes:
Customer ID, Customer Name, Address, City, and Zip. The Time dimension table contains the
attributes: Order ID, Order Date, Year, Quarter, and Month.
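To make the structure tangible, the following runnable Python/SQLite sketch builds a simplified version of this star schema (only the Product and Customer dimensions) and runs a typical slice-by-dimension query. All table contents are invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                       product_category TEXT, unit_price REAL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE sales (
    order_id INTEGER, product_id INTEGER, customer_id INTEGER,
    quantity INTEGER, total REAL,
    FOREIGN KEY (product_id)  REFERENCES product(product_id),
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
);
INSERT INTO product  VALUES (1, 'Laptop', 'Computers', 55000), (2, 'Mouse', 'Accessories', 700);
INSERT INTO customer VALUES (10, 'Asha', 'Pune'), (11, 'Ravi', 'Delhi');
INSERT INTO sales    VALUES (100, 1, 10, 1, 55000), (101, 2, 11, 3, 2100);
""")

# Slice the fact table by a dimension attribute: revenue by product category.
for row in con.execute("""
    SELECT p.product_category, SUM(s.total)
    FROM sales s JOIN product p ON s.product_id = p.product_id
    GROUP BY p.product_category"""):
    print(row)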
Model of Star Schema:
In a star schema, the business process data that holds the quantitative facts about a business is
stored in fact tables, while the descriptive characteristics related to the fact data are stored in
dimension tables. Sales price, sale quantity, distance, speed, and weight measurements are a few
examples of fact data in a star schema.
Often, a star schema having multiple dimensions is termed a centipede schema. A star schema
whose dimensions have only a few attributes is easy to handle.
Advantages of Star Schema:
1. Simpler Queries –
The join logic of a star schema is quite simple in comparison to the join logic needed
to fetch data from a highly normalized transactional schema.
2. Simplified Business Reporting Logic –
In comparison to a highly normalized transactional schema, the star schema simplifies
common business reporting logic, such as as-of reporting and period-over-period reporting.
3. Feeding Cubes –
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact,
major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a
source without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly de-normalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don’t easily support many-to-many relationships within business entities.
Features:
Central fact table: The star schema revolves around a central fact table that contains the
numerical data being analyzed. This table contains foreign keys to link to dimension tables.
Dimension tables: Dimension tables are tables that contain descriptive attributes about the
data being analyzed. These attributes provide context to the numerical data in the fact table.
Each dimension table is linked to the fact table through a foreign key.
Denormalized structure: A star schema is denormalized, which means that redundancy is
allowed in the schema design to improve query performance. This is because it is easier and
faster to join a small number of tables than a large number of tables.
Simple queries: Star schema is designed to make queries simple and fast. Queries can be
written in a straightforward manner by joining the fact table with the appropriate dimension
tables.
Aggregated data: The numerical data in the fact table is usually aggregated at different levels
of granularity, such as daily, weekly, or monthly. This allows for analysis at different levels of
detail.
Fast performance: Star schema is designed for fast query performance. This is because the
schema is denormalized and data is pre-aggregated, making queries faster and more efficient.
Easy to understand: The star schema is easy to understand and interpret, even for non-
technical users. This is because the schema is designed to provide context to the numerical data
through the use of dimension tables.
Snow Flakes
The snowflake schema is a variant of the star schema. Here, the centralized fact table is
connected to multiple dimensions. In the snowflake schema, dimensions are present in
a normalized form in multiple related tables. The snowflake structure materializes when the
dimensions of a star schema are detailed and highly structured, having several levels of
relationship, and the child tables have multiple parent tables. The snowflake effect affects only
the dimension tables and does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data warehousing to
represent data in a structured way that is optimized for querying large amounts of data
efficiently. In a snowflake schema, the dimension tables are normalized into multiple related
tables, creating a hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by
the dimension tables. However, each dimension table is further broken down into multiple
related tables, creating a hierarchical structure that resembles a snowflake.
For Example, in a sales data warehouse, the product dimension table might be normalized into
multiple related tables, such as product category, product subcategory, and product details.
Each of these tables would be related to the product dimension table through a foreign
key relationship.
Example:
The Employee dimension table now contains the attributes: EmployeeID, EmployeeName,
DepartmentID, Region, and Territory. The DepartmentID attribute links
the Employee table with the Department dimension table. The Department dimension is
used to provide detail about each department, such as the Name and Location of the
department. The Customer dimension table now contains the attributes: CustomerID,
CustomerName, Address, and CityID. The CityID attribute links the Customer dimension
table with the City dimension table. The City dimension table has details about each city, such
as city name, Zipcode, State, and Country.
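A brief runnable Python/SQLite sketch of the snowflaked Customer dimension described above: the city attributes live in their own table, so a query by country needs one extra join compared with a star schema. The rows themselves are invented.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city_name TEXT, state TEXT, country TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city_id INTEGER,
                       FOREIGN KEY (city_id) REFERENCES city(city_id));
CREATE TABLE sales    (order_id INTEGER, customer_id INTEGER, total REAL);
INSERT INTO city     VALUES (1, 'Pune', 'Maharashtra', 'India'), (2, 'Delhi', 'Delhi', 'India');
INSERT INTO customer VALUES (10, 'Asha', 1), (11, 'Ravi', 2);
INSERT INTO sales    VALUES (100, 10, 55000), (101, 11, 2100);
""")

# Revenue by country now needs two joins (fact -> customer -> city).
for row in con.execute("""
    SELECT ci.country, SUM(s.total)
    FROM sales s
    JOIN customer cu ON s.customer_id = cu.customer_id
    JOIN city ci     ON cu.city_id = ci.city_id
    GROUP BY ci.country"""):
    print(row)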
What is Snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension
tables. In other words, a dimension table is said to be snowflaked if its low-cardinality attributes
have been divided out into separate normalized tables. These tables are then
joined back to the original dimension table with referential constraints (foreign key constraints).
Generally, snowflaking is not recommended for dimension tables, as it hampers the
understandability and performance of the dimensional model: more tables must be joined to
satisfy the queries.
Difference Between Snowflake and Star Schema
The main difference between star schema and snowflake schema is that the dimension table of
the snowflake schema is maintained in the normalized form to reduce redundancy. The
advantage here is that such tables (normalized) are easy to maintain and save storage space.
However, it also means that more joins will be needed to execute the query. This will
adversely impact system performance.
However, the snowflake schema can also be more complex to query than a star schema
because it requires more table joins. This can result in slower query response times and higher
resource usage in the database. Additionally, the snowflake schema can be more difficult to
understand and maintain because of the increased complexity of the schema design.
The decision to use a snowflake schema versus a star schema in a data warehousing project
will depend on the specific requirements of the project and the trade-offs between query
performance, schema complexity, and data integrity.
Characteristics of Snowflake Schema
The snowflake schema uses less disk space.
It is easy to implement new dimensions added to the schema.
Because there are multiple tables, query performance is reduced.
The dimension table consists of two or more sets of attributes that define information at
different grains.
The sets of attributes of the same dimension table are populated by different source
systems.
Features of the Snowflake Schema
Normalization: The snowflake schema is a normalized design, which means that data is
organized into multiple related tables. This reduces data redundancy and improves data
consistency.
Hierarchical Structure: The snowflake schema has a hierarchical structure that is
organized around a central fact table. The fact table contains the measures or metrics of
interest, and the dimension tables contain the attributes that provide context to the
measures.
Multiple Levels: The snowflake schema can have multiple levels of dimension tables,
each related to the central fact table. This allows for more granular analysis of data and
enables users to drill down into specific subsets of data.
Joins: The snowflake schema typically requires more complex SQL queries that
involve multiple table joins. This can impact performance, especially when dealing
with large data sets.
Scalability: The snowflake schema is scalable and can handle large volumes of data.
However, the complexity of the schema can make it difficult to manage and maintain.
Advantages of Snowflake Schema
It provides structured data, which reduces the problem of data integrity.
It uses less disk space because the data is highly structured.
Disadvantages of Snowflake Schema
Snowflaking reduces space consumed by dimension tables but compared with the entire
data warehouse the saving is usually insignificant.
Avoid snowflaking or normalization of a dimension table unless it is required and appropriate.
Do not snowflake hierarchies of a dimension table into separate tables. Hierarchies should
belong to the dimension table only and should never be snowflaked.
Multiple hierarchies that belong to the same dimension should be designed at the
lowest possible level of detail.
Fact Constellation
A fact constellation means two or more fact tables sharing one or more dimensions. It is
also called a galaxy schema.
A fact constellation schema describes the logical structure of a data warehouse or data mart.
It can be designed with a collection of de-normalized fact tables and shared, conformed
dimension tables.
Concept Hierarchy
In data mining, a concept hierarchy refers to the organization of data into a tree-
like structure, where each level of the hierarchy represents a concept that is more general than
the level below it. This hierarchical organization of data allows for more efficient and effective
data analysis, as well as the ability to drill down to more specific levels of detail when needed.
A concept hierarchy is used to organize and classify data in a way that makes it more
understandable and easier to analyze. The main idea behind a concept hierarchy is that the
same data can have different levels of granularity or detail, and that by organizing the
data in a hierarchical fashion, it becomes easier to understand and analyze.
Explanation:
As shown in the above diagram, there is a concept hierarchy for the dimension location,
from which the user can easily retrieve the data. To make it easy to evaluate, the data is represented
in a tree-like structure. The top of the tree consists of the main dimension, location, which further
splits into various sub-nodes. The root node, location, splits into two country nodes,
i.e., USA and India. These countries are then further split into more sub-nodes that
represent the provinces or states, i.e., New York, Illinois, Gujarat, and UP. Thus the concept
hierarchy, as shown in the above example, organizes the data into a tree-like structure in which
each level is more general than the level below it.
The hierarchical structure represents the abstraction levels of the dimension location, which
consists of various levels of the dimension, such as street, city, province or state, and
country.
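A short Python sketch of rolling a measure up this location hierarchy (city to state to country). The cities, mappings, and sales figures below are invented for illustration.

# Mappings that encode two levels of the location concept hierarchy
city_to_state = {"New York City": "New York", "Chicago": "Illinois",
                 "Ahmedabad": "Gujarat", "Lucknow": "UP"}
state_to_country = {"New York": "USA", "Illinois": "USA",
                    "Gujarat": "India", "UP": "India"}

# Sales recorded at the lowest (city) level of the hierarchy
city_sales = {"New York City": 120, "Chicago": 95, "Ahmedabad": 80, "Lucknow": 60}

# Roll up from city to state, then from state to country
state_sales, country_sales = {}, {}
for city, amount in city_sales.items():
    state = city_to_state[city]
    state_sales[state] = state_sales.get(state, 0) + amount
for state, amount in state_sales.items():
    country = state_to_country[state]
    country_sales[country] = country_sales.get(country, 0) + amount

print(state_sales)
print(country_sales)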
Why Concept Hierarchies Are Useful
There are several reasons why a concept hierarchy is useful in data mining:
1. Improved Data Analysis: A concept hierarchy can help to organize and simplify data,
making it more manageable and easier to analyze. By grouping similar concepts together, a
concept hierarchy can help to identify patterns and trends in the data that would otherwise
be difficult to spot. This can be particularly useful in uncovering hidden or unexpected
insights that can inform business decisions or guide the development of new products or
services.
2. Improved Data Visualization and Exploration: A concept hierarchy can help to improve
data visualization and data exploration by organizing data into a tree-like structure,
allowing users to easily navigate and understand large and complex data sets. This can be
particularly useful in creating interactive dashboards and reports that allow users to easily
drill down to more specific levels of detail when needed.
3. Improved Algorithm Performance: The use of a concept hierarchy can also help to
improve the performance of data mining algorithms. By organizing data into a hierarchical
structure, algorithms can more easily process and analyze the data, resulting in faster and
more accurate results.
4. Data Cleaning and Pre-processing: A concept hierarchy can also be used in data cleaning
and pre-processing, to identify and remove outliers and noise from the data.
5. Domain Knowledge: A concept hierarchy can also be used to represent the domain
knowledge in a more structured way, which can help in a better understanding of the data
and the problem domain.