Data Mining Unit-2 notes
A data warehouse is a central repository for storing, analyzing, and interpreting data in order to support
better-informed decision-making. An organization's data warehouse receives data on a regular basis from a
variety of sources, including transactional systems, relational databases, and other sources.
A data warehouse is a type of data management system that facilitates and supports business intelligence
(BI) activities, specifically analysis. Data warehouses are primarily designed to facilitate searches and
analyses and usually contain large amounts of historical data.
A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-facing applications; and external partner
systems. This data is then made available for decision-makers to access and analyze. In short, a data
warehouse is a comprehensive repository of current and historical information that is designed to enhance
an organization's performance.
Characteristics of a Data Warehouse
Subject-Oriented
A data warehouse is subject-oriented: it organizes information around the major subjects of a business,
such as sales, promotions, or inventory, rather than around its day-to-day processes. For example, to
analyze your company's sales data, you would build a data warehouse that concentrates on sales. Such
a warehouse could answer questions like 'Who was our best customer last year?' or 'Who is likely to be
our best customer in the coming year?'
Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format. The data
must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming,
format, and coding. This facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous data is
not erased when current data is entered. This helps you to analyze what has happened and when.
Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly.
Time variance is often visible in the primary key of a warehouse table, which contains an element of time
such as the day, week, or month.
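A minimal sketch of such a time-variant record in Python (the record and field names here are illustrative,
not taken from any particular warehouse):

    from dataclasses import dataclass
    from datetime import date

    # The key of a warehouse fact combines a business identifier with an
    # explicit time element, tying each row to the period it describes.
    @dataclass(frozen=True)
    class SalesFact:
        sale_date: date    # the time element of the key
        product_id: int    # the business element of the key
        units_sold: int
        revenue: float

    row = SalesFact(date(2023, 1, 15), product_id=42, units_sold=3,
                    revenue=59.97)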
In today’s rapidly changing corporate environment, organizations are turning to cloud-based technologies
for convenient data collection, reporting, and analysis. This is where Data Warehousing comes in as a core
component of business intelligence that enables businesses to enhance their performance. It is important to
understand what a data warehouse is and why it is evolving in the global marketplace.
In this article, we'll provide an overview of data warehouses and explore key concepts like data warehouse
architecture, the characteristics of a data warehouse, data management, the benefits of a data warehouse,
and data warehouse applications in data science.
Although a data warehouse and a traditional database share some similarities, they are not the same
thing. The main difference is that a database collects data for transactional purposes, whereas a data
warehouse collects data on an extensive scale to support analytics. Databases serve real-time operational
data, while warehouses store data to be accessed by large analytical queries.
A data warehouse is an example of an OLAP (online analytical processing) system, that is, an online
database query-answering system. OLTP (online transaction processing), by contrast, is an online database
modifying system; an ATM is a typical example. The sketch below contrasts the two kinds of workload.
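To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module (the
accounts table is made up for illustration): the OLTP statement modifies a single row, while the OLAP
query scans and aggregates many rows.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    cur.executemany("INSERT INTO accounts VALUES (?, ?)",
                    [(1, 500.0), (2, 1200.0), (3, 80.0)])

    # OLTP: modify one record (e.g., an ATM withdrawal)
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")

    # OLAP: answer an analytical question over many records
    cur.execute("SELECT COUNT(*), AVG(balance) FROM accounts")
    print(cur.fetchone())    # (3, 560.0)
    conn.close()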
Bottom Tier
The bottom tier, or data warehouse server, is usually a relational database system. Back-end tools
are used to cleanse, transform, and feed data into this layer, as in the sketch below.
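A minimal sketch of such a back-end (ETL) step, assuming pandas is available and using made-up column
names: raw records are cleansed, transformed into a consistent format, and loaded into the relational
bottom tier.

    import sqlite3
    import pandas as pd

    raw = pd.DataFrame({
        "customer": [" alice ", "BOB", None],
        "amount": ["10.5", "20", "7.25"],
    })

    # Cleanse: drop incomplete rows and normalize the name format
    clean = raw.dropna().copy()
    clean["customer"] = clean["customer"].str.strip().str.title()

    # Transform: enforce a consistent numeric type for amounts
    clean["amount"] = clean["amount"].astype(float)

    # Load into the warehouse's relational bottom tier
    conn = sqlite3.connect("warehouse.db")
    clean.to_sql("sales", conn, if_exists="append", index=False)
    conn.close()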
Middle Tier
The middle tier is an OLAP server that can be implemented in two ways.
The ROLAP (relational OLAP) model is an extended relational database management system that maps
multidimensional operations onto standard relational operations.
The MOLAP (multidimensional OLAP) model operates directly on multidimensional data and operations.
Top Tier
This is the front-end client layer through which users retrieve data from the warehouse. It holds various
tools such as query tools, analysis tools, reporting tools, and data mining tools.
Data Warehousing integrates data and information collected from various sources into one comprehensive
database. For example, a data warehouse might combine customer information from an organization’s
point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential
information about employees and salaries. Businesses use these integrated components of the data
warehouse to analyze their customers; a small integration sketch follows.
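As a rough illustration of this integration, assuming pandas and two made-up sources, customer records
from the point-of-sale system and the mailing list are combined on a shared customer id:

    import pandas as pd

    pos = pd.DataFrame({"cust_id": [1, 2], "total_spend": [250.0, 90.0]})
    mailing = pd.DataFrame({"cust_id": [2, 3], "email": ["b@x.com", "c@x.com"]})

    # Integrate both sources into one customer view keyed on cust_id;
    # an outer join keeps customers known to only one source
    customers = pos.merge(mailing, on="cust_id", how="outer")
    print(customers)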
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns
in vast volumes of data and devising innovative strategies for increased sales and profits.
Disadvantages of OLAP
Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical
expertise.
Data size limitations: OLAP systems can struggle with very large data sets and may require
extensive data aggregation or summarization.
Performance issues: OLAP systems can be slow when dealing with large amounts of data,
especially when running complex queries or calculations.
Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of OLAP
analysis.
Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need
for specialized hardware and software.
Inflexibility: OLAP systems may not easily accommodate changing business needs and may require
significant effort to modify or extend.
Data Generalization
Data generalization is the process of summarizing data by replacing relatively low-level values with
higher-level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach:
It is also known as the OLAP approach.
It is an efficient approach because computations are performed once and the results stored in the data
cube, where they can be reused, for example to chart past sales (see the roll-up sketch after this list).
It uses roll-up and drill-down operations on a data cube.
These operations typically involve aggregate functions, such as count(), sum(), avg(), and max().
These materialized views can then be used for decision support, knowledge discovery, and many
other applications.
2. Attribute-oriented induction:
It is an online, query-oriented, generalization-based data analysis approach.
In this approach, generalization is performed on the basis of the distinct values of each attribute within
the relevant data set; afterwards, identical tuples are merged and their respective counts are accumulated
in order to perform aggregation (see the induction sketch after this list).
The data cube approach, by contrast, performs offline aggregation before an OLAP or data mining query is
submitted for processing, whereas attribute-oriented induction, at least in its initial proposal, is a
relational-database, query-oriented, generalization-based online data analysis technique.
It is not confined to particular measures or to categorical data.
The attribute-oriented induction approach uses two methods:
(i) Attribute removal.
(ii) Attribute generalization.
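Here is the roll-up sketch referenced above, assuming pandas and an illustrative sales table: aggregating
over (city, month) gives the base cuboid, and dropping the city dimension rolls up to totals per month.

    import pandas as pd

    sales = pd.DataFrame({
        "city":  ["Pune", "Pune", "Mumbai", "Mumbai"],
        "month": ["Jan", "Feb", "Jan", "Feb"],
        "units": [10, 15, 20, 5],
    })

    # Base cuboid: aggregate by (city, month)
    base = sales.groupby(["city", "month"], as_index=False)["units"].sum()

    # Roll-up: climb the concept hierarchy by dropping the city dimension
    by_month = base.groupby("month", as_index=False)["units"].sum()
    print(by_month)    # Jan -> 30, Feb -> 20

Drill-down is the reverse operation: moving from the per-month totals back to the finer (city, month)
level.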
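And the induction sketch: a rough illustration of attribute-oriented induction in plain Python, using a
made-up concept hierarchy (city generalizes to state); identical generalized tuples are merged and their
counts accumulated.

    # Hypothetical concept hierarchy: city -> state
    CITY_TO_STATE = {"Pune": "Maharashtra", "Mumbai": "Maharashtra",
                     "Chennai": "Tamil Nadu"}

    tuples = [("Pune", "CS"), ("Mumbai", "CS"), ("Chennai", "EE")]

    counts = {}
    for city, dept in tuples:
        # Attribute generalization: replace the low-level value
        # with its higher-level concept
        generalized = (CITY_TO_STATE[city], dept)
        # Merge identical generalized tuples, accumulating counts
        counts[generalized] = counts.get(generalized, 0) + 1

    print(counts)    # {('Maharashtra', 'CS'): 2, ('Tamil Nadu', 'EE'): 1}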
Strategies for Data Cube Computation
5. Data warehouse
A data warehouse is a central repository of data that is designed for efficient querying and analysis. Data
cubes can be computed on top of a data warehouse, which allows for fast querying of the data. However,
data warehouses can be expensive to set up and maintain, and may not be suitable for all organizations.
6. Distributed computing
In this approach, the data cube is computed using a distributed computing system, such as Hadoop or
Spark.
Advantage: the data cube can be computed over a large dataset that may not fit on a single
machine.
Disadvantage: distributed computing systems can be complex to set up and maintain, and may
require specialized skills and resources.
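A minimal sketch of this approach, assuming a PySpark environment and made-up column names; Spark's
DataFrame API provides a cube() operation that computes aggregates for every grouping combination in a
distributed fashion.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cube-demo").getOrCreate()

    sales = spark.createDataFrame(
        [("Pune", "Jan", 10), ("Mumbai", "Jan", 20), ("Pune", "Feb", 15)],
        ["city", "month", "units"],
    )

    # Compute all grouping combinations of (city, month), including
    # subtotals and the grand total, across the cluster
    cube = sales.cube("city", "month").agg(F.sum("units").alias("total"))
    cube.show()
    spark.stop()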
7. In-memory computing
This approach involves storing the data in memory and computing the data cube directly from memory.
Advantage: very fast querying, since the data is already in memory and does not need to be
retrieved from disk.
Disadvantage: it may not be practical for very large datasets, since the data may not fit in
memory.
8. Streaming data
This approach involves computing the data cube on a stream of data, rather than a batch of data.
Advantage: the data cube can be updated in real time as new data becomes available.
Disadvantage: it can be more complex to implement, and may require specialized tools and
techniques.
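A rough sketch of the streaming approach in plain Python, with simulated events: instead of recomputing
the cube in batch, each arriving record is folded into the running aggregates.

    from collections import defaultdict

    # Running cell totals of a (city, month) cuboid, updated in real time
    cube = defaultdict(int)

    def on_event(city, month, units):
        # Incrementally fold each arriving record into the cube
        cube[(city, month)] += units

    # Simulated stream of incoming sales events
    for event in [("Pune", "Jan", 10), ("Pune", "Jan", 5), ("Mumbai", "Feb", 7)]:
        on_event(*event)

    print(dict(cube))    # {('Pune', 'Jan'): 15, ('Mumbai', 'Feb'): 7}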
Note: Sorting, hashing, and grouping are techniques that can be used to optimize data cube computation,
but they are not necessarily strategies for data cube computation in and of themselves.