Data Warehousing and Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer: what is mined is knowledge rather than the data itself, so
knowledge mining from data would have been a more appropriate name. Data mining is
the computational process of discovering patterns in large data sets involving methods
at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in
gigabytes of store scanner data — and mining a mountain for a vein of valuable ore.
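The "linked products" idea can be illustrated with a minimal sketch that counts how often two products appear in the same basket. The basket contents and the names `baskets`, `count_pairs`, and `support` are assumptions made up for this illustration; real scanner data would be far larger and would normally be mined with a dedicated association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scanner data: each inner list is one customer's basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of products."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_pairs(baskets)
# Support: the fraction of all baskets in which the pair appears together
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
most_linked = max(support, key=support.get)
print(most_linked, support[most_linked])
```

Pairs with high support are candidates for "linked products"; a fuller analysis would also compare each pair against how often the individual products sell on their own.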
A typical data mining system may have the following major components.
1. Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine: This is essential to the data mining system and ideally consists
of a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
4. User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query
or task, providing information to help focus the search, and performing exploratory data
mining based on the intermediate data mining results. In addition, this component
allows the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns in different forms.
Data warehousing is a method of organizing and compiling data into one database,
whereas data mining deals with extracting important information from databases. Data
mining attempts to uncover meaningful patterns, and it depends on the data that is
compiled in the data warehouse.
DATA WAREHOUSE:
A data warehouse is a store, usually with large capacity, where data is collected for
mining purposes. Data from an organization's various systems is held in the data
warehouse, where it can be fetched as needed.
Data warehouses consolidate data from several sources and ensure data accuracy,
quality, and consistency. System performance is improved by separating analytical
processing from the traditional transactional databases. In a data warehouse, data is
organized into a structured format by type and as needed. The data is examined by
query tools using several patterns.
Data warehouses store historical data and handle analytical requests faster, supporting
online analytical processing (OLAP), whereas an operational database stores the current
transactions of a business process, which is called online transaction processing (OLTP).
● Subject Oriented:
It provides important data about a specific subject, such as suppliers, products,
promotions, customers, etc. Data warehousing usually handles the analysis and
modeling of data that assists an organization in making data-driven decisions.
● Integrated:
Different heterogeneous sources, such as flat files or relational databases, are put
together to build a data warehouse.
● Time-Variant:
● Nonvolatile:
This means the earlier data is not deleted when new data is added to the data
warehouse. The operational database and data warehouse are kept separate and thus
continuous changes in the operational database are not shown in the data warehouse.
● Consumer goods
● Banking services
● Financial services
● Manufacturing
● Retail sectors
Data warehouse applications are designed to support users' ad hoc data
requirements, an activity recently dubbed online analytical processing (OLAP). These
include applications such as forecasting, profiling, summary reporting, and trend
analysis.
Data warehouses and their architectures vary depending on the specifics of an
organization's situation.
Operational System
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.
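As a small illustration of basic document metadata, the sketch below reads a file's name, size, and last-modified time using the Python standard library. The helper name `basic_metadata` and the throwaway sample file are assumptions made for this example; warehouse metadata in practice also covers schemas, sources, and load history.

```python
import os
import tempfile
from datetime import datetime, timezone


def basic_metadata(path):
    """Return a few basic metadata fields for a file on disk."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Create a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello warehouse")
    sample_path = handle.name

meta = basic_metadata(sample_path)
print(meta)
os.remove(sample_path)  # clean up the temporary file
```

None of these fields describe the file's contents; they are data about the data, which is what makes searching and managing it easier.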
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the
warehouse.
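The idea of summarized data can be sketched with an in-memory SQLite database: detail rows are aggregated once into a summary table, and the summary is rebuilt after each new load. The table names `sales` and `sales_summary` and the helper `refresh_summary` are assumptions for this illustration, not part of any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2024-01", 100.0),
        ("north", "2024-01", 50.0),
        ("south", "2024-01", 70.0),
        ("north", "2024-02", 30.0),
    ],
)


def refresh_summary(conn):
    """Rebuild the summary table from the detail rows."""
    conn.execute("DROP TABLE IF EXISTS sales_summary")
    conn.execute(
        "CREATE TABLE sales_summary AS "
        "SELECT region, month, SUM(amount) AS total "
        "FROM sales GROUP BY region, month"
    )


refresh_summary(conn)
# New detail rows arrive, so the summary is refreshed after the load
conn.execute("INSERT INTO sales VALUES ('south', '2024-01', 30.0)")
refresh_summary(conn)

rows = conn.execute("SELECT region, month, total FROM sales_summary")
summary = {(region, month): total for region, month, total in rows}
print(summary)
```

Queries that only need monthly totals can read the small summary table instead of scanning every detail row. A full rebuild is the simplest refresh strategy; incremental updates are a common alternative for large tables.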
A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
1.7 Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse
that can provide information for reporting and analysis on a section, unit, department or
operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and
sales or mine historical information to make predictions about customer behavior.
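A data mart can be thought of as a subject-specific slice of the warehouse. The sketch below groups a handful of made-up warehouse rows into one mart per subject area; the row layout and the helper `build_data_marts` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical warehouse rows: (subject_area, record)
warehouse = [
    ("sales", {"order_id": 1, "amount": 120.0}),
    ("purchasing", {"po_id": 7, "cost": 80.0}),
    ("sales", {"order_id": 2, "amount": 45.5}),
    ("stocks", {"sku": "A1", "on_hand": 12}),
]


def build_data_marts(rows):
    """Group warehouse rows into one mart per subject area."""
    marts = defaultdict(list)
    for subject, record in rows:
        marts[subject].append(record)
    return dict(marts)


marts = build_data_marts(warehouse)
# An analyst working on sales only needs to query the sales mart
sales_total = sum(record["amount"] for record in marts["sales"])
print(sorted(marts), sales_total)
```

Each group works against its own smaller mart, while the warehouse remains the single source from which the marts are derived.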
Ans.: A data warehouse is never static; it evolves as the business expands. As the
business evolves, its requirements keep changing, and therefore a data warehouse must
be designed to accommodate these changes. Hence a data warehouse system needs to be
flexible.
Ideally, there should be a delivery process to deliver a data warehouse. However, data
warehouse projects normally suffer from various issues that make it difficult to
complete tasks and deliverables in the strict and ordered fashion demanded by the
waterfall method. Most of the time, the requirements are not understood completely.
The architectures, designs, and build components can be completed only after
gathering and studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted
for the delivery of a data warehouse. We have staged the data warehouse delivery
process to minimize risks. The approach that we will discuss here does not reduce the
overall delivery time-scales but ensures the business benefits are delivered
incrementally through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery
risk.
IT Strategy
Data warehouses are strategic investments that require a business process
to generate benefits. IT Strategy is required to procure and retain funding
for the project.
Business Case
The objective of the business case is to estimate business benefits that
should be derived from using a data warehouse. These benefits may not be
quantifiable but the projected benefits need to be clearly stated. If a data
warehouse does not have a clear business case, then the business tends to
suffer from credibility problems at some stage during the delivery process.
Therefore in data warehouse projects, we need to understand the business
case for investment.
The following points are to be kept in mind to produce an early release and
deliver business benefits.
Business Requirements
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long term
requirements. This phase also delivers the components that must be
implemented in the short term to derive any business benefit. The blueprint
needs to identify the following.
Ans: Data warehouses are central repositories that store data from one or
more heterogeneous sources. They are analytical tools built to support
decision-making for reporting users across many departments. A data
warehouse works to create a single, unified system of truth for an entire
organization. A data warehouse is a data management system that is used for
storing, reporting, and data analysis. It is the primary component of
business intelligence and is also known as an enterprise data warehouse.
Data warehouses organize and store historical data about the business and
organization so that it can be analyzed and insights extracted from it.
The tools that accurately source data contents and formats from external data stores
into the data warehouse have to perform several essential tasks, including:
○ Data transformation and calculation based on the business rules that drive the
transformation.
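Rule-driven transformation can be sketched as a list of small functions applied to each record in order. The two rules below (storing money as integer cents and mapping country names to codes) and the country mapping are assumptions invented for this example, not rules taken from any real system.

```python
def to_cents(record):
    # Rule: store money as integer cents instead of a decimal string
    record["amount_cents"] = round(float(record.pop("amount")) * 100)
    return record


def normalize_country(record):
    # Rule: map free-text country names to a two-letter code (hypothetical mapping)
    codes = {"united states": "US", "usa": "US", "india": "IN"}
    record["country"] = codes.get(record["country"].strip().lower(), "??")
    return record


# The order of the list is the order in which the rules are applied
BUSINESS_RULES = [to_cents, normalize_country]


def transform(record):
    """Apply every business rule to a copy of the source record."""
    result = dict(record)  # copy so the source record is left untouched
    for rule in BUSINESS_RULES:
        result = rule(result)
    return result


source = {"amount": "19.99", "country": " USA "}
row = transform(source)
print(row)
```

Keeping the rules in a list makes it easy to add, remove, or reorder them as business requirements change, without touching the code that drives the transformation.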
There are several selection criteria which should be considered while implementing a
data warehouse:
1. The ability to identify the data in the data source environment that can be read by
the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many
installations.
7. Selective data extraction of both data items and records enables users to extract
only the required data.
9. The ability to perform data type and the character-set translation is a requirement
when moving data between incompatible systems.
10. The ability to create aggregation, summarization, and derivation fields and
records is necessary.
11. Vendor stability and support for the products are components that must be
evaluated carefully.
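Three of the criteria above (character-set translation, merging records from multiple data stores, and selective extraction) can be sketched together in a few lines. The two sample stores, the semicolon-delimited legacy layout, and the helper names are assumptions made up for this illustration.

```python
# Hypothetical legacy store: semicolon-delimited bytes encoded as Latin-1
legacy_rows = [b"1001;Jos\xe9;Madrid", b"1002;Zo\xeb;Lyon"]
# Hypothetical second store: dictionaries keyed by customer id
crm_rows = {"1001": {"segment": "retail"}, "1002": {"segment": "wholesale"}}


def decode_legacy(raw, encoding="latin-1"):
    """Translate legacy bytes into text and split them into named fields."""
    customer_id, name, city = raw.decode(encoding).split(";")
    return {"id": customer_id, "name": name, "city": city}


def merge_and_select(legacy, crm, fields):
    """Merge both stores on customer id and keep only the requested fields."""
    merged = []
    for raw in legacy:
        record = decode_legacy(raw)
        record.update(crm.get(record["id"], {}))  # merge on the shared key
        merged.append({field: record.get(field) for field in fields})
    return merged


customers = merge_and_select(legacy_rows, crm_rows, ["id", "name", "segment"])
print(customers)
# Once decoded, the text can be written out in the target encoding, e.g. UTF-8
utf8_name = customers[0]["name"].encode("utf-8")
```

The `city` field is read from the source but deliberately left out of the result, which is the selective extraction of data items described in point 7.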
A warehousing team will require different types of tools during a warehouse project.
These software products usually fall into one or more of the categories shown in the
figure.
The warehouse team needs tools that can extract, transform, integrate, clean, and load
information from a source system into one or more data warehouse databases.
Middleware and gateway products may be needed for warehouses that extract records
from a host-based source system.
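The extract, transform, clean, and load steps can be sketched end to end with the Python standard library. The inline CSV text stands in for a source system, and the table name `fact_payment` and the three helper functions are assumptions for this illustration; real ETL tools add scheduling, error handling, and logging on top of the same basic flow.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or a database
SOURCE_CSV = "id,name,amount\n1, alice ,10.5\n2,bob,\n3, carol ,7\n"


def extract(text):
    """Read the raw source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean names, convert types, and drop rows with a missing amount."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # cleaning step: skip incomplete records
        cleaned.append(
            (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Insert the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_payment (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_payment VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT id, name, amount FROM fact_payment ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions means each can be tested and replaced independently, for example swapping the CSV reader for a database query.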
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying
metadata. Relational database management systems are well suited to large and
growing warehouses.
Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end users.