DWM CHP1 Notes
When new data is added to existing data, the old data is not deleted; this property is called non-volatility. A data warehouse is kept separate from the operational database, and hence changes made in the operational database are not reflected in the data warehouse.
Non-volatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyse what has occurred.
Separate:
The DW is separate from the operational systems in the company. It gets its
data out of these legacy systems
Available:
The task of a DW is to make data accessible for the user.
Aggregation performance:
Queries issued by the user have to perform well at all levels of aggregation.
Consistency:
The structure and content of the data are very important and can only be guaranteed by the use of metadata; this is independent of the source and collection date of the data.
servers. What this means is a data warehouse will process queries much
faster and more effectively, leading to efficiency and increased productivity.
2) Better consistency of data:
Developers work with data warehousing systems after data has been
received so that all the information contained in the data warehouse is
standardized. Only uniform data can be used efficiently for successful
comparisons. Other solutions simply cannot match a data warehouse's level
of consistency.
3) Improved user access:
A standard database can be read and manipulated by programs like SQL
Query Studio or the Oracle client, but there is considerable ramp up time for
end users to effectively use these apps to get what they need. Business
intelligence and data warehouse end-user access tools are built specifically
for the purposes data warehouses are used: analysis, benchmarking,
prediction and more.
4) All-in-one:
A data warehouse has the ability to receive data from many different
sources, meaning any system in a business can contribute its data. Let's face
it: different business segments use different applications. Only a proper data
warehouse solution can receive data from all of them and give a business the
"big picture" view that is needed to analyse the business, make plans, track
competitors and more.
5) Future-proof:
A data warehouse doesn't care where it gets its data from. It can work with
any raw information and developers can "massage" any data it may have
trouble with. Considering this, you can see that a data warehouse will
outlast other changes in the business' technology. For example, a business
can overhaul its accounting system, choose a whole new CRM solution or
change the applications it uses to gather statistics on the market and it won't
matter at all to the data warehouse. Upgrading or overhauling apps
anywhere in the enterprise will not require subsequent expenditures to
change the data warehouse side.
6) Retention of data history:
End-user applications typically don't have the ability, not to mention the
space, to maintain much transaction history and keep track of multiple
changes to data. Data warehousing solutions have the ability to track all such changes, preserving a complete history of the data.
COMPONENTS OF A DATA WAREHOUSE:
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component, which holds the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.
Source Data Component
Source data coming into the data warehouses may be grouped into four broad
categories:
Production Data:
This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.
Internal Data:
In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the
internal data, part of which could be useful in a data warehouse.
Archived Data:
Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archive files.
External Data:
Most executives depend on external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.
As we know, data for a data warehouse comes from many different sources.
2) Data Transformation:
If data extraction for a data warehouse poses a big challenge, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation. First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, providing default values for missing data elements, or the elimination of duplicates when we bring in the same data from various source systems. Standardization of data elements forms a large part of data transformation. Data transformation includes many forms of combining pieces of data from different sources. We combine data from a single source record or related data elements from many source records. On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
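The staging tasks described above (cleaning, standardization, and elimination of duplicates across source systems) can be sketched as follows. This is a minimal illustration; the record layout, field names, and source data are assumptions, not part of any real system.

```python
def clean(record):
    """Fill missing data elements with defaults and standardize formats."""
    cleaned = dict(record)
    cleaned.setdefault("country", "UNKNOWN")           # default for missing data
    cleaned["name"] = cleaned["name"].strip().title()  # fix casing and spacing
    return cleaned

def deduplicate(records, key="customer_id"):
    """Keep one record per key when the same data arrives from several sources."""
    seen = {}
    for r in records:
        seen.setdefault(r[key], r)   # first occurrence wins
    return list(seen.values())

# The same customer arriving from two (hypothetical) source systems:
source_a = [{"customer_id": 1, "name": "  alice smith "}]
source_b = [{"customer_id": 1, "name": "Alice Smith", "country": "USA"},
            {"customer_id": 2, "name": "bob jones"}]

staged = deduplicate([clean(r) for r in source_a + source_b])
```

After staging, `staged` holds one cleaned record per customer, ready for loading.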
3) Data Loading:
Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data, consuming a substantial amount of time. After that, as the source data changes, we feed the revisions into the warehouse on an ongoing, incremental basis.
Data Storage Components
Data storage for a data warehouse is a separate repository. The data repositories for the operational systems generally include only the current data. Also, these repositories keep the data in a highly normalized structure for fast and efficient processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing
for data warehouse files and having it transferred to one or more destinations
according to some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalogue in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.
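As a toy illustration of the idea above, a metadata repository can be modelled as a catalogue that records, for each table, its logical structure, its source lineage, and its load date. The table and column names here are purely illustrative assumptions.

```python
# A toy metadata repository: like a DBMS data dictionary, it records
# logical structure (columns and types), source lineage, and load dates.
metadata = {
    "sales_fact": {
        "columns": {"sale_id": "INTEGER",
                    "amount": "DECIMAL(10,2)",
                    "date_key": "INTEGER"},
        "source_system": "orders_oltp",  # lineage: where the data came from
        "last_loaded": "2024-01-15",     # collection date of the data
    }
}

def describe(table):
    """Answer 'what is in this table?' from metadata alone."""
    m = metadata[table]
    return f"{table}: {len(m['columns'])} columns, from {m['source_system']}"
```

A user (or tool) can then call `describe("sales_fact")` to learn about the table without touching the data itself.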
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data updates more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or business unit. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions
within the data warehouse. These components control the data transformation
and the data transfer into the data warehouse storage. On the other hand, it
moderates the data delivery to the clients. Its work with the database
management systems and authorizes data to be correctly saved in the
repositories. It monitors the movement of information into the staging method
and from there into the data warehouses storage itself.
ARCHITECTURE OF DATAWAREHOUSE: -
There are generally three types of data warehouse architecture:
1) Single tier architecture
2) Two-tier architecture
3) Three tier architecture or multi-tier architecture
Single-Tier architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies. The figure shows that the only layer physically available is the source layer. In this method, the data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
OR
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are directed to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier architecture:
The requirement for separation plays an essential role in defining the two- tier
architecture for a data warehouse system, as shown in fig:
Three-Tier Architecture:
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse. The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to accomplish some operational tasks better, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration. This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
OR
Virtual Warehouses
A virtual data warehouse is a set of views over the operational database. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on the operational database servers.
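The "set of views over the operational database" idea can be sketched with SQLite from the Python standard library. The `orders` table and its contents are assumptions made up for the example; the point is that the summary view stores no data of its own, so every analysis query is rewritten against the operational data.

```python
import sqlite3

# A stand-in for the operational database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# The "virtual warehouse" layer: a summary view, not a copy of the data.
con.execute("""CREATE VIEW sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM orders GROUP BY region""")

# Analysis queries hit the view, which the engine evaluates against the
# operational table -- hence the load on the operational server.
rows = con.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
```

Materializing the view (storing its results as a real table and refreshing them) is the trade-off that relieves the operational server at the cost of extra storage.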
ETL stands for Extract, Transform, Load. It is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse. The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period some new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
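The pipelining idea above can be sketched with Python generators: extraction yields rows one at a time, so transformation and loading start before extraction finishes. The source rows and the in-memory "warehouse" are illustrative assumptions.

```python
def extract():
    """Yield raw rows as they become available from the source."""
    for raw in ["  alice,100 ", "bob,200", "carol,300"]:
        yield raw

def transform(rows):
    """Parse and standardize each extracted row as it arrives."""
    for raw in rows:
        name, amount = raw.strip().split(",")
        yield {"name": name.strip().title(), "amount": int(amount)}

def load(rows, warehouse):
    """Append each transformed row into the (toy) warehouse store."""
    for row in rows:
        warehouse.append(row)

warehouse = []
load(transform(extract()), warehouse)  # all three stages overlap row by row
```

Because generators are lazy, each row flows through all three stages before the next one is extracted, rather than each stage waiting for the previous stage to finish the whole batch.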
The process of ETL can be broken down into the following three stages:
1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage the warehouse, and rollback will be much more difficult. Therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping
U.S.A, United States, and America into USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally key-
attribute).
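The tasks listed above can each be written as a small function. This is a hedged sketch: the field names and country map are assumptions, and joining (combining multiple attributes into one, the inverse of splitting) is omitted for brevity.

```python
rows = [
    {"id": 2, "name": "Bob Jones", "country": "U.S.A", "phone": "555-0102"},
    {"id": 1, "name": "Ann Lee", "country": None, "phone": "555-0101"},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def filtering(r):
    """Keep only the attributes the warehouse needs."""
    return {"id": r["id"], "name": r["name"], "country": r["country"]}

def cleaning(r):
    """Map country variants to one code; default NULL values."""
    r["country"] = COUNTRY_MAP.get(r["country"], r["country"]) or "UNKNOWN"
    return r

def splitting(r):
    """Split a single attribute (name) into multiple attributes."""
    r["first_name"], r["last_name"] = r["name"].split(" ", 1)
    return r

transformed = sorted(                       # sorting on the key attribute
    (splitting(cleaning(filtering(r))) for r in rows),
    key=lambda r: r["id"],
)
```

Each row passes through filter → clean → split, and the result is sorted on the key attribute `id`.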
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is loaded after longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
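The two loading regimes, a one-time initial (full) load and subsequent incremental loads at regular intervals, can be sketched as follows. The warehouse is modelled as a plain dict keyed by record id; all names and rows are illustrative assumptions.

```python
def initial_load(warehouse, source_rows):
    """First go-live load: move the full volume of data."""
    warehouse.clear()
    for row in source_rows:
        warehouse[row["id"]] = row

def incremental_load(warehouse, changed_rows):
    """Periodic refresh: apply only rows changed since the last load."""
    for row in changed_rows:
        warehouse[row["id"]] = row   # insert new rows, overwrite updated ones

warehouse = {}
initial_load(warehouse, [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}])
# Later, a scheduled refresh applies only the changes:
incremental_load(warehouse, [{"id": 2, "amount": 250}, {"id": 3, "amount": 50}])
```

The initial load is expensive but happens once; the incremental load touches only changed rows, which is why its rate and period can be tuned to the system's requirements.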
QUESTIONS:
1) Define Data Warehouse (2 definitions)
2) Write the Characteristics of Data Warehousing
3) Difference Between operational database and data warehouse (6 points)
4) Write the needs for data warehouse
5) Explain ETL with diagram
6) Write limitations, benefits and application of data warehouse
7) Draw and explain single tier, 2-tier, and 3-tier architecture of data
warehouse
8) Difference between data warehouse and data mart
9) Difference between ETL and ELT.
10) Explain Meta data repository
11) Explain Data Warehouse Models