Warehousing
Introduction:
Our capabilities for both generating and collecting data have been
increasing rapidly over the last several decades.
Contributing factors include the widespread use of bar codes for most
commercial products, the computerization of many business, scientific
and government transactions, and advances in data collection tools
ranging from scanned text and image platforms to satellite remote
sensing systems.
Popular use of the World Wide Web as a global information system has
flooded us with a tremendous amount of data and information.
This explosive growth in stored data has generated an urgent need for
new techniques and automated tools that can intelligently assist us in
transforming the vast amounts of data into useful information and
knowledge.
Management of data is one of the important objectives of computer
science.
Efficient data management requires the data to be stored in a suitable
architecture.
Data warehousing helps in this respect by storing data in multiple
dimensions.
• Definition:
1. A Data Warehouse is a repository of information
collected from multiple sources, stored under a
unified schema, which usually resides at a
single site.
6. Process Oriented:
It is important to view data warehousing as a process
for delivery of information.
The maintenance of a DW is ongoing and iterative in
nature.
Characteristics:
• Smaller number of (concurrent) users.
• Instant response is less important (only for interactively composing
reports).
• Read-only access by users.
• Most data access will be targeted at a small partition of the data: the
last month or quarter.
• Database access is less frequent, but executes large and complicated
queries that access many rows per table.
• Irregular workload of primarily long-running, complex read-only
transactions instead of a high, constant transaction rate.
• Loads from the operational data store only insert new records; existing
ones are not changed (updated).
• Bulk loads from the operational data store (at most once daily); no
single-record inserts.
• Database design partly de-normalized and redundant for better
performance, using a star or snowflake schema. Database design is
data-driven, not workflow-driven.
• Large storage capacity for historical data.
• May also contain aggregate data.
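The star schema and the large, read-only aggregate queries described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table names (dim_product, dim_date, fact_sales), the columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")

# Star schema: one central fact table referencing two dimension tables.
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY,
        year    INTEGER,
        quarter INTEGER
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        qty        INTEGER,
        amount     REAL
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Pen", "Stationery"), (2, "Lamp", "Home")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, 2023, 4), (2, 2024, 1)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 15.0), (1, 2, 4, 6.0), (1, 2, 2, 3.0), (2, 2, 1, 30.0)])


def sales_by_category_and_quarter(conn):
    """Read-only analytical query: join the fact table to its dimensions and aggregate."""
    return conn.execute("""
        SELECT p.category, d.year, d.quarter, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date    AS d ON d.date_id    = f.date_id
        GROUP BY p.category, d.year, d.quarter
        ORDER BY p.category, d.year, d.quarter
    """).fetchall()


report = sales_by_category_and_quarter(conn)
for row in report:
    print(row)
```

The query touches every row of the fact table but returns only a few aggregated rows, which is the typical access pattern of a warehouse workload.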
Benefits of data warehousing
Some of the benefits that a data warehouse provides are as follows:
• A data warehouse provides a common data model for all data of interest
regardless of the data's source.
• A DW makes it easier to report on and analyze information than it would be if
multiple data models were used to retrieve information such as sales
invoices, order receipts, general ledger charges, etc.
• Prior to loading data into the data warehouse, inconsistencies are
identified and resolved. This greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse
users so that, even if the source system data is purged over time, the
information in the warehouse can be stored safely for extended periods of
time.
• Because they are separate from operational systems, data warehouses
provide retrieval of data without slowing down operational systems.
• Data warehouses can work in conjunction with and, hence, enhance the
value of operational business applications, notably customer relationship
management (CRM) systems.
• Data warehouses facilitate decision support system applications such as
trend reports (e.g., the items with the most sales in a particular area
within the last two years), exception reports, and reports that show actual
performance versus goals.
Data Warehousing:
• Data warehousing is a process of constructing and
using data warehouses.
• The classic definition of the data warehouse
focuses on data storage.
• However, the means to retrieve and analyze data,
to extract, transform and load data, and to manage
the data dictionary are also considered essential
components of a data warehousing system.
• Many references to data warehousing use this
broader context.
• Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to
extract, transform, and load data into the repository,
and tools to manage and retrieve metadata.
Extract, Transform, and Load (ETL) is a process in data
warehousing that involves:
• extracting data from outside sources,
• transforming it to fit business needs,
• loading it into the end target, i.e. the data warehouse.
1) Extract:
– The first part of an ETL process is to extract the data from the source
systems.
– Most data warehousing projects consolidate data from different source
systems.
– Each separate system may also use a different data organization
format.
– Common data source formats are relational databases and flat files.
– Extraction converts the data into a format for transformation
processing.
• An intrinsic part of the extraction is the parsing of extracted
data, which checks whether the data meets an expected
pattern or structure. If not, the data may be rejected entirely.
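The parsing check described above can be sketched as follows. This is a minimal illustration and not a definitive implementation: it assumes the source is a comma-separated flat file, and the column layout in EXPECTED_COLUMNS and the function name extract are hypothetical.

```python
import csv
import io

# Hypothetical layout that the flat file is expected to follow.
EXPECTED_COLUMNS = ["order_id", "qty", "unit_price"]


def extract(flat_file_text):
    """Parse a CSV flat file into typed rows; reject the whole input on any mismatch."""
    reader = csv.DictReader(io.StringIO(flat_file_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({
                "order_id": int(raw["order_id"]),
                "qty": int(raw["qty"]),
                "unit_price": float(raw["unit_price"]),
            })
        except (TypeError, ValueError) as exc:
            # A missing field arrives as None (TypeError); a malformed one raises ValueError.
            raise ValueError(f"line {line_no} does not match the expected pattern") from exc
    return rows


sample = "order_id,qty,unit_price\n1,2,3.5\n2,1,10\n"
print(extract(sample))
```

Rejecting the whole input on the first bad line is one possible policy; a real pipeline might instead divert bad rows to a reject file and continue.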
2) Transform:
• The transform stage applies a series of rules or functions to the
extracted data.
• Some data sources will require very little or even no manipulation of data.
• In other cases, one or more of the following transformation types may be
required to meet the business and technical needs of the end target:
– Selecting only certain columns to load (or selecting null columns not to
load).
– Translating coded values (e.g., if the source system stores 1 for male
and 2 for female, but the warehouse stores M for male and F for female).
– Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M").
– Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
– Filtering
– Sorting
– Joining together data from multiple sources.
– Aggregation.
– Transposing or pivoting (turning multiple columns into multiple rows or
vice versa)
– Splitting a column into multiple columns (e.g., putting a comma-
separated list specified as a string in one column as individual values in
different columns)
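Several of the transformation types listed above can be sketched in a few lines of plain Python. The row layout, the field names and the helper names transform and total_by_gender are assumptions made for this illustration; only the 1/2 to M/F code mapping and the sale_amount formula come from the examples above.

```python
from collections import defaultdict

# Translation table for coded values: the source stores 1/2, the warehouse stores M/F.
GENDER_CODES = {1: "M", 2: "F"}


def transform(rows):
    """Apply selection, translation, derivation, filtering and sorting to extracted rows."""
    out = []
    for row in rows:
        if row["qty"] <= 0:  # filtering: drop rows without a real sale
            continue
        out.append({
            # selecting only certain columns: "internal_note" is not carried over
            "customer": row["customer"],
            # translating coded values
            "gender": GENDER_CODES[row["gender_code"]],
            # deriving a new calculated value
            "sale_amount": row["qty"] * row["unit_price"],
        })
    out.sort(key=lambda r: r["customer"])  # sorting
    return out


def total_by_gender(rows):
    """Aggregation: sum sale_amount per gender code."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["gender"]] += r["sale_amount"]
    return dict(totals)


# Hypothetical extracted rows.
source_rows = [
    {"customer": "Bo", "gender_code": 2, "qty": 3, "unit_price": 2.0, "internal_note": "x"},
    {"customer": "Al", "gender_code": 1, "qty": 2, "unit_price": 5.0, "internal_note": "y"},
    {"customer": "Cy", "gender_code": 1, "qty": 0, "unit_price": 9.0, "internal_note": "z"},
]
clean_rows = transform(source_rows)
print(clean_rows)
print(total_by_gender(clean_rows))
```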
3) Load:
• The load phase loads the data into the end target,
usually the data warehouse.
• Depending on the requirements of the organization,
this process varies widely. Some data warehouses
might overwrite existing information weekly with
cumulative, updated data, while other DWs (or even
other parts of the same DW) might add new data in
a historized form, e.g. hourly.
• As the load phase interacts with a database, the
constraints defined in the database schema, as well
as triggers activated upon data load, apply (e.g.
uniqueness, referential integrity, mandatory fields).
These also contribute to the overall data quality
performance of the ETL process.
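How schema constraints act as a quality gate during the load phase can be sketched with Python's built-in sqlite3 module. The tables dim_customer and fact_sales and the helpers create_warehouse and bulk_load are hypothetical names for this illustration. Treating each batch as a single transaction is an assumption of the sketch, not something the text prescribes.

```python
import sqlite3


def create_warehouse():
    """Create an in-memory target with uniqueness, mandatory-field and referential constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            order_id    INTEGER PRIMARY KEY,                                   -- uniqueness
            customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id), -- referential integrity
            sale_amount REAL NOT NULL                                          -- mandatory field
        );
        INSERT INTO dim_customer VALUES (1, 'Al');
    """)
    return conn


def bulk_load(conn, rows):
    """Load a batch in one transaction; any constraint violation rolls back the whole batch."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO fact_sales (order_id, customer_id, sale_amount) VALUES (?, ?, ?)",
                rows,
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = create_warehouse()
print(bulk_load(conn, [(1, 1, 10.0), (2, 1, 6.0)]))  # valid batch
print(bulk_load(conn, [(3, 1, 4.0), (1, 1, 9.0)]))   # duplicate order_id in the batch
print(bulk_load(conn, [(4, 99, 1.0)]))               # unknown customer_id
```

Only the first batch is stored; the other two are rejected by the database itself rather than by ETL code.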
Need for a separate data warehouse:
• Why not perform online analytical processing
directly on the operational database?
• Why spend additional time and resources to
construct a separate data warehouse?
1) The major reason for such separation is to promote high
performance of both systems.
2) Running OLAP operations on the operational database reduces the
throughput of the OLTP system.
3) The separation is also based on the different structures, contents
and uses of the data in the two systems.
• Since the two systems provide quite different
functionalities and require different kinds of data, it is
necessary to maintain separate databases.