
Warehousing

Data warehousing involves extracting data from multiple sources, transforming it to fit business needs, and loading it into a central data warehouse for analysis and reporting. A data warehouse contains integrated, subject-oriented data that supports decision-making. It allows users to analyze historical data across an organization. The extraction, transformation, and loading (ETL) process prepares the data to load into the warehouse by resolving inconsistencies and structuring it for easy analysis.

Uploaded by

manoraman
Copyright
© Attribution Non-Commercial (BY-NC)

DATA WAREHOUSING

Introduction:
Our capabilities of both generating and collecting data have been
increasing rapidly in the last several decades.
Contributing factors include the widespread use of bar codes for most
commercial products, the computerization of many business, scientific
and government transactions, and advances in data collection tools
ranging from scanned text and image platforms to satellite remote
sensing systems.
The popular use of the World Wide Web as a global information system has
flooded us with a tremendous amount of data and information.
This explosive growth in stored data has generated an urgent need for
new techniques and automated tools that can intelligently assist us in
transforming the vast amounts of data into useful information and
knowledge.
Management of data is one of the important objectives of computer
science.
Efficient data management requires that data be stored in a suitable
architecture.
Data warehousing helps in this respect by storing data in multiple
dimensions.
• Definition:
1. A Data Warehouse is a repository of information
collected from multiple sources, stored under a
unified schema and which usually resides at a
single site.

2. A Data Warehouse is a repository of subjectively
selected and adapted operational data which can
answer any ad hoc, complex, statistical or analytical
queries.

3. A Data Warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data in
support of management’s decision making process.
• A data warehouse is a database that is
maintained separately from an organization’s
operational databases.
• Data warehouse systems allow for the
integration of a variety of application systems.
• They support information processing by
providing a solid platform of consolidated
historical data for analysis.
• A data warehouse is a repository of an
organization’s electronically stored data.
• Data warehouses are designed to facilitate
reporting and analysis.
• Features:
1. Subject Oriented:
• Data is arranged and optimized to provide answers
to questions from diverse functional areas.
• A DW is organized around major subjects like
customer, supplier, product and sales.
• It focuses on the modeling and analysis of data for
decision makers and not on the day-to-day operations
and transaction processing of an organization.
• A DW typically provides a simple and concise view
around particular subject issues by excluding data
that are not useful in the decision support process.
• For example, to learn more about your company's
sales data, you can build a warehouse that
concentrates on sales. Using this warehouse, you
can answer questions like "Who was our best
customer for this item last year?"
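A minimal sketch of how the example question above might be answered. The `sales` table, its column names, and the sample rows are assumptions made for illustration; they do not come from the original material.

```python
import sqlite3

# Hypothetical sales table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, item TEXT, sale_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Acme", "widget", 2023, 500.0),
        ("Acme", "widget", 2023, 250.0),
        ("Bolt", "widget", 2023, 600.0),
        ("Bolt", "gadget", 2023, 900.0),
        ("Acme", "widget", 2022, 9000.0),
    ],
)


def best_customer(conn, item, year):
    """Return (customer, total) with the highest total sales of `item` in `year`."""
    return conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM sales
        WHERE item = ? AND sale_year = ?
        GROUP BY customer
        ORDER BY total DESC
        LIMIT 1
        """,
        (item, year),
    ).fetchone()


result = best_customer(conn, "widget", 2023)
```

Because the data is organized around the sales subject, the question reduces to a single grouped query over one table.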
2. Integrated:
• A DW is constructed by integrating multiple,
heterogeneous data sources such as relational
databases, flat files, and online transaction
records.
• It must resolve problems such as naming
conflicts and inconsistencies among units of
measure.
• Data cleaning and data integration techniques
are applied to ensure consistency in naming
conventions, encoding structures, attribute
measures, etc. among different data sources.
• E.g., hotel prices recorded in different currencies
are converted to a common currency when the data
is moved to the warehouse.
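The integration step can be sketched as follows, using the hotel price example. The source field names, the alias table, and the conversion rates are all hypothetical values chosen for illustration; a real warehouse would take rates from a maintained reference source.

```python
# Assumed fixed conversion rates, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

# Each hypothetical source system names the same attributes differently.
FIELD_ALIASES = {
    "hotel_name": "hotel", "HotelNm": "hotel",
    "price": "price", "room_rate": "price",
    "currency": "currency", "curr": "currency",
}


def integrate(record):
    """Resolve naming conflicts, then convert the price to one currency."""
    unified = {FIELD_ALIASES[key]: value for key, value in record.items()}
    rate = RATES_TO_USD[unified["currency"]]
    return {
        "hotel": unified["hotel"],
        "price_usd": round(unified["price"] * rate, 2),
    }


source_rows = [
    {"hotel_name": "Plaza", "price": 150, "currency": "USD"},
    {"HotelNm": "Ritz", "room_rate": 200, "curr": "EUR"},
]
warehouse_rows = [integrate(row) for row in source_rows]
```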
3. Time Variant:
• The time horizon for the data warehouse is
significantly longer than that of operational
systems.
• Operational database: current value data.
• Data warehouse data: provides information from a
historical perspective (e.g., past 5-10 years).
• Every important element in the data warehouse
contains time, either explicitly or implicitly.
4. Nonvolatile:
• Nonvolatile means that, once entered into the warehouse,
data should not change.
• This is logical because the purpose of a warehouse is to
enable you to analyze what has occurred.
• A DW is a physically separate store of data transformed
from the operational environment.
• Because operational updates of data do not occur in the data
warehouse environment, it does not require transaction
processing, recovery, and concurrency control
mechanisms.
• It requires only two operations in data accessing:
– Initial loading of data
– Access of data
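As a rough sketch of the two operations listed above, the hypothetical class below exposes only a load method and a read method, with no update or delete. The class name and its interface are assumptions for illustration, not a standard API.

```python
class ReadOnlyWarehouse:
    """Toy store that supports only loading and reading (hypothetical)."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Rows are stored as tuples so a loaded row cannot be modified in place.
        self._rows.extend(tuple(row) for row in rows)

    def query(self, predicate):
        # Access returns a new list; the stored history is left untouched.
        return [row for row in self._rows if predicate(row)]


dw = ReadOnlyWarehouse()
dw.load([("2023-01", "widget", 30.0), ("2023-02", "widget", 45.0)])
dw.load([("2023-03", "widget", 20.0)])
february_onward = dw.query(lambda row: row[0] >= "2023-02")
```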
5. Accessible:
• The primary purpose of a data warehouse is to provide
readily accessible information to end users.

6. Process Oriented:
• It is important to view data warehousing as a process
for delivery of information.
• The maintenance of a DW is ongoing and iterative in
nature.
Characteristics:
• Smaller number of (concurrent) users.
• Instant response is less important (only for interactively composing
reports).
• Read-only access by users.
• Most data access will be targeted at a small partition of the data: the
last month or quarter.
• Database access is less frequent, but the queries are large and
complicated and access many rows per table.
• Irregular workload of primarily long-running, complex read-only
transactions instead of a constant high transaction rate.
• Loads from the operational data store only insert new records; existing
ones do not get changed (updated).
• Bulk load from operational data store, no single-record inserts (at most
once daily).
• Database design partly de-normalized and redundant for better
performance, using a star or snowflake schema. Database design is
data-driven, not workflow-driven.
• Large storage capacity for historical data.
• May also contain aggregate data.
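The star schema mentioned in the list above can be sketched with one fact table surrounded by dimension tables. The table names, columns, and rows here are hypothetical and kept deliberately small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the subjects (hypothetical names).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    -- The fact table holds the measures and points at each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        qty INTEGER,
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'gadget', 'hardware');
    INSERT INTO dim_date VALUES (10, 2023, 1), (11, 2023, 2);
    INSERT INTO fact_sales VALUES (1, 10, 3, 30.0), (1, 11, 2, 20.0), (2, 11, 1, 50.0);
    """
)

# A typical analytical query: aggregate a measure by a dimension attribute.
totals_by_quarter = conn.execute(
    """
    SELECT d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY d.quarter
    ORDER BY d.quarter
    """
).fetchall()
```

A snowflake schema would further normalize the dimension tables, for example by moving `category` into its own table.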
Benefits of data warehousing
Some of the benefits that a data warehouse provides are as follows:
• A data warehouse provides a common data model for all data of interest
regardless of the data's source.
• DW makes it easier to report and analyze information than it would be if
multiple data models were used to retrieve information such as sales
invoices, order receipts, general ledger charges, etc.
• Prior to loading data into the data warehouse, inconsistencies are
identified and resolved. This greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse
users so that, even if the source system data is purged over time, the
information in the warehouse can be stored safely for extended periods of
time.
• Because they are separate from operational systems, data warehouses
provide retrieval of data without slowing down operational systems.
• Data warehouses can work in conjunction with and, hence, enhance the
value of operational business applications, notably customer relationship
management (CRM) systems.
• Data warehouses facilitate decision support system applications such as
trend reports (e.g., the items with the most sales in a particular area
within the last two years), exception reports, and reports that show actual
performance versus goals.
Data Warehousing:
• Data warehousing is a process of constructing and
using data warehouses.
• The classic definition of the data warehouse
focuses on data storage.
• However, the means to retrieve and analyze data,
to extract, transform and load data, and to manage
the data dictionary are also considered essential
components of a data warehousing system.
• Many references to data warehousing use this
broader context.
• Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to
extract, transform, and load data into the repository,
and tools to manage and retrieve metadata.
Extract, Transform, and Load (ETL) is a process in data
warehousing that involves:
• extracting data from outside sources,
• transforming it to fit business needs
• loading it into the end target, i.e. the data warehouse.
1) Extract:
– The first part of an ETL process is to extract the data from the source
systems.
– Most data warehousing projects consolidate data from different source
systems.
– Each separate system may also use a different data organization
format.
– Common data source formats are relational databases and flat files.
– Extraction converts the data into a format for transformation
processing.
• An intrinsic part of the extraction is the parsing of extracted
data, which checks whether the data meets an expected
pattern or structure. If not, the data may be rejected entirely.
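A minimal sketch of the parsing check described above, assuming a flat-file (CSV) source. The column names, the date pattern, and the choice to reject individual rows rather than the whole file are assumptions made for illustration.

```python
import csv
import io
import re

# Assumed expected structure of the source file (hypothetical columns).
EXPECTED_COLUMNS = ["order_id", "order_date", "qty", "unit_price"]
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def extract(source_text):
    """Parse CSV text; return (accepted_rows, rejected_rows)."""
    reader = csv.DictReader(io.StringIO(source_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        # Wrong structure: the whole extract is rejected.
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    accepted, rejected = [], []
    for row in reader:
        try:
            if not DATE_PATTERN.match(row["order_date"]):
                raise ValueError("order_date does not match YYYY-MM-DD")
            parsed = {
                "order_id": int(row["order_id"]),
                "order_date": row["order_date"],
                "qty": int(row["qty"]),
                "unit_price": float(row["unit_price"]),
            }
        except (ValueError, TypeError):
            rejected.append(row)
            continue
        accepted.append(parsed)
    return accepted, rejected


SAMPLE = (
    "order_id,order_date,qty,unit_price\n"
    "1,2024-01-05,2,9.50\n"
    "2,05/01/2024,1,4.00\n"
    "3,2024-01-06,three,1.25\n"
)
accepted, rejected = extract(SAMPLE)
```

Here the second row fails the date pattern and the third has a non-numeric quantity, so only the first row is passed on to the transform stage.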
2) Transform:
• The transform stage applies a series of rules or functions to the
extracted data.
• Some data sources will require very little or even no manipulation of data.
• In other cases, one or more of the following transformation types may be
required to meet the business and technical needs of the end target:
– Selecting only certain columns to load (or selecting null columns not to
load).
– Translating coded values (e.g., if the source system stores 1 for male
and 2 for female, but the warehouse stores M for male and F for female) .
– Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to M)
– Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
– Filtering
– Sorting
– Joining together data from multiple sources.
– Aggregation.
– Transposing or pivoting (turning multiple columns into multiple rows or
vice versa)
– Splitting a column into multiple columns (e.g., putting a comma-
separated list specified as a string in one column as individual values in
different columns)
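Several of the transformation types listed above can be sketched in a few lines. The input field names, the gender code table, and the sample rows are hypothetical; the derived value follows the `sale_amount = qty * unit_price` example from the list.

```python
# Translation table for coded values: source stores 1/2, warehouse stores M/F.
GENDER_CODES = {"1": "M", "2": "F"}


def transform(rows):
    """Filter, split, translate, derive, and sort the extracted rows."""
    out = []
    for row in rows:
        if row["qty"] <= 0:  # filtering: drop rows with no quantity sold
            continue
        # Splitting one column ("Last, First") into two columns.
        last, first = (part.strip() for part in row["name"].split(",", 1))
        out.append({
            "first_name": first,
            "last_name": last,
            "gender": GENDER_CODES[row["gender_code"]],    # translating codes
            "sale_amount": row["qty"] * row["unit_price"],  # derived value
        })
    out.sort(key=lambda r: r["sale_amount"], reverse=True)  # sorting
    return out


def aggregate_by_gender(rows):
    """Aggregation: total sale_amount per gender."""
    totals = {}
    for row in rows:
        totals[row["gender"]] = totals.get(row["gender"], 0.0) + row["sale_amount"]
    return totals


extracted = [
    {"name": "Doe, Jane", "gender_code": "2", "qty": 2, "unit_price": 5.0},
    {"name": "Roe, Rick", "gender_code": "1", "qty": 0, "unit_price": 3.0},
    {"name": "Poe, Adam", "gender_code": "1", "qty": 3, "unit_price": 4.0},
]
transformed = transform(extracted)
totals = aggregate_by_gender(transformed)
```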
3) Load:
• The load phase loads the data into the end target,
usually the data warehouse.
• Depending on the requirements of the organization,
this process varies widely. Some data warehouses
might overwrite existing information weekly with
cumulative, updated data, while other DWs (or even
other parts of the same DW) might add new data in
a historized form, e.g. hourly.
• As the load phase interacts with a database, the
constraints defined in the database schema as well
as in triggers activated upon data load apply (e.g.
uniqueness, referential integrity, mandatory fields),
which also contribute to the overall data quality
performance of the ETL process.
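The effect of schema constraints during the load phase can be sketched with SQLite. The table names and rows are hypothetical, and rows are inserted one at a time only so that each violation can be reported; a production load would normally be a bulk operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,                                       -- uniqueness
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),  -- referential integrity
        sale_amount REAL NOT NULL                                           -- mandatory field
    );
    INSERT INTO dim_customer VALUES (1, 'Acme');
    """
)


def load(conn, rows):
    """Insert rows; return (number loaded, list of (row, error message))."""
    loaded, failed = 0, []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO fact_orders VALUES (?, ?, ?)", row)
            loaded += 1
        except sqlite3.IntegrityError as exc:
            failed.append((row, str(exc)))
    return loaded, failed


rows = [
    (100, 1, 19.0),   # valid
    (100, 1, 5.0),    # duplicate order_id
    (101, 99, 7.0),   # unknown customer_id
    (102, 1, None),   # missing mandatory sale_amount
]
loaded, failed = load(conn, rows)
```

Only the first row reaches the warehouse; the other three are stopped by the schema itself, which is how the constraints contribute to data quality.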
Need for a separate data warehouse:
• Why not perform online analytical processing (OLAP)
directly on operational databases?
• Why spend additional time and resources to
construct a separate data warehouse?
1) The major reason for such separation is to promote high
performance of both systems.
2) Running OLAP operations on the operational database reduces the
throughput of the online transaction processing (OLTP) system.
3) The separation is based on the different structures, contents
and uses of the data in the two systems.
• Since the two systems provide quite different
functionalities and require different kinds of data, it is
necessary to maintain separate databases.
