What Is Data Warehousing

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 13

What is Data warehousing?

Answer
A data warehouse can be considered as a storage area where interest specific or relevant data
is stored irrespective of the source. What actually is required to create a data warehouse can
be considered as Data Warehousing. Data warehousing merges data from multiple sources
into an easy and complete form.

What are fact tables and dimension tables?

Answer
As mentioned, data in a warehouse comes from the transactions. Fact table in a data
warehouse consists of facts and/or measures. The nature of data in a fact table is usually
numerical.

On the other hand, dimension table in a data warehouse contains fields used to describe the
data in fact tables. A dimension table can provide additional and descriptive information
(dimension) of the field of a fact table.

e.g. If I want to know the number of resources used for a task, my fact table will store the
actual measure (of resources) while my Dimension table will store the task and resource
details.

Hence, the relation between a fact and dimension table is one to many.

What is ETL process in data warehousing?

Answer
ETL is Extract Transform Load. It is a process of fetching data from different sources,
converting the data into a consistent and clean form and load into the data warehouse.
Different tools are available in the market to perform ETL jobs.

Explain the difference between data mining and data warehousing.

Answer
Data warehousing is merely extracting data from different sources, cleaning the data and
storing it in the warehouse. Where as data mining aims to examine or explore the data using
queries. These queries can be fired on the data warehouse. Explore the data in data mining
helps in reporting, planning strategies, finding meaningful patterns etc.

E.g. a data warehouse of a company stores all the relevant information of projects and
employees. Using Data mining, one can use this data to generate different reports like profits
generated etc.

What is an OLTP system and OLAP system?

Answer
OLTP: Online Transaction and Processing helps and manages applications based on
transactions involving high volume of data. Typical example of a transaction is commonly
observed in Banks, Air tickets etc. Because OLTP uses client server architecture, it supports
transactions to run cross a network.
OLAP: Online analytical processing performs analysis of business data and provides the ability
to perform complex calculations on usually low volumes of data. OLAP helps the user gain an
insight on the data coming from different sources (multi dimensional).

What is PDAP?

Answer
A data cube stores data in a summarized version which helps in a faster analysis of data. The
data is stored in such a way that it allows reporting easily.

E.g. using a data cube A user may want to analyze weekly, monthly performance of an
employee. Here, month and week could be considered as the dimensions of the cube.

What is snow flake scheme design in database?

Answer
A snowflake Schema in its simplest form is an arrangement of fact tables and dimension
tables. The fact table is usually at the center surrounded by the dimension table. Normally in
a snow flake schema the dimension tables are further broken down into more dimension
table.

E.g. Dimension tables include employee, projects and status. Status table can be further
broken into status_weekly, status_monthly.

What is analysis service?

Answer
Analysis service provides a combined view of the data used in OLAP or Data mining. Services
here refer to OLAP, Data mining.

Explain sequence clustering algorithm.

Answer
Sequence clustering algorithm collects similar or related paths, sequences of data containing
events.

E.g. Sequence clustering algorithm may help finding the path to store a product of “similar”
nature in a retail ware house.

Explain discrete and continuous data in data mining.

Answer
Discreet data can be considered as defined or finite data. E.g. Mobile numbers, gender.

Continuous data can be considered as data which changes continuously and in an ordered
fashion. E.g. age

Explain time series algorithm in data mining.


Answer
Time series algorithm can be used to predict continuous values of data. Once the algorithm is
skilled to predict a series of data, it can predict the outcome of other series.

E.g. Performance one employee can influence or forecast the profit

What is XMLA?

Answer
XMLA is XML for Analysis which can be considered as a standard for accessing data in OLAP,
data mining or data sources on the internet. It is Simple Object Access Protocol. XMLA uses
discover and Execute methods. Discover fetched information from the internet while Execute
allows the applications to execute against the data sources.

Explain the difference between Data warehousing and Business Intelligence.

Answer
Data Warehousing helps you store the data while business intelligence helps you to control
the data for decision making, forecasting etc.

Data warehousing using ETL jobs, will store data in a meaningful form. However, in order to
query the data for reporting, forecasting, business intelligence tools were born.

What is Dimensional Modeling?

Answer
Dimensional modeling is often used in Data warehousing. In simpler words it is a rational or
consistent design technique used to build a data warehouse. DM uses facts and dimensions of
a warehouse for its design. A snow and star flake schema represent data modeling.

What is surrogate key? Explain it with an example.

Answer
Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate is
not generated by the user but by the system. A primary difference between a primary key
and surrogate key in few databases is that PK uniquely identifies a record while a SK uniquely
identifies an entity.

E.g. an employee may be recruited before the year 2000 while another employee with the
same name may be recruited after the year 2000. Here, the primary key will uniquely identify
the record while the surrogate key will be generated by the system (say a serial number)
since the SK is NOT derived from the data.

What is the purpose of Factless Fact Table?

Answer
Fact less tables are so called because they simply contain keys which refer to the dimension
tables. Hence, they don’t really have facts or any information but are more commonly used
for tracking some information of an event.

Eg. To find the number of leaves taken by an employee in a month.


What is a level of Granularity of a fact table?

Answer
A fact table is usually designed at a low level of Granularity. This means that we need to find
the lowest level of information that can store in a fact table.

E.g. Employee performance is a very high level of granularity. Employee_performance_daily,


employee_perfomance_weekly can be considered lower levels of granularity.

Explain the difference between star and snowflake schemas.

Answer
A snow flake schema design is usually more complex than a start schema. In a start schema
a fact table is surrounded by multiple fact tables. This is also how the Snow flake schema is
designed. However, in a snow flake schema, the dimension tables can be further broken down
to sub dimensions. Hence, data in a snow flake schema is more stable and standard as
compared to a Start schema.

E.g. Star Schema: Performance report is a fact table. Its dimension tables include
performance_report_employee, performance_report_manager

Snow Flake Schema: the dimension tables can be broken to


performance_report_employee_weekly, monthly etc.

What is the difference between view and materialized view?

Answer
A view is created by combining data from different tables. Hence, a view does not have data
of itself.

On the other hand, Materialized view usually used in data warehousing has data. This data
helps in decision making, performing calculations etc. The data stored by calculating it before
hand using queries.

When a view is created, the data is not stored in the database. The data is created when a
query is fired on the view. Whereas, data of a materialized view is stored.

What is a Cube and Linked Cube with reference to data warehouse?

Answer
A data cube stores data in a summarized version which helps in a faster analysis of data.
Where as linked cubes use the data cube and are stored on another analysis server. Linking
different data cubes reduces the possibility of sparse data.

E.g. A data cube may store the Employee performance. However in order to know the hours
which calculated this performance, one can create another cube by linking it to the root cube
(in this case employee performance).

What is junk dimension?


Answer
In scenarios where certain data may not be appropriate to store in the schema, this data (or
attributes) can be stored in a junk dimension. The nature of data of junk dimension is usually
Boolean or flag values.

E.g. whether the performance of employee was up to the mark? , Comments on performance.

What are fundamental stages of Data Warehousing?

Answer
Stages of a data warehouse helps to find and understand how the data in the warehouse
changes.

At an initial stage of data warehousing data of the transactions is merely copied to another
server. Here, even if the copied data is processed for reporting, the source data’s
performance won’t be affected.

In the next evolving stage, the data in the warehouse is updated regularly using the source
data.

In Real time Data warehouse stage data in the warehouse is updated for every transaction
performed on the source data (E.g. booking a ticket)

When the warehouse is at integrated stage, It not only updates data as and when a
transaction is performed but also generates transactions which are passed back to the source
online data.

What is active data warehousing?

The term "Active Data Warehousing (ADW)" was coined in 1999 when a Teradata
Magazinearticle described how a few customers were using data warehouse insights
in operational processes. Since then, dozens of Teradata clients have deployed highly
competitive applications using ADW techniques.

The trend started by Teradata customers and Teradata Engineering now goes by the
names "pervasive BI," "operational BI," "real-time BI" and Active Data Warehousing
(ADW). Active Data Warehousing describes the technical techniques needed to
achieve those goals.
An enterprise data warehouse is "just another database" but with the distinction that
it contains the most trusted and comprehensive data owned by the enterprise. It can
be used in all the ways databases can be used — in eCommerce web sites, employee
portals, web services, Java and .NET applications, and embedded in batch or
workflow processes.

The key to ADW applications is a robust, mature mixed-workload management


subsystem. Teradata can help you concurrently run dozens of complex reports,
multiple data load jobs, data mining and hundreds of fast tactical queries while
meeting service level agreements for the front line user.

Active Data Warehousing Advantages

In the broadest sense, Active Data Warehousing extends the traditional data
warehouse by enabling you to:

• Provide access to analytic insights by customers, partners and suppliers


• Deliver up-to-the-minute fresh data in operational reports
• Develop new, highly competitive applications using web services, SOA and
Web 2.0 technologies
• Simplify the IT architecture and reduce costs by consolidating ODS
repositories

Vital Active Elements

Your Active Data Warehouse requires attention to six critical success factors called "active
elements":

Active Access
Frontline users access the data warehouse for operational decision making with a service level
agreement of five seconds or less (also known as "web speed").

Active Load
Near-real time data enter the data warehouse via mini-batch loading, replication services or
continuous streams of data from message queuing systems.

Active Events
Event-driven architectures and business activity monitoring detect significant business events
and issue alerts for timely, informed decisions.

Active Workload Management


Dynamic priority management inside the data warehouse ensures service levels are achieved
across multiple user communities and workload types.

Active Enterprise Integration


Integration tools and designs connect the data warehouse to web sites, portals, SOA, web
services, enterprise service busses, workflow and batch systems.

Active Availability
Policies, procedures and redundant hardware ensure the entire information supply chain is
protected when subsystem failure occurs.
What is difference between dependent data warehouse and independent data
wareshouse ?

Data Mart Does Not Equal Data Warehouse


• Bill Inmon
• DM Direct, November 1999
Editor's Note: We are proud to introduce a new DM Direct feature. Starting with this
issue of DM Direct, once a month we will be including a "classic" article from DM
Review's archives. These classic articles provide valuable business intelligence/data
warehousing insights and have been selected by the editorial department of DM
Review.
"...The data warehouse is nothing more than the union of all the data
marts...," Ralph Kimball, December 29, 1997.
"You can catch all the minnows in the ocean and stack them together and they still
do not make a whale," Bill Inmon, January 8, 1998.
The single most important issue facing the information technology manager this year
is whether to build the data warehouse first or the data mart first. The data mart
vendors have said that data warehouses are difficult and expensive to build, take a
long time to design and develop, require thought and investment, and mandate that
the corporation face difficult issues such as integration of legacy data, managing
massive volumes of data and cost justifying the entire DSS/data warehouse effort to
the management committee. The picture painted by the data mart advocates for
building the data warehouse is gloomy. It is also self-serving and incorrect.

The data mart vendors look upon the data warehouse as an obstacle between
themselves and the revenue that comes from making sales. Of course, they want to
shun the data warehouse. The data warehouse lengthens their sales cycle,
regardless of the long-term effect of building a bunch of data marts with no data
warehouse. The data mart vendors are selling a very short-term perspective at the
expense of long-term architectural success.

The data mart advocates suggest that there may be alternate, much easier paths to
DSS success than building a data warehouse. One of those paths is to build several
data marts and when they grow big enough, call them a data warehouse, rather than
build an actual data warehouse. The data mart advocates argue that the data mart
can be built much more quickly and cheaply than a warehouse. When you build the
data mart there is no need for a great amount of organizational hassle or discipline
and no concern for the long-term architecture that is created by the data marts.

Unfortunately, by avoiding the visceral organizational and design issues of


warehousing, the data mart advocates miss much of the point of warehousing. By
building an architecture consisting entirely of data marts, the data mart advocates
lead the organization into an even larger mess. Instead of messy legacy operational
systems, now we have messy legacy operational systems AND messy data marts.
Stovepipe data marts and stovepipe DSS applications are what result from building
nothing but data marts. There is no integration when all that you build is data marts.
And a DSS environment without integration is like a man without a skeletal system--
hardly a useful, viable entity.
A Change of Approaches
In the early days of the data warehouse marketplace, the data mart vendors tried to
jump on the warehouse gravy train by proclaiming that a data warehouse was the
same thing as a data mart. In trade show after trade show, the data mart vendors
confused people with what a data warehouse is and what a data mart is. The data
mart vendors spread half truths and misinformation about data warehousing. The
result was confusion.

The obfuscation sowed by the data mart vendors caused a few confused customers
to build data marts with no actual warehouse. After about the third data mart, the
customer discovered something was rotten in Denmark. The architectural deficiency
of building nothing but data marts was unmasked. The customer discovered that
when you don't build a data warehouse, there is:
• massive redundancy of detailed and historical data from one data mart to
another,
• inconsistent and irreconcilable results from one data mart to the next,
• an unmanageable interface between the data marts and the legacy
application environment, etc.
In short order, the world discovered that a DSS environment without a data
warehouse was an extremely unsatisfactory thing.

Now that the world has found that building data marts is not the proper way to
proceed in DSS, the data mart vendors and their spokesmen are back again and are
sowing a different brand of confusion. This time they have altered their original
words a little and have promised a new and improved path to easy success. In a
slight twist of concept from the first time around, the notion now being spread is that
a data warehouse is merely a collection of integrated data marts (whatever that is).
The notion that multiple data marts can be integrated is oxymoronic. The whole
essence of data marts is that mart users do their own thing so that they don't have
to integrate with other marts.

Simply stated, for a variety of very powerful reasons, you cannot build data marts,
watch them grow and magically turn them a data warehouse when they reach a
certain size. And by the same token, integrating data across data marts is equally
unthinkable because each department that owns its own data mart has its own
unique specifications.
In order to understand why one or more data marts cannot be transformed into a
data warehouse, you must first understand what a data mart is and a data
warehouse is.
Different Architectural Structures
A data mart and a data warehouse are essentially different architectural structures,
even though when viewed from afar and superficially, they look to be very similar.
What is a Data Mart?
A data mart is a collection of subject areas organized for decision support based on
the needs of a given department. Finance has their data mart, marketing has theirs,
sales has theirs and so on. And the data mart for marketing only faintly resembles
anyone else's data mart.

Perhaps most importantly, the individual departments OWN the hardware, software,
data and programs that constitute the data mart. The rights of ownership allow the
departments to bypass any means of control or discipline that might coordinate the
data found in the different departments.

Each department has its own interpretation of what a data mart should look like and
each department's data mart is peculiar to and specific to its own needs. Typically,
the database design for a data mart is built around a star-join structure that is
optimal for the needs of the users found in the department. In order to shape the
star join, the requirements of the users for the department must be gathered. The
data mart contains only a modicum of historical information and is granular only to
the point that it suits the needs of the department. The data mart is typically housed
in multidimensional technology which is great for flexibility of analysis but is not
optimal for large amounts of data. Data found in data marts is highly indexed.

There are two kinds of data marts--dependent and independent. A dependent data
mart is one whose source is a data warehouse. An independent data mart is one
whose source is the legacy applications environment. All dependent data marts are
fed by the same source--the data warehouse. Each independent data mart is fed
uniquely and separately by the legacy applications environment. Dependent data
marts are architecturally and structurally sound. Independent data marts are
unstable and architecturally unsound, at least for the long haul. The problem with
independent data marts is that their deficiencies do not make themselves manifest
until the organization has built multiple independent data marts.
What is a Data Warehouse?
Data warehouses are significantly different from data marts. Data warehouses are
arranged around the corporate subject areas found in the corporate data model.
Usually the data warehouse is built and owned by centrally coordinated
organizations, such as the classic IT organization. The data warehouse represents a
truly corporate effort.
There may or may not be a relationship between any department's subject areas and
the corporation's subject areas. The data warehouse contains the most granular data
the corporation has. Data mart data is usually much less granular than data
warehouse data (i.e., data warehouses contain more detail information while most
data marts contain more summarized or aggregated data). The data warehouse data
structure is an essentially normalized structure. The structure and the content of the
data in the data warehouse do not reflect the bias of any particular department, but
represent the corporation's needs for data. The volume of data found in the data
warehouse is significantly different from the data found in the data mart. Because of
the volume of data found in the data warehouse, the data warehouse is indexed very
lightly. The data warehouse contains a robust amount of historical data. The
technology housing the data warehouse is optimized on handling an industrial
strength amount of data. The data warehouse data is integrated from the many
legacy sources.

In short, there are very significant differences between the structure and content of
data that resides in a data warehouse and the structure and content of data that
resides in a data mart.

Figure 1 shows some of the differences between a data mart and a data warehouse.
Because data that is granular, integrated and historical resides in a data warehouse,
the data warehouse attracts a significant volume of data. Because the warehouse
attracts a significant amount of data, it is advisable to build the warehouse
iteratively. If you don't build the warehouse iteratively, you will spend years building
the warehouse. From the very first literature that was ever written on data
warehousing, it has been recognized that there was a need to get concrete, tangible
results in front of the end user as quickly as possible. The best advice of the writers
and consultants of the data warehousing industry has consistently been to build the
warehouse quickly and to avoid large, lengthy efforts.

Interestingly, the data mart advocates and their spokesmen claim that data
warehouses take a long time to build. It is only in the hype issued by the data mart
advocates that it has ever been suggested that the warehouse be built in "galactic"
proportions.

Figure 2 shows the recommended construction path for data warehouses.


The most recent theory of the data mart advocates is that you can build one or more
data marts, integrate them (although no one is very clear as to what that means)
and then when they grow to a certain size, they can be (magically!) turned into a
warehouse. This suggestion is sadly mistaken for a variety of reasons:

* The data mart is designed to suit the needs of a department. Many departments
with very different objectives must be satisfied. That is why there are many different
data marts in the corporation, each with its own distinctive look and feel. The data
warehouse is designed to suit the collective needs of the entire corporation. A given
design can be optimal for a single department or the corporation, but not both. The
design objectives for the corporation are very different from the design objectives for
a given department.

* The granularity of data in the data mart is very different from the granularity of
data in the data warehouse. The data mart contains aggregated or summarized data.
The data warehouse contains the most detailed data that is found in the corporation.
Since the data mart granularity is much higher than that found in the data
warehouse, you cannot easily decompose the data mart granularity into data
warehouse granularity. But you can always go the other direction and summarize
detailed units of data into summarizations.
• The structure of the data in the data mart (commonly a star join structure)
is only faintly compatible with the structure of the data in the warehouse (a
normalized structure).
• The amount of historical data found in the data mart is very different from
the history of the data found in the warehouse. Data warehouses contain
robust amounts of history. Data marts contain only modest amounts of
history.
• The subject areas found in the data mart are only faintly related to the
subject areas found in the data warehouse.
• The relationships found in the data mart are not those relationships that
are found in the data warehouse.
• The types of queries satisfied in the data mart are quite different from
those queries found in the data warehouse.
• The kind of users (farmers) that are found in the marts are quite different
from the type of users (explorers) that are found in the data warehouse.
• The key structures found in the data mart are significantly different from
those key structures found in the data warehouse, and so forth.
Reality
There are simply MAJOR, MAJOR significant differences between the data mart and
the data warehouse environment. The assertion that a data mart can be turned into
a data warehouse when it reaches a certain size or that data marts can be integrated
together is no more valid than saying that when a tumbleweed grows large enough
that it can be turned into an oak tree. Reality and genetics being what they are, it is
true that a tumbleweed and an oak tree are, at one point in their life, green living
organisms planted in the soil and are approximately the same size. But just because
those two plants share a few basic characteristics at some moment in time does not
mean that a tumbleweed can be turned into an oak tree. Only a misinformed person
would mistake a tumbleweed for an oak tree at any stage in the life of the plants.
Data Warehousing vs. Business Intelligence

The differentials are:

DW - is a way of storing data and creating information through leveraging data marts. DM's are segments or
categories of information and/or data that are grouped together to provide 'information' into that segment or
category. DW does not require BI to work. Reporting tools can generate reports from the DW.

BI - is the leveraging of DW to help make business decisions and recommendations. Information and data
rules engines are leveraged here to help make these decisions along with statistical analysis tools and data
mining tools.

You will find that BI is much like ERP in that it can be extremely expensive and invasive to your firm and
there is a wide range between the offerings - low end to high end - which facilitates the pricing. There is a
long list of tools to select from. There are also services that provide this as an outsource. Some of these
services will allow you to eventually 'own' the solution and in-source it at a future date. Like anything else,
this comes at a price. This scenario works well for those who do not have a high caliber IT staff and would
like to get results with a short ramp up time, basically because the system is already built. Your rules and
reports just have to be generated. That is a bit oversimplified, but you get the picture.

Describe dimensional Modeling

SQL Server 2005, and especially Analysis Services 2005, has embraced industry-standard dimensional modeling like a
long-lost friend. When you build Analysis Services databases in SQL Server 2000, you have to compromise in creating
the dimensional model. Specifically, you have to create multiple OLAP virtual dimensions and use member properties to
capture attributes that don't fit into hierarchical dimensions. SQL Server 2005 Analysis Services uses attribute-based
dimensions, so every attribute in a dimension is automatically treated as a standalone hierarchy. Now, you can use the
metadata layer that defines Analysis Services cubes (the Unified Dimensional Model—UDM) to define one customer
dimension instead of a half-dozen artificially separate dimensions. One benefit of this deeper dimensional support is
that data warehouse and BI system developers no longer need to convert standard dimensional-modeling techniques to
the limited world view of previous SQL Server versions. Instead, you can build dimensions that realistically represent
how your business operates and can change as your business changes. In this article, we define dimensional models,
describe the basic elements and techniques that support them, and offer a dimensionally based data architecture for
your data warehouse and BI system. . . .

What are Dimensions and various types of Dimensions?


Dimensions contains Textual attributes.
It is Wide not Deep.(whereas FACT is deep not wide)

about Dimensions..

The Slowly Changing Dimensions Wizard creates mappings to


load slowly changing dimension tables:

Type 1 Dimension mapping. Loads a slowly changing dimension


table by inserting new dimensions and overwriting existing
dimensions. Use this mapping when you do not want a history
of previous dimension data.
Type 2 Dimension/Version Data mapping. Loads a slowly
changing dimension table by inserting new and changed
dimensions using a version number and incremented primary
key to track changes. Use this mapping when you want to
keep a full history of dimension data and to track the
progression of changes.
Type 2 Dimension/Flag Current mapping. Loads a slowly
changing dimension table by inserting new and changed
dimensions using a flag to mark current dimension data and
an incremented primary key to track changes. Use this
mapping when you want to keep a full history of dimension
data, tracking the progression of changes while flagging
only the current dimension.
Type 2 Dimension/Effective Date Range mapping. Loads a
slowly changing dimension table by inserting new and
changed dimensions using a date range to define current
dimension data. Use this mapping when you want to keep a
full history of dimension data, tracking changes with an
exact effective date range.
Type 3 Dimension mapping. Loads a slowly changing dimension
table by inserting new dimensions and updating values in
existing dimensions. Use this mapping when you want to keep
the current and previous dimension values in your dimension
table.

apart from these: there are...

Confirmed dimensions

Junk Dimensions
Degenerated dimension

What is degenerate dimension table

A Degenerate dimension is a Dimension which has only a single attribute.

This dimension is typically represented as a single field in a fact table.

The data items thar are not facts and data items that do not fit into the existing dimensions
are termed as Degenerate Dimensions.

Degenerate Dimensions are the fastest way to group similar transactions.

Degenerate Dimensions are used when fact tables represent transactional data.

They can be used as primary key for the fact table but they cannot act as foreign keys.

You might also like