Computer Fundamentals (Final)
Presentation
On
Data Warehousing
Submitted By:
Siddharth Gaur
Anuj Dutta
Submitted To:
Sajal Mathur
Kanchan Bhatnagar
Computer Fundamentals
Input Devices (keyboard), Output Devices (monitor) and the CPU
A model expression for computer operation is the following: any processing that a computer performs, no matter how sophisticated, can be reduced to simple functions (instructions).
Data Warehousing
A Definition of Data Warehousing
By Michael Reed
The data warehousing market consists of tools, technologies, and
methodologies that allow for the construction, usage, management, and
maintenance of the hardware and software used for a data warehouse, as well
as the actual data itself.
In order to clear up some of the confusion that is rampant in the market, here
are some definitions:
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process." He defined the terms in the sentence as follows:
Subject Oriented:
Data that gives information about a particular subject instead of about a
company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.
Data Modeling:
Designing the warehouse's data model is an iterative process; the design typically goes through a number of iterations before the model can be stabilized. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, the model cannot easily be changed.
Data Acquisition:
This is the process of moving company data from the source systems into the
warehouse. It is often the most time-consuming and costly effort in the data
warehousing project, and is performed with software products known as ETL
(Extract/Transform/Load) tools. There are currently over 50 ETL tools on the
market. The data acquisition phase can cost millions of dollars and take months
or even years to complete. Data acquisition is then an ongoing, scheduled process, executed to keep the warehouse current to a pre-determined point in time (e.g., the warehouse is refreshed monthly).
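For illustration, a minimal ETL step could look like the following Python sketch. The file name, table name and column layout are invented purely to show the extract/transform/load pattern; real ETL tools work at a far larger scale.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from a source-system export (here, a CSV file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: reshape and normalize fields to match the warehouse schema.
        return [
            {"customer_name": r["customer_name"].strip().upper(),
             "amount": float(r["amount"])}
            for r in rows
        ]

    def load(rows, conn):
        # Load: insert the cleaned rows into the warehouse table.
        conn.executemany(
            "INSERT INTO sales (customer_name, amount)"
            " VALUES (:customer_name, :amount)",
            rows,
        )
        conn.commit()

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_name TEXT, amount REAL)")
    load(transform(extract("sales_export.csv")), conn)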
Data Cleansing:
A data warehouse that contains incorrect data is not only useless, but also very
dangerous. The whole idea behind a data warehouse is to enable decision-making. If a
high level decision is made based on incorrect data in the warehouse, the company
could suffer severe consequences, or even complete failure. Data cleansing is a
complicated process that validates and, if necessary, corrects the data before it is
inserted into the warehouse. For example, the company could have three "Customer
Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.",
and one as "International Business Machines". Obviously, these are all the same
customer. Someone in the organization must make a decision as to which is correct,
and then the data cleansing tool will change the others to match the rule. This process
is also referred to as "data scrubbing" or "data quality assurance". It can be an
extremely complex process, especially if some of the warehouse inputs are from older
mainframe file systems (commonly referred to as "flat files" or "sequential files").
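A drastically simplified version of the "IBM" rule above could be expressed in Python as follows. This is only a sketch: real cleansing tools apply large rule sets, fuzzy matching, and manual review queues.

    # Cleansing rule: map known variants of a customer name to the agreed value.
    CANONICAL_NAMES = {
        "IBM": "International Business Machines",
        "I.B.M.": "International Business Machines",
        "International Business Machines": "International Business Machines",
    }

    def cleanse_customer_name(name):
        # Return the canonical spelling; leave unknown names unchanged
        # (a real tool would flag them for manual review instead).
        return CANONICAL_NAMES.get(name.strip(), name)

    assert cleanse_customer_name("I.B.M.") == "International Business Machines"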
OLAP (On-Line Analytical Processing) tools:
Tools that allow the user to look at the data from a number of different "angles".
These tools often use a multi-dimensional database referred to as a "cube".
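As a rough illustration, one "slice" of such a cube can be imitated with a pivot table. The sketch below uses the pandas library with invented sales data; a real OLAP cube pre-aggregates many dimensions at once.

    import pandas as pd

    # Invented transaction data: one row per sale.
    sales = pd.DataFrame({
        "product": ["Book", "Book", "Magazine", "Magazine"],
        "region":  ["North", "South", "North", "South"],
        "amount":  [120.0, 80.0, 30.0, 45.0],
    })

    # View the same data from two "angles" at once: product by region.
    cube_slice = sales.pivot_table(values="amount", index="product",
                                   columns="region", aggfunc="sum")
    print(cube_slice)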
Query tools:
Tools that allow the user to issue SQL (Structured Query Language) queries against
the warehouse and get a result set back.
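In the simplest case this amounts to something like the following sketch, in which sqlite3 stands in for the warehouse database and the sales table is hypothetical:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # Issue a SQL query against the warehouse and fetch the result set.
    result_set = conn.execute(
        "SELECT customer_name, SUM(amount) FROM sales GROUP BY customer_name"
    ).fetchall()
    for customer, total in result_set:
        print(customer, total)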
The data extracted from diverse sources will have to be checked for integrity, cleaned, and then loaded into the warehouse for meaningful analysis. Therefore, harnessing efficient data cleaning and loading technologies is essential to the success of the warehouse.
Metadata is data about data. The metadata database stores the mapping rules and the maps between the data sources and the warehouse; the translation, transformation and cleaning rules; date and time stamps, system of origin, type of filtering and matching; and pre-calculated or derived fields together with the rules that produce them. In addition, the metadata database contains a description of the data in the data warehouse, the navigation paths and rules for browsing that data, the data directory, and the list of pre-designed, built-in queries available to the users.
Metadata Management
Throughout the entire process of identifying, acquiring, and querying the data,
metadata management takes place. Metadata is defined as "data about data". An
example is a column in a table. The datatype (for instance a string or integer) of the
column is one piece of metadata. The name of the column is another. The actual value
in the column for a particular row is not metadata - it is data. Metadata is stored in a
Metadata Repository and provides extremely useful information to all of the tools
mentioned previously. Metadata management has developed into an exacting science
that can provide huge returns to an organization. It can assist companies in analyzing
the impact of changes to database tables, tracking owners of individual data elements
("data stewards"), and much more. It is also required to build the warehouse, since the
ETL tool needs to know the metadata attributes of the sources and targets in order to
"map" the data properly. The BI tools need the metadata for similar reasons.
The concept of data warehousing dates back to the late 1980s [2] when IBM
researchers Barry Devlin and Paul Murphy developed the "business data warehouse".
In essence, the data warehousing concept was intended to provide an architectural
model for the flow of data from operational systems to decision support
environments. The concept attempted to address the various problems associated with
this flow - mainly, the high costs associated with it. In the absence of a data
warehousing architecture, an enormous amount of redundancy was required to
support multiple decision support environments. In larger corporations it was typical
for multiple decision support environments to operate independently. Each
environment served different users but often required much of the same data. The
process of gathering, cleaning and integrating data from various sources, usually long-established operational systems (referred to as legacy systems), was typically replicated in part for each environment. Moreover, the operational systems were
frequently reexamined as new decision support requirements emerged. Often new
requirements necessitated gathering, cleaning and integrating new data from the
operational systems that were logically related to prior gathered data.
Based on analogies with real-life warehouses, data warehouses were intended as
large-scale collection/storage/staging areas for corporate data. Data could be retrieved
from one central point or data could be distributed to "retail stores" or "data marts"
that were tailored for ready access by users.
There are two leading approaches to storing data in a data warehouse - the
dimensional approach and the normalized approach.
In the dimensional approach, transaction data are partitioned into either "facts", which
are generally numeric transaction data, or "dimensions", which are the reference
information that gives context to the facts. For example, a sales transaction can be
broken up into facts such as the number of products ordered and the price paid for the
products, and into dimensions such as order date, customer name, product number,
order ship-to and bill-to locations, and salesperson responsible for receiving the order.
A key advantage of a dimensional approach is that the data warehouse is easier for the
user to understand and to use. Also, the retrieval of data from the data warehouse
tends to operate very quickly. The main disadvantages of the dimensional approach
are: 1) In order to maintain the integrity of facts and dimensions, loading the data
warehouse with data from different operational systems is complicated, and 2) It is
difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
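The sales example above corresponds to the classic star schema layout sketched below. Table and column names are invented for illustration only.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Dimensions: reference data that gives context to the facts.
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, order_date TEXT)")
    # Facts: numeric transaction data, keyed by the dimensions.
    conn.execute("""
        CREATE TABLE fact_sales (
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id  INTEGER REFERENCES dim_product(product_id),
            date_id     INTEGER REFERENCES dim_date(date_id),
            quantity    INTEGER,   -- fact: number of products ordered
            price_paid  REAL       -- fact: price paid for the products
        )
    """)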
In the normalized approach, the data in the data warehouse are stored following, to a
degree, database normalization rules. Tables are grouped together by subject areas
that reflect general data categories (e.g., data on customers, products, finance, etc.).
The main advantage of this approach is that it is straightforward to add information
into the database. A disadvantage of this approach is that, because of the number of
tables involved, it can be difficult for users both to 1) join data from different sources
into meaningful information and then 2) access the information without a precise
understanding of the sources of data and of the data structure of the data warehouse.
These approaches are not mutually exclusive. Dimensional approaches can involve
normalizing data to a degree.
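By contrast, a normalized design splits a subject area such as "customers" into several related tables, as in the sketch below (names again invented). Note how answering even a simple question requires a join, which is the usability cost mentioned above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Subject area "customers", normalized: each fact is stored exactly once.
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("""
        CREATE TABLE customer_address (
            address_id  INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            city TEXT, country TEXT
        )
    """)
    # Answering a question requires joining the tables back together:
    conn.execute("""
        SELECT c.name, a.city
        FROM customer c JOIN customer_address a ON a.customer_id = c.customer_id
    """)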
Conforming information
Another important consideration in designing a data warehouse is which data to conform and how to conform it. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote the sex of an employee, while another operational system may use "Male" and "Female". Though this is a simple example, much of the work in implementing a data warehouse is devoted to making data with similar meanings consistent when it is stored in the warehouse. Typically, extract, transform, load tools are used in this work.
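In ETL terms, conforming the sex codes from the example can be a simple lookup applied during the transform step. The sketch below assumes "M"/"F" has been chosen as the conformed representation:

    # Each source system's codes are mapped onto the warehouse's chosen codes.
    SEX_CODES = {"M": "M", "F": "F", "Male": "M", "Female": "F"}

    def conform_sex(value):
        return SEX_CODES[value.strip()]

    assert conform_sex("Female") == "F"
    assert conform_sex("M") == "M"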
Master Data Management has the aim of conforming data that could be considered
"dimensions".
APPLICATIONS
Government organizations
Banks
Insurance Companies
Utilities Providers
Security Agencies
Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves; one of the more advanced uses is data mining, described below.
Data Mining
Data Mining is the process of analyzing data from different perspectives and
summarizing it into useful information that can be used to increase revenue,
cut costs, or both.
Data mining software is one of a number of analytical tools for analyzing data.
It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
Data-mining tools support the OLAP concept and include query-and-reporting tools, intelligent agents, multi-dimensional analysis tools, and statistical tools.
How We Benefit
Seeing the time periods in which products sell (which month/season), e.g. more educational books are sold in the months of August, September and October.
Gathering information about the activities and trends of our on-line users.
Comparing sales of a particular book/magazine to those of a competitor.
Identifying which books might sell together (even possibly which book and magazine; see the sketch after this list).
Customers' use of coupons and special offers.
Customers' preferences and behaviours (from the customer loyalty program).
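As a hedged illustration of the "sold together" point above, co-occurring pairs can be counted across orders, as in the sketch below. The order data is invented; real data mining tools use association-rule algorithms such as Apriori rather than this brute-force count.

    from collections import Counter
    from itertools import combinations

    # Invented orders: each order is the set of items bought together.
    orders = [
        {"Maths Book", "Science Magazine"},
        {"Maths Book", "Science Magazine", "Novel"},
        {"Novel", "Science Magazine"},
    ]

    pair_counts = Counter()
    for order in orders:
        # Count every unordered pair of items appearing in the same order.
        for pair in combinations(sorted(order), 2):
            pair_counts[pair] += 1

    # The most frequent pairs suggest items that might sell together.
    print(pair_counts.most_common(2))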
Data Mart
A data mart is a subset of an organizational data store,
usually oriented to a specific purpose or major data subject,
that may be distributed to support business needs. Data marts
are analytical data stores designed to focus on specific
business functions for a specific community within an
organization. Data marts are often derived from subsets of
data in a data warehouse, though in the bottom-up data
warehouse design methodology the data warehouse is created
from the union of organizational data marts.
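Continuing with the hypothetical sales table used earlier, a data mart for one business function can be derived as a simple subset of the warehouse, as sketched below (the mart name and query are invented):

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # A data mart for the marketing community: only the rows and columns
    # that community needs, derived from the warehouse's sales data.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS mart_marketing_sales AS
        SELECT customer_name, SUM(amount) AS total_amount
        FROM sales
        GROUP BY customer_name
    """)
    conn.commit()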
Benefits of a Data Warehouse
A data warehouse provides a common data model for all data of interest
regardless of the data's source. This makes it easier to report and analyze
information than it would be if multiple data models were used to retrieve
information such as sales invoices, order receipts, general ledger charges, etc.
Prior to loading data into the data warehouse, inconsistencies are identified
and resolved. This greatly simplifies reporting and analysis.
Information in the data warehouse is under the control of data warehouse users
so that, even if the source system data is purged over time, the information in
the warehouse can be stored safely for extended periods of time.
Because they are separate from operational systems, data warehouses provide
retrieval of data without slowing down operational systems.
Data warehouses can work in conjunction with and, hence, enhance the value
of operational business applications, notably customer relationship
management (CRM) systems.
Data warehouses facilitate decision support system applications such as trend
reports (e.g., the items with the most sales in a particular area within the last
two years), exception reports, and reports that show actual performance versus
goals.
Decision makers often cannot obtain information in a timely manner because they have to request and depend on IT staff for special reports, which often take a long time to generate. An Information Warehouse can deliver strategic intelligence to the decision makers and provide an insight into the overall situation. This greatly facilitates decision-makers in taking micro-level decisions in a timely manner without the need to depend on their IT staff. By organizing person- and land-related data into a meaningful Information Warehouse, Government decision makers can be empowered with a flexible tool that enables them to make informed policy decisions for citizen facilitation and to assess their impact on the intended section of the population.
They do not have to deal with the heterogeneous and sporadic information
generated by various state-level computerization projects as they can access
current data with a high granularity from the information warehouse.
They can take micro-level decisions in a timely manner without the need to
depend on their IT staff.
They can obtain easily decipherable and comprehensive information without
the need to use sophisticated tools.
They can perform extensive analysis of stored data to provide answers to the exhaustive queries of the administrative cadre. This helps them formulate more effective strategies and policies for citizen facilitation.
Citizens are the ultimate beneficiaries of the new policies formulated through the decision makers' and policy planners' extensive analysis of person- and land-related data.
They can view frequently asked queries whose results are already stored in the database and are shown immediately, saving the time required for processing.
They can have easy access to the Government policies of the state.
The web access to Information Warehouse enables them to access the public
domain data from anywhere.
Disadvantages
Data warehouses are not the optimal environment for unstructured data.
Because data must be extracted, transformed and loaded into the warehouse,
there is an element of latency in data warehouse data.
Over their life, data warehouses can have high costs. The data warehouse is
usually not static. Maintenance costs are high.
Data warehouses can get outdated relatively quickly. There is a cost of
delivering suboptimal information to the organization.
There is often a fine line between data warehouses and operational systems.
Duplicate, expensive functionality may be developed. Or, functionality may be
developed in the data warehouse that, in retrospect, should have been
developed in the operational systems and vice versa.