History of Data Warehouse


A data warehouse helps users understand and improve their organization's performance.


The need to warehouse data evolved as computer systems became more complex and had to handle
increasing amounts of information. However, data warehousing is not a new idea. Here are some
key events in the evolution of the data warehouse:
•1960s- Dartmouth and General Mills, in a joint research project, develop the terms dimensions
and facts.
•1970s- ACNielsen and IRI introduce dimensional data marts for retail sales.
•1983- Teradata Corporation introduces a database management system specifically
designed for decision support.
•Late 1980s- IBM workers Paul Murphy and Barry Devlin
develop the Business Data Warehouse.
•However, the concept as it is known today was defined by Bill Inmon, who is widely regarded as the father of the data
warehouse. He has written on a variety of topics covering the building, usage, and maintenance of the
warehouse and the Corporate Information Factory.
What is a Data Warehouse?

A data warehouse is a blend of technologies and components that allows the strategic use of
data. It is a technique for collecting and managing data from varied sources to provide meaningful
business insights.

A data warehouse system is also known by the following names:


•Decision Support System (DSS)
•Executive Information System
•Management Information System
•Business Intelligence Solution
•Analytic Application
•Data Warehouse
How does a Data Warehouse work?
A data warehouse works as a central repository where information arrives from one or more
data sources. Data flows into a data warehouse from transactional systems and other relational
databases. Data may be:
1.Structured
2.Semi-structured
3.Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in
the data warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges information coming from different sources into one comprehensive database. By
merging all of this information in one place, an organization can analyze its customers more
holistically and make sure that it has considered all the information available.
Data warehousing also makes data mining possible. Data mining is the search for patterns in
the data that may lead to higher sales and profits.
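The flow described above (data arrives from several sources, is transformed, and is loaded into one repository) can be sketched in a few lines of Python. This is only an illustration, not a real warehouse: the two source lists, their field names, and the `sales` table are assumptions made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records; field names and values are illustrative assumptions.
CRM_ROWS = [
    {"customer": "Alice", "amount": "120.50", "date": "2023-01-05"},
    {"customer": "Bob", "amount": "80.00", "date": "2023-01-06"},
]
WEB_ROWS = [
    {"user_name": "alice", "total_cents": 4500, "day": "2023-01-07"},
]


def transform(crm_rows, web_rows):
    """Bring both sources to one schema: (customer, amount, sale_date, source)."""
    unified = []
    for row in crm_rows:
        unified.append(
            (row["customer"].lower(), float(row["amount"]), row["date"], "crm")
        )
    for row in web_rows:
        unified.append(
            (row["user_name"].lower(), row["total_cents"] / 100, row["day"], "web")
        )
    return unified


def load(conn, rows):
    """Load the unified rows into the central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "customer TEXT, amount REAL, sale_date TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def total_by_customer(conn):
    """Analyze each customer across all sources at once."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(warehouse, transform(CRM_ROWS, WEB_ROWS))
print(total_by_customer(warehouse))
```

Because both sources end up in the same table, a single query can see a customer's activity from every system, which is the "holistic" view the text refers to.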
Types of Data Warehouse
The three main types of data warehouse are:
1. Enterprise Data Warehouse:
An Enterprise Data Warehouse is a centralized warehouse. It provides decision support services across the
enterprise. It offers a unified approach for organizing and representing data. It also provides the ability
to classify data according to subject and to give access according to those divisions.
2. Operational Data Store:
An Operational Data Store (ODS) is a data store required
when neither the data warehouse nor the OLTP systems support an organization's reporting needs. An ODS
is refreshed in real time. Hence, it is widely preferred for routine activities like storing
records of employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business,
such as sales or finance. In an independent data mart, data is collected directly from
sources.
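One way to picture "a subset of the data warehouse" is a filtered view over a warehouse table. The sketch below assumes a hypothetical `warehouse_facts` table and a `sales_mart` view in an in-memory SQLite database; real data marts are often separate physical stores, so treat this only as an illustration of the subset idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding data for every line of business.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, item TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "laptop", 900.0),
        ("finance", "audit fee", 300.0),
        ("sales", "mouse", 25.0),
    ],
)

# The "data mart": only the rows that matter to the sales department.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT item, amount FROM warehouse_facts WHERE department = 'sales'"
)

sales_rows = conn.execute(
    "SELECT item, amount FROM sales_mart ORDER BY item"
).fetchall()
print(sales_rows)
```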

General stages of Data Warehouse 


Early on, organizations made relatively simple use of data warehousing. Over time, more
sophisticated uses emerged. The following are the general stages of use of the data
warehouse:
Offline Operational Database: In this stage, data is just copied from an operational
system to another server. In this way, loading, processing, and reporting of the copied data do not
impact the operational system’s performance.
Offline Data Warehouse: Data in the data warehouse is
regularly updated from the operational database. The data is mapped and
transformed to meet the data warehouse objectives.
Real-time Data Warehouse: In this stage, the data warehouse is updated whenever a
transaction takes place in the operational database, for example in an airline or railway booking system.
Integrated Data Warehouse: In this stage, the data warehouse is updated continuously as
the operational system performs transactions. The data warehouse then generates transactions
that are passed back to the operational system.
Components of Data Warehouse
The four components of a data warehouse are:
Load Manager: The load manager is also called the front component. It performs all the operations
associated with the extraction and loading of data into the warehouse. These operations include
transformations to prepare the data for entry into the data warehouse.
Warehouse Manager: The warehouse manager performs operations associated with the management of
the data in the warehouse. These include analyzing data to ensure consistency, creating
indexes and views, generating denormalizations and aggregations, transforming and merging
source data, and archiving and backing up data.
Query Manager: The query manager is also known as the backend component. It performs all the
operations related to the management of user queries. It directs queries to the appropriate
tables and schedules the execution of queries.
End-user access tools: These are categorized into five groups:
1. Data Reporting
2. Query Tools 
3. Application development tools
4. EIS tools
5. OLAP tools and data mining tools.
Who needs a Data Warehouse?
A data warehouse is useful for all types of users, such as:
•Decision makers who rely on large amounts of data.
•Users who use customized, complex processes to obtain information from multiple data sources.
•People who want simple technology to access the data.
•People who want a systematic approach to making decisions.
•Users who need fast performance on a huge amount of data, which is a necessity for reports, grids, or
charts.
•Anyone who wants to discover ‘hidden patterns’ in data flows and groupings; a data warehouse is the first step.
What Is a Data Warehouse Used For?
Here are the most common sectors where a data warehouse is used:
Airline: In airline systems, it is used for operational purposes like crew assignment, analysis
of route profitability, frequent flyer program promotions, etc.
Banking: It is widely used in the banking sector to manage available resources
effectively. Some banks also use it for market research and for performance analysis of products and
operations.
Healthcare: The healthcare sector uses data warehouses to strategize and predict
outcomes, generate patients' treatment reports, and share data with tie-in insurance companies,
medical aid services, etc.
Public sector: In the public sector, the data warehouse is used for intelligence gathering. It helps
government agencies maintain and analyze tax records and health policy records for every individual.
Investment and Insurance sector: In this sector, warehouses are primarily used to analyze data
patterns and customer trends, and to track market movements.
Retail chain: In retail chains, the data warehouse is widely used for distribution and marketing. It also
helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication: A data warehouse is used in this sector for product promotions, sales
decisions, and distribution decisions.
Hospitality Industry: This industry uses warehouse services to design and evaluate its
advertising and promotion campaigns, targeting clients based on their feedback and
travel patterns.

Steps to Implement Data Warehouse 


The best way to address the business risk associated with a data warehouse implementation is to
employ a three-pronged strategy, as below:
1.Enterprise strategy: Here we identify the technical aspects, including the current architecture and tools. We
also identify facts, dimensions, and attributes. Data mapping and transformation are also addressed.
2.Phased delivery: Data warehouse implementation should be phased based on subject areas.
Related business entities like booking and billing should be implemented first and then integrated with
each other.
3.Iterative prototyping: Rather than taking a big bang approach to implementation, the
data warehouse should be developed and tested iteratively.
Here are the key steps in data warehouse implementation along with their deliverables.

Step  Tasks                                          Deliverables
1     Define project scope                           Scope Definition
2     Determine business needs                       Logical Data Model
3     Define Operational Datastore requirements      Operational Data Store Model
4     Acquire or develop extraction tools            Extract Tools and Software
5     Define Data Warehouse data requirements        Transition Data Model
6     Document missing data                          To Do Project List
7     Map Operational Data Store to Data Warehouse   D/W Data Integration Map

Best practices to implement a Data Warehouse 


•Decide a plan to test the consistency, accuracy, and integrity of the data. 
•The data warehouse must be well integrated, well defined and time stamped. 
•While designing the data warehouse, make sure you use the right tool, stick to the life cycle, take care of data
conflicts, and be ready to learn from your mistakes.
•Never replace operational systems and reports.
•Don’t spend too much time on extracting, cleaning, and loading data.
•Involve all stakeholders, including business personnel, in the data warehouse implementation
process. Establish that data warehousing is a joint team project. You don’t want to create a data
warehouse that is not useful to the end users.
•Prepare a training plan for the end users.
Why Do We Need a Data Warehouse? Advantages & Disadvantages

Advantages of Data Warehouse: 

•A data warehouse allows business users to quickly access critical data from several sources, all in
one place.

•A data warehouse provides consistent information on various cross-functional activities. It also
supports ad-hoc reporting and querying.

•A data warehouse helps to integrate many sources of data to reduce stress on the production
system.

•A data warehouse helps to reduce total turnaround time for analysis and reporting.

•Restructuring and integration make the data easier to use for reporting and analysis.

•Because critical data from a number of sources is available in a single
place, users save the time otherwise spent retrieving data from multiple sources.

•Data warehouse stores a large amount of historical data. This helps users to analyze different
time periods and trends to make future predictions.

Disadvantages of Data Warehouse:

•Not an ideal option for unstructured data.

•Creating and implementing a data warehouse is a time-consuming affair.

•A data warehouse can become outdated relatively quickly.

•It is difficult to make changes to data types and ranges, data source schema, etc.
Data Warehouse Tools 
Many data warehousing tools are available on the market. Here are some of the most
prominent ones:
1. MarkLogic: MarkLogic is a data warehousing solution that makes data integration
easier and faster using an array of enterprise features. It can perform very complex search
operations and can query different types of data like documents, relationships, and
metadata. http://developer.marklogic.com/products
2. Oracle: Oracle is a widely used database. It offers a wide range of
data warehouse solutions, both on-premises and in the cloud. It helps to optimize customer
experiences by increasing operational efficiency. https://www.oracle.com/index.html
3. Amazon Redshift: Amazon Redshift is a data warehouse tool. It is a simple and cost-effective
tool for analyzing all types of data using standard SQL and existing BI tools. It also allows running
complex queries against petabytes of structured data, using query
optimization. https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful
data warehouse tools: https://ru99.-com/top-etl-etl-database-housing-tools.html

Difference between Operational Database and Data Warehouse

Operational Database: Operational systems are designed to support high-volume transaction processing.
Data Warehouse: Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).

Operational Database: Operational systems are usually concerned with current data.
Data Warehouse: Data warehousing systems are usually concerned with historical data.

Operational Database: Data within operational systems is updated regularly according to need.
Data Warehouse: Data is non-volatile. New data may be added regularly, but once added it is rarely changed.

Operational Database: It is designed for real-time business dealings and processes.
Data Warehouse: It is designed for analysis of business measures by subject area, categories, and attributes.

Operational Database: It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table.
Data Warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

Operational Database: It is optimized for validation of incoming information during transactions and uses validation data tables.
Data Warehouse: It is loaded with consistent, valid information and requires no real-time validation.

Operational Database: It supports thousands of concurrent clients.
Data Warehouse: It supports few concurrent clients relative to OLTP.

Operational Database: Operational systems are largely process-oriented.
Data Warehouse: Data warehousing systems are largely subject-oriented.

Operational Database: Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data.
Data Warehouse: Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.

Operational Database: Data in.
Data Warehouse: Data out.

Operational Database: A small amount of data is accessed.
Data Warehouse: A large amount of data is accessed.

Operational Database: Relational databases are created for On-Line Transactional Processing (OLTP).
Data Warehouse: The data warehouse is designed for On-Line Analytical Processing (OLAP).
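The contrast between the two workloads can be made concrete with a small sketch. It uses an in-memory SQLite database and a hypothetical `orders` table (both are assumptions for illustration, and a real system would normally keep the two workloads in separate databases): one function performs an OLTP-style single-row write, the other an OLAP-style aggregate over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, order_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, order_year, amount) VALUES (?, ?, ?)",
    [("north", 2022, 100.0), ("north", 2023, 150.0), ("south", 2023, 200.0)],
)
conn.commit()


def place_order(conn, region, year, amount):
    """OLTP-style work: add a single row inside a short transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (region, order_year, amount) VALUES (?, ?, ?)",
            (region, year, amount),
        )
    return cur.lastrowid


def revenue_by_region_and_year(conn):
    """OLAP-style work: read and aggregate many rows in one query."""
    return conn.execute(
        "SELECT region, order_year, SUM(amount) FROM orders "
        "GROUP BY region, order_year ORDER BY region, order_year"
    ).fetchall()


new_id = place_order(conn, "south", 2023, 50.0)
report = revenue_by_region_and_year(conn)
print(new_id, report)
```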

Chapter 2: Data Warehouse Architecture

What is a data warehouse?
A data warehouse is an information system that contains historical and
cumulative data from single or multiple sources. It simplifies the reporting and analysis process of the
organization. It also serves as a single version of truth for a company's decision making and forecasting.

Characteristics of Data warehouse 


A data warehouse has the following characteristics:
•Subject-Oriented 
•Integrated 
•Time-variant 
•Non-volatile
Subject-Oriented

A data warehouse is subject-oriented because it offers information about a theme instead of
a company's ongoing operations. These subjects can be sales, marketing, distribution, etc. A data
warehouse never focuses on ongoing operations. Instead, it puts
emphasis on the modeling and analysis of data for decision making. It also provides a simple and concise
view of the specific subject by excluding data that is not helpful to the decision process.

Integrated 

In a data warehouse, integration means the establishment of a common unit of measure for all
similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a
common and universally acceptable manner. A data warehouse is developed by integrating data from
varied sources like mainframes, relational databases, flat files, etc. Moreover, it must keep consistent
naming conventions, formats, and coding. This integration helps in the effective analysis of data.
Consistency in naming conventions, attribute measures, encoding structure, etc. has to be ensured.
Consider the following example:

In this example, there are three different applications labeled A, B, and C. The information stored in
these applications is Gender, Date, and Balance. However, each application stores its data in a different way.
•In Application A, the gender field stores logical values like M or F.
•In Application B, the gender field is a numerical value.
•In Application C, the gender field is stored in the form of a character value.
•The same is the case with Date and Balance.
However, after the transformation and cleaning process, all this data is stored in a common format in the
data warehouse.
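The gender example can be sketched as a small transformation step. The text does not say which numbers Application B uses or which character values Application C uses, so the mappings below (1/0 and "male"/"female") are assumptions chosen only to make the sketch runnable; the function name is likewise hypothetical.

```python
# Per-application encodings of the gender field.
# Only Application A's M/F values come from the text; B and C are assumed.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},
    "B": {1: "M", 0: "F"},               # assumed numeric coding
    "C": {"male": "M", "female": "F"},   # assumed character coding
}


def normalize_gender(app, raw_value):
    """Map an application-specific gender value to the warehouse format (M/F)."""
    key = raw_value.strip().lower() if app == "C" else raw_value
    try:
        return GENDER_MAPS[app][key]
    except KeyError:
        raise ValueError(
            f"unknown gender value {raw_value!r} for application {app!r}"
        ) from None


# One record from each application, as (application, raw gender value).
records = [("A", "M"), ("B", 0), ("C", "Female")]
normalized = [normalize_gender(app, value) for app, value in records]
print(normalized)
```

Rejecting unknown values with an error, rather than passing them through, keeps inconsistent codes from silently entering the warehouse. The same pattern would apply to the Date and Balance fields.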

Time-Variant
The time horizon of a data warehouse is quite extensive compared with that of
operational systems. The data collected in a data warehouse is identified with a particular period and
offers information from a historical point of view. It contains an element of time, explicitly or
implicitly. One place where data warehouse data displays time variance is in the structure of
the record key. Every primary key contained in the DW should have, either implicitly or explicitly, an
element of time, such as the day, week, or month. Another aspect of time variance is that once data is
inserted in the warehouse, it can’t be updated or changed.
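A record key that carries an element of time can be illustrated with a composite primary key. The `account_balance` table, its columns, and the sample values below are assumptions for the sketch, again using an in-memory SQLite database. Each load adds a new dated snapshot instead of overwriting the previous one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key combines the business key with a time element.
conn.execute(
    "CREATE TABLE account_balance ("
    "account_id INTEGER, snapshot_date TEXT, balance REAL, "
    "PRIMARY KEY (account_id, snapshot_date))"
)


def load_snapshot(conn, account_id, snapshot_date, balance):
    """Add one dated snapshot; existing snapshots are never modified."""
    with conn:
        conn.execute(
            "INSERT INTO account_balance VALUES (?, ?, ?)",
            (account_id, snapshot_date, balance),
        )


load_snapshot(conn, 1, "2023-01-31", 500.0)
load_snapshot(conn, 1, "2023-02-28", 650.0)

history = conn.execute(
    "SELECT snapshot_date, balance FROM account_balance "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()
print(history)

# Loading the same account and date twice violates the key and is rejected.
try:
    load_snapshot(conn, 1, "2023-02-28", 999.0)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```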

Non-volatile
A data warehouse is also non-volatile, which means that previous data is not erased
when new data is entered. Data is read-only and periodically refreshed. This also helps to analyze
historical data and understand what happened and when. It does not require transaction processing, recovery,
or concurrency control mechanisms. Activities like delete, update, and insert, which are performed in
an operational application environment, are omitted in the data warehouse environment. Only two types of
data operations are performed in data warehousing: data loading and data access.

Here are some major differences between an operational application and a data warehouse:

Operational Application: Complex programs must be coded to make sure that data update processes maintain high integrity of the final product.
Data Warehouse: This kind of issue does not arise because data updates are not performed.

Operational Application: Data is placed in a normalized form to ensure minimal redundancy.
Data Warehouse: Data is not stored in normalized form.

Operational Application: The technology needed to support transactions, data recovery, rollback, and deadlock resolution is quite complex.
Data Warehouse: It offers relative simplicity in technology.

Data Warehouse Architectures 


There are mainly three types of data warehouse architecture:
Single-tier architecture: The objective of a single layer is to minimize the amount of
data stored. The goal is to remove data redundancy. This architecture is not frequently used in
practice.
Two-tier architecture: Two-layer architecture physically separates the available sources
from the data warehouse. This architecture is not expandable and does not support a large number
of end users. It also has connectivity problems because of network limitations.
Three-tier architecture: This is the most widely used architecture. It consists of the
Top, Middle, and Bottom Tier.
1.Bottom Tier: The database of the data warehouse serves as the bottom
tier. It is usually a relational database system. Data is cleansed, transformed,
and loaded into this layer using back-end tools.
2.Middle Tier: The middle tier in the data warehouse is an OLAP server
that is implemented using either the ROLAP or the MOLAP model. For a user, this
application tier presents an abstracted view of the database. This layer also acts
as a mediator between the end user and the database.
3.Top Tier: The top tier is a front-end client layer. It comprises the tools and
APIs that you use to connect to the data warehouse and get data out of it. These can be
query tools, reporting tools, managed query tools, analysis tools, and data
mining tools.
The data warehouse is based on an RDBMS server, which is a central information repository
surrounded by some key components that make the entire environment functional, manageable, and
accessible. There are mainly five components of a data warehouse:

Three-Tier Data Warehouse Architecture

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)

2. Middle Tier (OLAP Server)

3. Top Tier (Front end Tools).

The bottom tier consists of the data warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided by
external consultants) is extracted using application program interfaces called gateways. A
gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database), both by Microsoft, and JDBC (Java Database Connectivity).

The middle tier consists of an OLAP server for fast querying of the data warehouse. The
OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that
directly implements multidimensional information and operations.

The top tier contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
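The ROLAP idea of mapping a multidimensional operation onto standard relational operations can be illustrated with a roll-up implemented as a SQL GROUP BY. The `sales_fact` table, its dimension names, and the `roll_up` helper are hypothetical and exist only for this sketch; an in-memory SQLite database plays the role of the bottom tier, and the function plays the role of a very small middle tier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # bottom tier: a relational database
conn.execute(
    "CREATE TABLE sales_fact ("
    "region TEXT, product TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("north", "laptop", "Q1", 100.0),
        ("north", "phone", "Q1", 50.0),
        ("south", "laptop", "Q1", 70.0),
        ("south", "laptop", "Q2", 30.0),
    ],
)

ALLOWED_DIMENSIONS = ("region", "product", "quarter")


def roll_up(conn, dimensions):
    """Middle tier: translate a roll-up over dimensions into GROUP BY."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    for dim in dimensions:
        # Only known column names are interpolated into the SQL text.
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(dimensions)
    sql = (
        f"SELECT {cols}, SUM(amount) FROM sales_fact "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


# Top tier: a front-end tool would display these results.
by_region = roll_up(conn, ["region"])
by_region_quarter = roll_up(conn, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```

A MOLAP server would instead keep the aggregates in a dedicated multidimensional structure rather than computing them through SQL.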

The overall Data Warehouse Architecture is shown in fig:

The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle-tier and top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies,
data mart locations and contents, etc.

2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.

3. System performance data, which includes indices used to improve data access and retrieval
performance.

4. Information about the mapping from operational databases, which covers the
source RDBMSs and their contents, cleaning and transformation rules, etc.

5. Summarization algorithms, predefined queries and reports, and business data, which includes
business terms and definitions, ownership information, etc.

Introduction to Autonomous Data Warehouse Cloud:

• Easy – Fully-managed, pre-configured and optimized for DW workloads

– Simply load data and run

– No need to define indexes, create partitions, etc.

• Fast – Based on Exadata technology

• Elastic – Instant scaling of compute or storage with no downtime

Fully-managed

• Oracle automates end-to-end management of the data warehouse

– Provisioning new databases

– Growing/shrinking storage and/or compute

– Patching and upgrades

– Backup and recovery

• Full lifecycle managed using the service console

– Alternatively, can be managed via command-line interface or REST API


Automated Tuning

• “Load and go”

– Define tables, load data, run queries

• No tuning required

• No special database expertise required

• No need to worry about tablespaces, partitioning, compression, in-memory, indexes,
parallel execution

– Fast performance out of the box with zero tuning

– Simple web-based monitoring console

– Built-in resource-management plans

Fully-elastic

• Size the DW to the exact compute and storage required

– Not constrained by fixed building blocks, no predefined shapes

• Scale the DW on demand

– Independently scale compute or storage

– Resizing occurs instantly, fully online

• Shut off idle compute to save money

– Restart instantly

Full Support of DW Ecosystem

• Autonomous Data Warehouse Cloud supports:

– Existing tools, running on-premises or in the cloud

• Third-party BI tools

• Third-party data-integration tools

• Oracle BI and data-integration tools: BIEE, ODI, etc.

– Oracle cloud services: Analytics Cloud Service, GoldenGate Cloud Service, Integration Cloud Service,
and others

– Connectivity via SQL*Net, JDBC, ODBC

Snowflake:

What is Snowflake used for?

Snowflake is a cloud data warehouse that can store and analyze all your data records in one place. It
can automatically scale up/down its compute resources to load, integrate, and analyze data.

Snowflake’s architecture automatically allocates the right resources

Snowflake’s decoupled storage, compute, and services architecture enables the platform to automatically
deliver the optimal set of IO, memory, and CPU resources for each workload and usage scenario.
Snowflake uses a new multi-cluster, shared data architecture that decouples storage, compute resources, and
system services. Snowflake’s architecture has the following three components:

1. Storage: Snowflake uses a scalable cloud storage service to ensure a high degree of data replication,
scalability, and availability without much manual user intervention. It allows users to organize
information in databases, as per their needs.

2. Compute: Snowflake uses massively parallel processing (MPP) clusters to allocate compute resources
for tasks like loading, transforming, and querying data. It allows users to isolate workloads within
particular virtual warehouses. Users can also specify which databases in the storage layer a particular
virtual warehouse has access to.

3. Cloud services: Snowflake uses a set of services for metadata, security, access control,
and infrastructure management. This layer lets users connect through client interfaces such as the
Snowflake web user interface, JDBC, or ODBC.
