
Data Warehouse Tutorial

Data Warehouse
A Data Warehouse is a relational database management system (RDBMS) construct built to meet requirements that transaction processing systems cannot satisfy. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented toward satisfying decision-making requests. It is a group of decision-support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data warehousing therefore provides architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.

A Data Warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering information and delivering it to business users.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for
decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

It is a database designed for investigative tasks, using data from various applications.

It supports a relatively small number of clients with relatively long interactions.

It includes current and historical data to provide a historical perspective of information.

Its usage is read-intensive.

It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's
decisions."

Characteristics of Data Warehouse


• Subject-Oriented: A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a particular subject,
such as customer, product, or sales, instead of the organization's global ongoing operations. This is done by
excluding data that are not useful concerning the subject and including all data needed by the users to
understand the subject.

• Integrated: A data warehouse integrates various heterogeneous data sources like RDBMSs, flat files, and
online transaction records. It requires performing data cleaning and integration during data warehousing to
ensure consistency in naming conventions, attribute types, etc., among the different data sources.
• Time-Variant: Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction
system, where often only the most current data is kept.
• Non-Volatile: The data warehouse is a physically separate data store, into which data is transformed from the
source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires only two procedures for data access:
initial loading of data and read access to data. Therefore, the DW does not require transaction processing,
recovery, or concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile
means that, once entered into the warehouse, data should not change; the short sketch below illustrates this append-only, dated pattern.
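
A minimal sketch of the time-variant and non-volatile characteristics, using Python's standard-library sqlite3 as a stand-in for the warehouse RDBMS; the sales_snapshot table, its columns, and the values are illustrative assumptions:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # in-memory database for illustration only
    conn.execute("""
        CREATE TABLE sales_snapshot (
            snapshot_date TEXT,   -- time-variant: every row is stamped with its load date
            product       TEXT,
            units_sold    INTEGER
        )
    """)

    # Non-volatile: each periodic load only INSERTs new rows; no UPDATE or DELETE is issued.
    conn.executemany(
        "INSERT INTO sales_snapshot VALUES (?, ?, ?)",
        [("2023-01-31", "laptop", 120), ("2023-02-28", "laptop", 135)],
    )

    # Analysts read the history to see how the measure changed over time.
    for row in conn.execute(
        "SELECT snapshot_date, units_sold FROM sales_snapshot ORDER BY snapshot_date"
    ):
        print(row)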

History of Data Warehouse


The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy
established the "Business Data Warehouse."

In essence, the data warehousing idea was intended to support an architectural model for the flow of information
from operational systems to decision support environments. The concept attempted to address the various
problems associated with this flow, mainly the high costs associated with it.

In the absence of a data warehousing architecture, a vast amount of redundant storage was required to support multiple decision
support environments. In large corporations, it was common for various decision support environments to operate
independently.

Goals of Data Warehousing


To support reporting as well as analysis

Maintain the organization's historical information

Be the foundation for decision making.

Need for Data Warehouse

Data Warehouse is needed for the following reasons:

Data warehousing provides the capability to analyze a large amount of historical data. A Database Management
System (DBMS) stores data in the form of tables, uses an ER model, and aims to guarantee the ACID properties. For example,
the DBMS of a college has tables for students, faculty, etc.

A Data Warehouse is separate from a DBMS; it stores a huge amount of data, typically collected from multiple
heterogeneous sources like files, DBMSs, etc. The goal is to produce statistical results that may help in decision-making.
For example, a college might want to quickly see different results, like how the placement of CS students has
improved over the last 10 years, in terms of salaries, counts, etc.

Issues Occur while Building the Warehouse


When and how to gather data: In a source-driven architecture for gathering data, the data sources transmit new
information, either continually (as transaction processing takes place) or periodically (nightly, for example). In a
destination-driven architecture, the data warehouse periodically sends requests for new data to the sources. Unless
updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will never be quite up-to-date
with the sources. Two-phase commit is usually far too expensive to be an option, so data warehouses
typically have slightly out-of-date data. That, however, is usually not a problem for decision-support systems.

What schema to use: Data sources that have been constructed independently are likely to have different schemas. In
fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration, and
to convert data to the integrated schema before they are stored. As a result, the data stored in the warehouse are
not just a copy of the data at the sources. Instead, they can be thought of as a materialized view of the data at the
sources.

Data transformation and cleansing: The task of correcting and preprocessing data is called data cleansing. Data
sources often deliver data with numerous minor inconsistencies, which can be corrected. For example, names are
often misspelled, and addresses may have street, area, or city names misspelled, or postal codes entered incorrectly.
These can be corrected to a reasonable extent by consulting a database of street names and postal codes in each city.
The approximate matching of data required for this task is referred to as fuzzy lookup.
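
A minimal sketch of such a fuzzy lookup, using Python's standard-library difflib to match misspelled city names against a reference list; the city names and the cutoff value are illustrative assumptions:

    import difflib

    # Reference data, e.g. loaded from a database of valid city names.
    valid_cities = ["Mumbai", "Bengaluru", "Hyderabad", "Chennai", "Kolkata"]

    def fuzzy_lookup(raw_city, cutoff=0.8):
        """Return the closest valid city name, or None if nothing is similar enough."""
        matches = difflib.get_close_matches(raw_city, valid_cities, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(fuzzy_lookup("Mumbay"))      # -> 'Mumbai'
    print(fuzzy_lookup("Hyderbad"))    # -> 'Hyderabad'
    print(fuzzy_lookup("Atlantis"))    # -> None (no close match)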

How to propagate update: Updates on relations at the data sources must be propagated to the data warehouse. If
the relations at the data warehouse are exactly the same as those at the data source, the propagation is
straightforward. If they are not, the problem of propagating updates is basically the view-maintenance problem.

What data to summarize: The raw data generated by a transaction-processing system may be too large to store
online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a
relation, rather than maintaining the entire relation. For example, instead of storing data about every sale of clothing,
we can store total sales of clothing by item name and category.
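
Such a summary table can be produced with a simple aggregation. A sketch using pandas, with made-up item names, categories, and amounts:

    import pandas as pd

    # Raw, transaction-level sales data (illustrative values).
    sales = pd.DataFrame({
        "item_name": ["shirt", "shirt", "jeans", "jeans", "jacket"],
        "category":  ["menswear", "menswear", "menswear", "womenswear", "womenswear"],
        "amount":    [25.0, 30.0, 45.0, 50.0, 120.0],
    })

    # Summary data kept in the warehouse: total sales by item name and category.
    summary = sales.groupby(["item_name", "category"], as_index=False)["amount"].sum()
    print(summary)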

Need for Data Warehouse


An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For storing data of TB size, storage shifts to a Data Warehouse. Besides this, a transactional database doesn't lend itself to analytics. To effectively perform analytics, an organization keeps a central Data Warehouse to closely study its business by organizing, understanding, and using its historical data for making strategic decisions and analyzing trends.

Benefits of Data Warehouse


Better business analytics: A data warehouse plays an important role in every business by storing and analyzing all of the company's past data and records, which further improves the understanding and analysis of that data.

Faster queries: The data warehouse is designed to handle large analytical queries, which is why it runs such queries faster than an operational database.

Improved data quality: In the data warehouse, the data gathered from different sources is stored and analyzed without the warehouse interfering with it or adding data by itself, so the quality of your data is maintained, and if any data quality issue arises, the data warehouse team will resolve it.

Historical insight: The warehouse stores all your historical data, which contains details about the business, so that one can analyze it at any time and extract insights from it.


Example Applications of Data Warehousing


Data Warehousing can be applied anywhere that we have a huge amount of data and want to see statistical results that help in decision making.

Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the large amount of data involved, a Data Warehouse is needed to implement this.

Banking: Most banks these days use warehouses to see the spending patterns of account and card holders. They use this to offer them special deals, offers, etc.
Government: Governments use data warehouses to store and analyze tax payments, which are used to detect tax evasion.

Features of Data Warehousing


Data warehousing is essential for modern data management, providing a strong foundation for organizations to
consolidate and analyze data strategically. Its distinguishing features empower businesses with the tools to make
informed decisions and extract valuable insights from their data.

Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data from various
sources, such as transactional databases, operational systems, and external sources. This enables organizations to
have a comprehensive view of their data, which can help in making informed business decisions.

Data Integration: Data warehousing integrates data from different sources into a single, unified view, which can help
in eliminating data silos and reducing data inconsistencies.

Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze data trends
over time. This can help in identifying patterns and anomalies in the data, which can be used to improve business
performance.

Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable users to explore
and analyze data in different ways. This can help in identifying patterns and trends, and can also help in making
informed business decisions.

Data Transformation: Data warehousing includes a process of data transformation, which involves cleaning, filtering,
and formatting data from various sources to make it consistent and usable. This can help in improving data quality
and reducing data inconsistencies.

Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover hidden
patterns and relationships in their data. This can help in identifying new opportunities, predicting future trends, and
mitigating risks.

Data Security: Data warehousing provides robust data security features, such as access controls, data encryption, and
data backups, which ensure that the data is secure and protected from unauthorized access.

Advantages of Data Warehousing


Intelligent Decision-Making: With centralized data in warehouses, decisions may be made more quickly and
intelligently.

Business Intelligence: Provides strong operational insights through business intelligence.

Historical Analysis: Predictions and trend analysis are made easier by storing past data.

Data Quality: Guarantees data quality and consistency for trustworthy reporting.

Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.

Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.

Cost reductions: Data warehousing can result in cost savings over time by streamlining data management procedures and increasing overall efficiency, even though there are setup costs initially.

Data security: Data warehouses employ security protocols to safeguard confidential information, guaranteeing that
only authorized personnel are granted access to certain data.

Disadvantages of Data Warehousing


Cost: Building a data warehouse can be expensive, requiring significant investments in hardware, software, and
personnel.
Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to manage
the system.

Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses to be patient
and committed to the process.

Data integration challenges: Data from different sources can be challenging to integrate, requiring significant effort to
ensure consistency and accuracy.

Data security: Data warehousing can pose data security risks, and businesses must take measures to protect sensitive data from unauthorized access or breaches.

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks, and we may want to strengthen one part or another with extra tools and services. All of this depends on our circumstances.
Data Warehouse Components
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component that manages the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component


Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the various operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component


After we have extracted data from various operational systems and external sources, we have to prepare it for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.



1) Data Extraction: This function has to deal with numerous data sources. We have to employ the appropriate technique for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even bigger ones. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, the provision of default values for missing data elements, or the elimination of duplicates when we bring in the same data from multiple source systems.

Standardization of data elements forms a large part of data transformation. Data transformation includes many forms of combining pieces of data from different sources. We combine data from a single source record or related data elements from many source records.

On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time. After that, ongoing incremental loads feed the changes captured from the source systems into the warehouse on a regular schedule.

Data Storage Components


Data storage for data warehousing is a separate repository. The data repositories for the operational systems generally include only current data. Also, these data repositories contain data structured in highly normalized form for fast and efficient transaction processing.

Information Delivery Component


The information delivery element is used to enable the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some user-specified scheduling algorithm.



Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.

Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made frequent, incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or business unit. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component


The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and ensure that data is correctly saved in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.

Why we need a separate Data Warehouse?


Data Warehouse queries are complex because they involve the computation of large groups of data at summarized levels.

They may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.

Performing OLAP queries in an operational database degrades the performance of operational tasks.

A Data Warehouse is used for analysis and decision making, which requires an extensive database, including historical data, that an operational database does not typically maintain.

The separation of an operational database from a data warehouse is based on the different structures and uses of data in these systems.

Because the two systems provide different functionalities and require different kinds of data, it is necessary to maintain separate databases.
Database vs. Data Warehouse

1. Database: It is used for Online Transaction Processing (OLTP), but can be used for other purposes such as data warehousing. It records the data from clients, creating a history.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical information about customers for business decisions.

2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.

3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.

4. Database: Entity-relationship modeling techniques are used for database design.
   Data Warehouse: Data modeling approaches are used for the data warehouse design.

5. Database: Optimized for write operations.
   Data Warehouse: Optimized for read operations.

6. Database: Performance is low for analysis queries.
   Data Warehouse: High performance for analytical queries.

7. Database: The database is the place where the data is taken as a base and managed to get fast and efficient access.
   Data Warehouse: The data warehouse is the place where the application data is handled for analysis and reporting purposes.

8. Database: A common database is based on operational or transactional processing. Each operation is an indivisible transaction.
   Data Warehouse: A data warehouse is based on analytical processing.

9. Database: Generally, a database stores current and up-to-date data, which is used for daily operations.
   Data Warehouse: A data warehouse maintains historical data over time. Historical data is data kept over years and can be used for trend analysis, making future predictions, and decision support.

10. Database: A database is generally application-specific. Example: a database stores related data, such as the student details in a school.
    Data Warehouse: A data warehouse is generally integrated at the organization level, by combining data from different databases. Example: a data warehouse integrates the data from one or more databases, so that analysis can be done to get results, such as the best performing school in a city.

11. Database: Constructing a database is not so expensive.
    Data Warehouse: Constructing a data warehouse can be expensive.


Difference between Operational Database and Data Warehouse

The operational database is the source of information for the data warehouse. It includes detailed information used to run the day-to-day operations of the business. The data frequently changes as updates are made and reflects the current values of the last transactions.

Operational Database Management Systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real time.

Data warehouse systems serve users or knowledge workers for the purposes of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.

Data Warehouse and the OLTP database are both relational databases. However, the goals of both these databases
are different.

Operational Database vs. Data Warehouse

Operational Database: Operational systems are designed to support high-volume transaction processing.
Data Warehouse: Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).

Operational Database: Operational systems are usually concerned with current data.
Data Warehouse: Data warehousing systems are usually concerned with historical data.

Operational Database: Data within operational systems is updated regularly according to need.
Data Warehouse: Non-volatile; new data may be added regularly, but once added it is rarely changed.

Operational Database: It is designed for real-time business dealings and processes.
Data Warehouse: It is designed for analysis of business measures by subject area, categories, and attributes.

Operational Database: It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table.
Data Warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

Operational Database: It is optimized for validation of incoming information during transactions and uses validation data tables.
Data Warehouse: It is loaded with consistent, valid information and requires no real-time validation.

Operational Database: It supports thousands of concurrent clients.
Data Warehouse: It supports a few concurrent clients relative to OLTP.

Operational Database: Operational systems are widely process-oriented.
Data Warehouse: Data warehousing systems are widely subject-oriented.

Operational Database: Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data.
Data Warehouse: Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.

Operational Database: Data in.
Data Warehouse: Data out.

Operational Database: A small number of records is accessed per operation.
Data Warehouse: A large number of records is accessed per query.

Operational Database: Relational databases are created for Online Transaction Processing (OLTP).
Data Warehouse: Data warehouses are designed for Online Analytical Processing (OLAP).
OLTP vs. OLAP, feature by feature

Characteristic
OLTP: It is a system used to manage operational data.
OLAP: It is a system used to manage informational data.

Users
OLTP: Clerks, clients, and information technology professionals.
OLAP: Knowledge workers, including managers, executives, and analysts.

System orientation
OLTP: An OLTP system is customer-oriented; transaction and query processing is done by clerks, clients, and information technology professionals.
OLAP: An OLAP system is market-oriented; data analysis is done by knowledge workers, including managers, executives, and analysts.

Data contents
OLTP: An OLTP system manages current data that is typically too detailed to be easily used for decision making.
OLAP: An OLAP system manages a large amount of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity. This makes the data easier to use for informed decision making.

Database size
OLTP: 100 MB to GB.
OLAP: 100 GB to TB.

Database design
OLTP: An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design.
OLAP: An OLAP system typically uses either a star or snowflake model and a subject-oriented database design.

View
OLTP: An OLTP system focuses primarily on the current data within an enterprise or department, without referring to historical data or data in different organizations.
OLAP: An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with data that originates from various organizations, integrating information from many data stores.

Volume of data
OLTP: Not very large.
OLAP: Because of their large volume, OLAP data are stored on multiple storage media.

Access patterns
OLTP: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery techniques.
OLAP: Accesses to OLAP systems are mostly read-only operations, because these data warehouses store historical data.

Access mode
OLTP: Read/write.
OLAP: Mostly read.

Inserts and updates
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic, long-running batch jobs refresh the data.

Number of records accessed
OLTP: Tens.
OLAP: Millions.

Normalization
OLTP: Fully normalized.
OLAP: Partially normalized.

Processing speed
OLTP: Very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.

Difference between OLTP and OLAP

OLTP System

OLTP systems handle operational data. Operational data are the data involved in the operation of a particular system, for example ATM transactions, bank transactions, etc.

OLAP System

OLAP systems handle historical or archival data. Historical data are data that are archived over a long period. For example, if we collect the last 10 years of information about flight reservations, the data can give us meaningful information such as trends in reservations. This may provide useful information like peak times of travel and what kind of people are traveling in various classes (Economy/Business), etc.

The major difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction. Whereas an OLTP system manages many concurrent users and queries touching only an individual record or a limited group of records at a time, an OLAP system must have the capability to operate on millions of records to answer a single query.
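
A minimal sketch of the contrast, using Python's standard-library sqlite3 with an illustrative orders table: the OLTP-style statement touches a single row, while the OLAP-style query scans and aggregates the whole table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "north", 100.0), (2, "north", 250.0), (3, "south", 75.0)],
    )

    # OLTP-style: a short transaction that reads/updates one record.
    conn.execute("UPDATE orders SET amount = 110.0 WHERE order_id = 1")

    # OLAP-style: an analytical query that aggregates over many records.
    for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ):
        print(region, total)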
Data Warehouse Architecture

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of the user for sorting, combining, and summarizing data.


Data warehouses and their architectures vary depending upon the specifics of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic

o Data Warehouse Architecture: With Staging Area

o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System
An operational system is the term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a
different name.


Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-client access tools.


The examples of some of the end-user access tools can be:

o Reporting and Query Tools

o Application Development Tools

o Executive Information Systems Tools

o Online Analytical Processing Tools

o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process our operational data before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area instead (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
The Data Warehouse Staging Area is a temporary location where data from the source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial
analyst wants to analyze historical data for purchases and sales or mine historical information to make predictions
about customer behavior.
Properties of Data Warehouse Architectures


The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to host new applications and technologies without redesigning the whole system.

4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data warehouse management should not be complicated.

Types of Data Warehouse Architectures

Single-Tier Architecture

Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

The figure shows that the only layer physically available is the source layer. In this approach, the data warehouse is virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture

The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system, as shown in the figure.

Although it is typically called two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.

2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.

3. Data warehouse layer: Information is stored in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.

4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.
Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer. It also puts the analytical tools a little further away from being real-time.

Three-Tier Data Warehouse Architecture

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)

3. Top Tier (Front end Tools).

The bottom tier consists of the Data Warehouse server, which is almost always an RDBMS. It may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided by external consultants) is extracted using application program interfaces called gateways. A gateway is provided by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and Embedding for Databases), by Microsoft, and JDBC (Java Database Connectivity).
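
In Python, a DB-API driver plays a similar role to such a gateway: the client program hands SQL to the driver, which executes it at the server. A minimal sketch using the standard-library sqlite3 driver as a stand-in (a real warehouse would use an ODBC or JDBC driver and its own connection string; the table and values are illustrative):

    import sqlite3

    # Connect through the driver; with an ODBC gateway this would be a DSN/connection string.
    conn = sqlite3.connect(":memory:")

    # The client program generates SQL, and the gateway/driver executes it at the server.
    conn.execute("CREATE TABLE customer_profile (customer_id INTEGER, segment TEXT)")
    conn.execute("INSERT INTO customer_profile VALUES (?, ?)", (42, "premium"))

    for row in conn.execute("SELECT customer_id, segment FROM customer_profile"):
        print(row)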


The middle tier consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or

(2) a Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.

The top tier contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:

The metadata repository stores information that defines DW objects. It includes the following parameters and
information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension, hierarchies, data mart
locations, and contents, etc.

2. Operational metadata, which usually describes the currency level of the stored data, i.e., active, archived or
purged, and warehouse monitoring information, i.e., usage statistics, error reports, audit, etc.

3. System performance data, which includes indices, used to improve data access and retrieval performance.

4. Information about the mapping from operational databases, which provides source RDBMSs and their
contents, cleaning and transformation rules, etc.

5. Summarization algorithms, predefined queries and reports, and business data, which includes business terms and definitions, ownership information, etc.
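
A tiny sketch of what such a metadata repository might record, expressed as a plain Python dictionary; all table names, fields, and values are illustrative assumptions, not a standard format:

    # Illustrative metadata entries for one warehouse table.
    metadata_repository = {
        "sales_fact": {
            "schema": {"sale_id": "INTEGER", "store_id": "INTEGER", "amount": "REAL"},
            "source_mapping": {"system": "orders_oltp", "table": "order_lines"},
            "transformation_rules": ["convert currency to USD", "drop cancelled orders"],
            "currency_level": "active",            # operational metadata: active/archived/purged
            "usage_stats": {"queries_last_30_days": 1840},
            "summarizations": ["total amount by store and month"],
        }
    }

    # Tools consult the repository, e.g., to find where a table's data comes from.
    print(metadata_repository["sales_fact"]["source_mapping"])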

Principles of Data Warehousing


Load Performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and must not artificially constrain the volume of data the business requires.

Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata updates.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures local consistency, global
consistency, and referential integrity despite "dirty" sources and massive database size.

Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and terabyte-sized data warehouses are becoming common.

What is Operational Data Stores?

An ODS has been described by Inmon and Imhoff (1996) as a subject-oriented, integrated, volatile, current-valued data store, containing only detailed corporate data. A data warehouse, in contrast, is a reporting database that includes relatively recent as well as historical data and may also include aggregate data.

The ODS is subject-oriented. It is organized around the significant information subjects of an enterprise. In a university, the subjects may be students, lecturers, and courses, while in a company the subjects might be customers, salespersons, and products.

The ODS is integrated. That is, it is a collection of subject-oriented records from a variety of systems that provides an enterprise-wide view of the information.
The ODS is current-valued. That is, an ODS is up-to-date and reflects the current status of the data. An ODS does not contain historical information. Since the OLTP system data is changing all the time, data from the underlying sources refreshes the ODS as regularly and frequently as possible.


The ODS is volatile. That is, the data in the ODS frequently changes as new data refreshes the ODS.

The ODS is detailed. That is, the ODS is detailed enough to serve the needs of the operational management staff in the enterprise. The granularity of the information in the ODS does not have to be precisely the same as in the source OLTP system.

ODS Design and Implementation

The extraction of data from source databases needs to be efficient, and the quality of the data needs to be maintained. Since the data is refreshed regularly and frequently, suitable checks are required to ensure the quality of the data after each refresh. An ODS is a read-only database apart from the regular refreshing by the OLTP systems; users should not be allowed to update ODS information.

Populating an ODS involves an acquisition phase of extracting, transforming, and loading information from OLTP source systems; this procedure is ETL. After populating the database, analyzing it for anomalies and testing it for performance are essential before the ODS system can go online.

Flash Monitoring and Reporting Tools

Flash monitoring and reporting tools are like a dashboard that provides meaningful online data on the operational status of the enterprise. This is achieved by using ODS data as input to the flash monitoring and reporting tools, to provide business users with a continuously refreshed, enterprise-wide view of operations without creating unwanted interruptions or additional load on transaction-processing systems.
Zero Latency Enterprise (ZLE)

The Gartner Group has used the term Zero Latency Enterprise (ZLE) for near-real-time integration of operational information, so that there is no significant delay in getting data from one part or system of an enterprise to another system that needs it.

A ZLE data store is like an ODS that is integrated and up-to-date. The objective of a ZLE data store is to allow management a single view of enterprise information by bringing together relevant data in real time and providing management with a "360-degree" view of the customer.

A ZLE generally has the following features: it provides a consolidated view of the enterprise's operational information, it has a high level of availability, and it involves online refreshing of data. A ZLE requires data that is as current as possible. Since a ZLE needs to support a large number of concurrent users, for example call-centre users, fast turnaround time for transactions and 24/7 availability are required.

Difference between Operational Data Stores and Data Warehouse

Operational Data Store vs. Data Warehouse

Operational Data Store: An ODS is meant for operational reporting and supports current or near-real-time reporting requirements.
Data Warehouse: A data warehouse is intended for historical and trend analysis, usually reporting on a large volume of data.

Operational Data Store: An ODS consists of only a short window of data.
Data Warehouse: A data warehouse includes the entire history of data.

Operational Data Store: It is typically detailed data only.
Data Warehouse: It contains summarized and detailed data.

Operational Data Store: It is used for detailed decision making and operational reporting.
Data Warehouse: It is used for long-term decision making and management reporting.

Operational Data Store: It is used at the operational level.
Data Warehouse: It is used at the managerial level.

Operational Data Store: It serves as a conduit for data between operational and analytical systems.
Data Warehouse: It serves as a repository for cleansed and consolidated data sets.

Operational Data Store: It is updated often, as the transaction systems generate new data.
Data Warehouse: It is usually updated in batch processing mode on a set schedule.
ETL (Extract, Transform, and Load) Process
What is ETL?

The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading.

The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse needs to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system and needs to be agile, automated, and well documented.

How ETL Works

ETL consists of three separate phases:


Extraction

o Extraction is the operation of extracting information from a source system for further use in a data
warehouse environment. This is the first stage of the ETL process.

o The extraction process is often one of the most time-consuming tasks in ETL.

o The source systems might be complicated and poorly documented, and thus determining which data needs
to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date, as sketched below.
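
A minimal sketch of periodic, incremental extraction using a "last extracted" watermark; the orders table, its columns, and sqlite3 standing in for the real source system are illustrative assumptions:

    import sqlite3

    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2024-01-01", 50.0), (2, "2024-01-15", 75.0), (3, "2024-02-01", 20.0)],
    )

    def extract_changes(conn, last_extracted):
        """Pull only rows changed since the previous extraction run."""
        return conn.execute(
            "SELECT order_id, updated_at, amount FROM orders WHERE updated_at > ?",
            (last_extracted,),
        ).fetchall()

    # Each periodic run remembers the watermark from the previous run.
    watermark = "2024-01-10"
    changed_rows = extract_changes(source, watermark)
    print(changed_rows)   # only orders updated after the watermark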

Cleansing

The cleansing stage is crucial in a data warehouse because it is supposed to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values.

The following examples show why data cleansing is essential:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and up-to-date list of contact addresses, email addresses, and telephone numbers must be available.

If a client or supplier calls, the staff responding should be able to quickly find the person in the enterprise database, but this requires that the caller's name or his/her company name be listed correctly in the database.

If a user appears in the database with two or more slightly different names or different account numbers, it becomes difficult to update the customer's information.
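
A minimal sketch of this kind of cleansing in Python: it standardizes names and email addresses and collapses near-duplicate customer records. The records, field names, and the similarity cutoff are illustrative assumptions:

    import difflib

    raw_customers = [
        {"name": "ACME Corp ", "email": "SALES@ACME.COM"},
        {"name": "Acme Corporation", "email": "sales@acme.com"},
        {"name": "Globex Ltd", "email": "info@globex.example"},
    ]

    def standardize(record):
        """Trim whitespace, fix casing, and lower-case the email address."""
        return {
            "name": " ".join(record["name"].split()).title(),
            "email": record["email"].strip().lower(),
        }

    def deduplicate(records, cutoff=0.7):
        """Keep only the first of any group of records whose names are nearly identical."""
        kept = []
        for rec in records:
            names = [k["name"] for k in kept]
            if not difflib.get_close_matches(rec["name"], names, n=1, cutoff=cutoff):
                kept.append(rec)
        return kept

    cleaned = deduplicate([standardize(c) for c in raw_customers])
    print(cleaned)   # the two Acme variants collapse into one record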

Transformation

Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

The following issues must be rectified in this phase:

o Loose text may hide valuable information. For example, "XYZ PVT Ltd" does not explicitly show that this is a private limited company.

o Different formats can be used for the same data. For example, a date can be saved as a string or as three integers.

Following are the main transformation processes aimed at populating the reconciled data layer:

o Conversion and normalization that operate on both storage formats and units of measure to make data
uniform.

o Matching that associates equivalent fields in different sources.

o Selection that reduces the number of source fields and records.

Cleansing and Transformation processes are often closely linked in ETL tools.
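
A small sketch of the three transformation steps named above, conversion/normalization, matching, and selection, applied to records from two hypothetical sources; the field names, the conversion rate, and the values are illustrative assumptions:

    # Records from two sources that disagree on units and field names.
    source_a = [{"cust_id": 1, "revenue_usd": 120.0}]
    source_b = [{"customer": 1, "revenue_eur": 90.0}]

    EUR_TO_USD = 1.1   # assumed conversion rate for the example

    def transform(a_records, b_records):
        reconciled = []
        # Conversion and normalization: one unit of measure, one field layout.
        for rec in a_records:
            reconciled.append({"customer_id": rec["cust_id"], "revenue": rec["revenue_usd"]})
        for rec in b_records:
            reconciled.append({"customer_id": rec["customer"], "revenue": rec["revenue_eur"] * EUR_TO_USD})
        # Matching: associate equivalent records from different sources by customer_id.
        by_customer = {}
        for rec in reconciled:
            by_customer.setdefault(rec["customer_id"], 0.0)
            by_customer[rec["customer_id"]] += rec["revenue"]
        # Selection: keep only the fields and records needed by the warehouse.
        return [{"customer_id": cid, "revenue": round(total, 2)} for cid, total in by_customer.items()]

    print(transform(source_a, source_b))   # [{'customer_id': 1, 'revenue': 219.0}]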
Loading

The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried out in two ways:

1. Refresh: Data warehouse data is completely rewritten. This means that older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.

2. Update: Only the changes applied to the source information are added to the data warehouse. An update is typically carried out without deleting or modifying pre-existing data. This method is used in combination with incremental extraction to update data warehouses regularly, as the sketch after this list illustrates.
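
A minimal sketch of the two loading modes with Python's standard-library sqlite3; the product_dim table and the incoming rows are illustrative assumptions:

    import sqlite3

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT)")

    def refresh_load(conn, rows):
        """Refresh: completely rewrite the table's contents."""
        conn.execute("DELETE FROM product_dim")
        conn.executemany("INSERT INTO product_dim VALUES (?, ?)", rows)

    def update_load(conn, new_rows):
        """Update: add only the newly arrived changes; pre-existing rows are left untouched."""
        conn.executemany("INSERT INTO product_dim VALUES (?, ?)", new_rows)

    refresh_load(warehouse, [(1, "laptop"), (2, "phone")])   # initial population
    update_load(warehouse, [(3, "tablet")])                  # periodic incremental load

    print(warehouse.execute("SELECT * FROM product_dim ORDER BY product_id").fetchall())
    # [(1, 'laptop'), (2, 'phone'), (3, 'tablet')]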

Selecting an ETL Tool

Selection of an appropriate ETL tool is an important decision that has to be made when developing an ODS or data warehousing application. ETL tools are required to provide coordinated access to multiple data sources so that relevant data may be extracted from them. An ETL tool would generally contain facilities for data cleansing, reorganization, transformation, aggregation, calculation, and automatic loading of data into the target database.

An ETL tool should provide a simple user interface that allows data cleansing and data transformation rules to be specified using a point-and-click approach. When all mappings and transformations have been defined, the ETL tool should automatically generate the data extract/transform/load programs, which typically run in batch mode.

Difference between ETL and ELT

ETL (Extract, Transform, and Load)

Extract, Transform and Load is the technique of extracting data from sources (which may be external or on-premises) into a staging area, then transforming or reformatting it, with business rules applied so that it fits operational needs or data analysis, and finally loading it into the target or destination database or data warehouse.

Strengths

Development Time: Designing from the output backwards ensures that only information applicable to the solution is extracted and processed, potentially decreasing development, extraction, and processing overhead.

Targeted data: Due to the targeted nature of the load process, the warehouse contains only information relevant to the presentation. Reduced warehouse content simplifies the security regime that must be enforced and hence the administration overhead.

Tools Availability: The number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. The proliferation of tools has led to a competitive functionality war, which often results in a loss of maintainability.

Weaknesses

Flexibility: Targeting only the relevant information for output means that any future requirement that needs data not included in the original design will have to be added to the ETL routines. Due to the nature of the tight dependency between the routines developed, this often leads to a need for fundamental redesign and development. As a result, this increases the time and cost involved.

Hardware: Most third-party tools utilize their own engine to implement the ETL phase. Regardless of the scale of the solution, this can necessitate investment in additional hardware to run the tool's ETL engine. The use of third-party tools to implement the ETL process also compels learning new scripting languages and processes.

Learning Curve: Implementing a third-party tool that uses unfamiliar processes and languages results in the learning curve that is implicit in any technology new to an organization, and can often lead to blind alleys in its use due to a shortage of experience.

ELT (Extract, Load and Transform)

ELT, which stands for Extract, Load and Transform, is a different way of looking at data migration or movement. ELT involves extracting the data from the source systems and loading it into the target system, instead of performing transformation between the extraction and loading phases. Once the data is copied or loaded into the target system, the transformation takes place there.
The extract and load steps can be isolated from the transformation process. Isolating the load phase from the transformation process removes an inherent dependency between these phases. In addition to containing the data necessary for the transformations, the extract and load process can include elements of data that may be essential in the future. The load phase could take the entire source and load it into the warehouse.

Separating the phases enables the project to be broken down into smaller chunks, thus making it more specific and manageable.

Performing the data integrity analysis in the staging area enables a further phase in the process to be isolated and dealt with at the most appropriate point in the process. This approach also helps to ensure that only cleaned and checked information is loaded into the warehouse for transformation.

Isolating the transformations from the load steps helps to encourage a more staged approach to warehouse design and implementation.
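
A minimal sketch of the ELT pattern with Python's sqlite3: raw records are loaded into a staging table unchanged, and the transformation then runs inside the database with SQL. The table names, conversion rate, and filtering rule are illustrative assumptions:

    import sqlite3

    db = sqlite3.connect(":memory:")   # stands in for the target warehouse engine

    # Load: copy raw source records into a staging table without transforming them.
    db.execute("CREATE TABLE stg_orders (order_id INTEGER, amount_eur REAL, status TEXT)")
    db.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?)",
        [(1, 90.0, "shipped"), (2, 40.0, "cancelled"), (3, 15.5, "shipped")],
    )

    # Transform: use the database engine itself to filter, convert, and reshape the data.
    db.execute("""
        CREATE TABLE fact_orders AS
        SELECT order_id,
               ROUND(amount_eur * 1.1, 2) AS amount_usd   -- assumed conversion rate
        FROM stg_orders
        WHERE status <> 'cancelled'
    """)

    print(db.execute("SELECT * FROM fact_orders").fetchall())
    # [(1, 99.0), (3, 17.05)]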

Strengths

Project Management: Being able to divide the warehouse method into specific and isolated functions, enables a
project to be designed on a smaller function basis, therefore the project can be broken down into feasible chunks.

Flexible & Future Proof: In general, in an ELT implementation, all record from the sources are loaded into the data
warehouse as part of the extract and loading process. This, linked with the isolation of the transformation phase,
means that future requirements can easily be incorporated into the data warehouse architecture.

Risk minimization: Deleting the close interdependencies between each technique of the warehouse build system
enables the development method to be isolated, and the individual process design can thus also be separated. This
provides a good platform for change, maintenance and management.

Utilize Existing Hardware: In implementing ELT as a warehouse build process, the essential tools provided with the
database engine can be used.

Utilize Existing Skill Sets: By using the functionality supported by the database engine, the existing investment in
database skills is re-used to develop the warehouse. No new skills need to be learned, and the full weight of the
experience in developing with the engine's technology is utilized, further reducing the cost and risk of the
development process.

Weaknesses

Against the Norm: ELT is a newer approach to data warehouse design and development. While it has proven itself many
times over through widespread use in implementations throughout the world, it does require a change in mentality
and design approach compared with traditional methods.

Tools Availability: Being an emergent technology approach, ELT suffers from the limited availability of tools.

Difference between ETL vs. ELT


Process: In ETL, data is transferred to the ETL server and moved back to the database, so high network bandwidth is required. In ELT, data remains in the database except for cross-database loads (e.g. source to target).

Transformation: In ETL, transformations are performed in the ETL server. In ELT, transformations are performed in the source or in the target.

Code Usage: ETL is typically used for compute-intensive transformations and small amounts of data. ELT is typically used for source-to-target transfers and high amounts of data.

Time and Maintenance: ETL needs high maintenance, as you need to select the data to load and transform. ELT is low maintenance, as the data is always available.

Calculations: ETL overwrites the existing column or needs to append the dataset and push it to the target platform. ELT easily adds the calculated column to the existing table.

Types of Data Warehouses

There are different types of data warehouses, which are as follows:


Host-Based Data Warehouses

There are two types of host-based data warehouses which can be implemented:

o Host-Based mainframe warehouses, which reside on a high-volume database, supported by robust and reliable
high-capacity structures such as IBM System/390, UNISYS, and Data General Sequent systems, and by databases
such as Sybase, Oracle, Informix, and DB2.

o Host-Based LAN data warehouses, where data delivery can be handled either centrally or from the
workgroup environment. The size of the warehouse database depends on the platform.

Data extraction and transformation tools allow the automated extraction and cleaning of data from production
systems. It is not advisable to allow direct access by query tools to these categories of systems, for the following
reasons:

1. A huge load of complex warehousing queries would probably have too harmful an impact upon the
mission-critical transaction processing (TP)-oriented applications.

2. These TP systems have been optimized in their database design for transaction throughput. In general, a
database is designed for either optimal query processing or optimal transaction processing. A complex business
query requires the joining of many normalized tables, and as a result performance will usually be poor and the
query constructs largely complex.

3. There is no assurance that data in two or more production systems will be consistent.

Host-Based (MVS) Data Warehouses

Data warehouses that reside on large-volume databases on MVS are the host-based (MVS) type of data warehouse.
Often the DBMS is DB2, with a huge variety of original sources of legacy information, including VSAM, DB2, flat files,
and the Information Management System (IMS).

Before embarking on designing, building and implementing such a warehouse, some further consideration must be
given, because:

1. Such databases generally have very high volumes of data storage.

2. Such warehouses may require support for both MVS and customer-based report and query facilities.

3. These warehouses have complicated source systems.

4. Such systems need continuous maintenance, since they must also be used for mission-critical objectives.

To make the building of such data warehouses successful, the following phases are generally followed (a sketch appears after the list):

1. Unload Phase: It contains selecting and scrubbing the operational data.

2. Transform Phase: For translating the data into an appropriate form and describing the rules for accessing and
storing it.

3. Load Phase: For moving the records directly into DB2 tables or into a particular file for moving them into another
database or a non-MVS warehouse.
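As a hedged sketch of these three phases (the record layout, the scrubbing rule, and the target table below are hypothetical, and sqlite3 merely stands in for DB2 or another warehouse database):

```python
import sqlite3

def unload(raw_records):
    """Unload phase: select and scrub the operational data."""
    return [r for r in raw_records if r.get("customer_id") is not None]

def transform(records):
    """Transform phase: translate records into the warehouse form."""
    return [(r["customer_id"], r["name"].strip().upper()) for r in records]

def load(rows, db_path="warehouse.db"):
    """Load phase: move the records into a warehouse table."""
    with sqlite3.connect(db_path) as con:
        con.execute("CREATE TABLE IF NOT EXISTS dim_customer (customer_id INTEGER, name TEXT)")
        con.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

operational = [{"customer_id": 1, "name": " alice "}, {"customer_id": None, "name": "bad row"}]
load(transform(unload(operational)))
```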

An integrated metadata repository is central to any data warehouse environment. Such a facility is required for
documenting data sources, data translation rules, and the user areas of the warehouse. It provides a dynamic link
between the multiple data-source databases and the DB2 of the data warehouse.

A metadata repository is necessary to design, build, and maintain data warehouse processes. It should be capable of
describing what data exists in both the operational systems and the data warehouse, where the data is located, how
the operational data maps to the warehouse fields, and the end-user access techniques. Query, reporting, and
maintenance facilities are another indispensable part of such a data warehouse, for example an MVS-based query and
reporting tool for DB2.

Host-Based (UNIX) Data Warehouses

Oracle and Informix RDBMSs support the facilities for such data warehouses. Both of these databases can extract
information from MVS-based databases as well as from a large number of other UNIX-based databases. These types of
warehouses follow the same stages as the host-based MVS data warehouses. Data from different network servers can
also be included, since file attribute consistency is frequent across the inter-network.

LAN-Based Workgroup Data Warehouses

A LAN-based workgroup warehouse is an integrated structure for building and maintaining a data warehouse in a LAN
environment. In this warehouse, we can extract information from a variety of sources and support multiple LAN-based
warehouses. Commonly chosen warehouse databases include the DB2 family, Oracle, Sybase, and Informix. Other
databases that can also be included, although infrequently, are IMS, VSAM, flat files, MVS, and VM.

Designed for the workgroup environment, a LAN-based workgroup warehouse is optimal for any business
organization that wants to build a data warehouse, often called a data mart. This type of data warehouse generally
requires a minimal initial investment and little technical training.

Data Delivery: With a LAN-based workgroup warehouse, customers need minimal technical knowledge to create and
maintain a store of data that is customized for use at the department, business unit, or workgroup level. A LAN-based
workgroup warehouse ensures the delivery of information from corporate resources by providing transparent access to
the data in the warehouse.

Host-Based Single Stage (LAN) Data Warehouses

Within a LAN-based data warehouse, data delivery can be handled either centrally or from the workgroup
environment, so business groups can process their data needs without burdening centralized IT resources, enjoying
the autonomy of their own data mart without compromising overall data integrity and security in the enterprise.

Limitations

Both DBMS and hardware scalability issues generally limit LAN-based warehousing solutions.

Many LAN based enterprises have not implemented adequate job scheduling, recovery management, organized
maintenance, and performance monitoring methods to provide robust warehousing solutions.

Often these warehouses are dependent on other platforms for source records. Building an environment that has data
integrity, recoverability, and security requires careful design, planning, and implementation. Otherwise,
synchronization of transformations and loads from sources to the server could cause innumerable problems.

A LAN-based warehouse provides data from many sources while requiring a minimal initial investment and technical
knowledge. A LAN-based warehouse can also use replication tools for populating and updating the data warehouse.
This type of warehouse can include business views, histories, aggregations, versioning, and heterogeneous source
support, such as

o DB2 Family

o IMS, VSAM, Flat File [MVS and VM]

A LAN-based warehouse is frequently driven by a single store and supports existing DSS applications, enabling the
business user to locate data in the data warehouse. The LAN-based warehouse can support business users with a
complete data-to-information solution. The LAN-based warehouse can also share metadata, with the ability to catalog
business data and make it accessible to anyone who needs it.

Multi-Stage Data Warehouses

It refers to multiple stages in the transformation process for analyzing data through aggregations. In other words, the
data is staged multiple times before the loading operation into the data warehouse: data gets extracted from the
source systems to a staging area first, is then loaded into the data warehouse after transformation, and finally flows
into departmentalized data marts.

This configuration is well suited to environments where end-clients in numerous capacities require access both to
current information for up-to-the-minute tactical decisions and to summarized, cumulative records for long-term
strategic decisions. Both the Operational Data Store (ODS) and the data warehouse may reside on host-based or
LAN-based databases, depending on volume and usage requirements. These include DB2, Oracle, Informix, IMS, flat
files, and Sybase.

Usually, the ODS stores only the most up-to-date records, while the data warehouse stores the historical accumulation
of the records. At first, the information in both databases will be very similar; for example, the record for a new client
will look the same. As changes to the client record occur, the ODS will be refreshed to reflect only the most current
data, whereas the data warehouse will contain both the historical data and the new information. Thus the volume
requirement of the data warehouse will exceed the volume requirement of the ODS over time; it is not unusual to
reach a ratio of 4 to 1 in practice.

Stationary Data Warehouses

In this type of data warehouse, the data is not moved from the sources, as shown in the figure. Instead, the customer
is given direct access to the data. For many organizations, infrequent access, volume issues, or corporate necessities
dictate such an approach. This scheme does generate several problems for the customer, such as:

o Identifying the location of the information for the users

o Providing clients the ability to query different DBMSs as if they were all a single DBMS with a single API.

o Impacting performance since the customer will be competing with the production data stores.

Such a warehouse will need highly specialized and sophisticated 'middleware', possibly with a single point of
interaction for the client. A facility to display the extracted records to the user before report generation may also be
essential. An integrated metadata repository becomes absolutely essential in this environment.

Distributed Data Warehouses

The concept of a distributed data warehouse suggests two types and their variations: local enterprise warehouses,
which are distributed throughout the enterprise, and a global warehouse, as shown in the figure:
Characteristics of Local data warehouses

o Activity appears at the local level

o Bulk of the operational processing

o Local site is autonomous

o Each local data warehouse has its unique architecture and contents of data

o The data is unique and of prime importance to that locality only

o Majority of the record is local and not replicated

o Any intersection of data between local data warehouses is circumstantial

o Local warehouse serves different technical communities

o The scope of the local data warehouse is limited to the local site

o Local warehouses also include historical data and are integrated only within the local site.

Virtual Data Warehouses

A virtual data warehouse is created in the following stages:

1. Installing a set of data access, data dictionary, and process management facilities.

2. Training the end-clients.

3. Monitoring how the data warehouse facilities are used.

4. Based upon actual usage, physically creating a data warehouse to provide the high-frequency results.

This strategy means that end-users are allowed to access the operational databases directly, using whatever tools are
enabled for the data access network. This approach provides ultimate flexibility as well as the minimum amount of
redundant information that must be loaded and maintained. A data warehouse is a great idea, but it is complex to
build and requires investment. Why not use a cheap and fast approach by eliminating the transformation steps and
the repositories for metadata and other databases? This approach is termed the 'virtual data warehouse.'

To accomplish this, there is a need to define four kinds of data (a sketch follows the list):

1. A data dictionary including the definitions of the various databases.

2. A description of the relationship between the data components.

3. A description of how the user will interface with the system.

4. The algorithms and business rules that describe what to do and how to do it.
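A hedged sketch of how these four kinds of definitions might be recorded is shown below; the database names, relationships, and rules are purely hypothetical, and any real virtual-warehouse middleware would use its own repository format.

```python
# Hypothetical catalog for a virtual data warehouse (illustrative only).
virtual_dw_catalog = {
    "data_dictionary": {                 # 1. definitions of the various databases
        "orders_db": "operational order-entry database",
        "crm_db": "customer relationship database",
    },
    "relationships": [                   # 2. relationships between data components
        ("orders_db.orders.customer_id", "crm_db.customers.id"),
    ],
    "user_interface": {                  # 3. how users will interface with the system
        "tool": "ad-hoc query tool",
        "access_path": "ODBC",
    },
    "business_rules": [                  # 4. algorithms and rules describing what to do
        "revenue = quantity * unit_price",
        "exclude cancelled orders from sales totals",
    ],
}
```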

Disadvantages

1. Since queries compete with production transactions, performance can be degraded.

2. There is no metadata, no summary data, and no individual DSS (Decision Support System) integration or
history. All queries must be repeated, causing an additional burden on the system.

3. There is no refreshing process, causing the queries to be very complex.

Data Warehouse Modeling

Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of
the data warehouse. The goal of data warehouse modeling is to develop a schema describing reality, or at least a
part of it, which the data warehouse is needed to support.

Data warehouse modeling is an essential stage of building a data warehouse for two main reasons. Firstly, through
the schema, data warehouse clients can visualize the relationships among the warehouse data, to use them with
greater ease. Secondly, a well-designed schema allows an effective data warehouse structure to emerge, to help
decrease the cost of implementing the warehouse and improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational database systems. The primary
function of data warehouses is to support DSS processes. Thus, the objective of data warehouse modeling is to make
the data warehouse efficiently support complex queries on long term information.

In contrast, data modeling in operational database systems targets efficiently supporting simple transactions in the
database such as retrieving, inserting, deleting, and changing data. Moreover, data warehouses are designed for the
customer with general information knowledge about the enterprise, whereas operational database systems are more
oriented toward use by software specialists for creating distinct applications.

Data Warehouse model is illustrated in the given diagram.

The data within the specific warehouse itself has a particular architecture with the emphasis on various levels of
summarization, as shown in figure:
The current detail record is central in importance as it:

o Reflects the most current happenings, which are commonly the most stimulating.

o It is voluminous, as it is saved at the lowest level of granularity.

o It is almost always saved on disk storage, which is fast to access but expensive and difficult to manage.

Older detail data is stored on some form of mass storage; it is infrequently accessed and kept at a level of detail
consistent with the current detailed data.

Lightly summarized data is data extracted from the low level of detail found at the current detailed level and is usually
stored on disk storage. When building the data warehouse, we have to consider over what unit of time the
summarization is done and what components or attributes the summarized data will contain.

Highly summarized data is compact and directly available and can even be found outside the warehouse.
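To make the levels of summarization concrete, the sketch below (a hypothetical sales table, with sqlite3 as the storage engine) derives a lightly summarized table using a month as the unit of time and item as the retained attribute, and a highly summarized table holding only yearly totals; choosing that unit of time and those attributes is exactly the design decision noted above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_detail (sale_date TEXT, item TEXT, amount REAL)")  # current detail
con.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
    ("2024-01-03", "WIDGET", 10.0),
    ("2024-01-17", "WIDGET", 15.0),
    ("2024-02-02", "GADGET", 7.5),
])

# Lightly summarized: monthly totals per item.
con.execute("""
    CREATE TABLE sales_by_month AS
    SELECT substr(sale_date, 1, 7) AS month, item, SUM(amount) AS total
    FROM sales_detail GROUP BY month, item
""")

# Highly summarized: yearly totals only.
con.execute("""
    CREATE TABLE sales_by_year AS
    SELECT substr(sale_date, 1, 4) AS year, SUM(amount) AS total
    FROM sales_detail GROUP BY year
""")
```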

Metadata is the final component of the data warehouse and is really of a different dimension, in that it is not the same
as data drawn from the operational environment; rather, it is used as:

o A directory to help the DSS investigator locate the items of the data warehouse.

o A guide to the mapping of records as the data is transformed from the operational environment to the data
warehouse environment.

o A guide to the methods used for summarization between the current detailed data, the lightly summarized data,
and the highly summarized data.

Data Modeling Life Cycle

In this section, we define a data modeling life cycle. It is a straightforward process of transforming business
requirements into a design that fulfills the goals for storing, maintaining, and accessing the data within IT systems. The
result is a logical and physical data model for an enterprise data warehouse.

The objective of the data modeling life cycle is primarily the creation of a storage area for business information. That
area comes from the logical and physical data modeling stages, as shown in Figure:
Conceptual Data Model

A conceptual data model recognizes the highest-level relationships between the different entities.

Characteristics of the conceptual data model

o It contains the essential entities and the relationships among them.

o No attribute is specified.

o No primary key is specified.

We can see that the only data shown via the conceptual data model is the entities that define the data and the
relationships between those entities. No other data is shown through the conceptual data model.

Logical Data Model

A logical data model describes the information in as much detail as possible, without regard to how it will be
physically implemented in the database. The primary objective of logical data modeling is to document the business
data structures, processes, rules, and relationships in a single view, the logical data model.

Features of a logical data model

o It involves all entities and relationships among them.

o All attributes for each entity are specified.


o The primary key for each entity is stated.

o Referential Integrity is specified (FK Relation).

The steps for designing the logical data model are as follows:

o Specify primary keys for all entities.

o List the relationships between different entities.

o List all attributes for each entity.

o Normalization.

o No data types are listed

Physical Data Model

A physical data model describes how the model will be implemented in the database. A physical database model
demonstrates all table structures, column names, data types, constraints, primary key, foreign key, and relationships
between tables. The purpose of physical data modeling is the mapping of the logical data model to the physical
structures of the RDBMS system hosting the data warehouse. This contains defining physical RDBMS structures, such
as tables and data types to use when storing the information. It may also include the definition of new data
structures for enhancing query performance.

Characteristics of a physical data model

o Specification of all tables and columns.

o Foreign keys are used to recognize relationships between tables.

The steps for physical data model design are as follows (a sketch follows the list):

o Convert entities to tables.

o Convert relationships to foreign keys.

o Convert attributes to columns.
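As a hedged illustration of these steps, the sketch below converts a small hypothetical logical model (CUSTOMER and ORDER entities with a one-to-many relationship) into physical tables: entities become tables, attributes become typed columns, and the relationship becomes a foreign key.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Entity CUSTOMER becomes a table; attributes become typed columns; the primary key is stated.
con.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    )
""")

# Entity ORDER becomes a table; the CUSTOMER-ORDER relationship becomes a foreign key column.
con.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT,
        amount      REAL
    )
""")
```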


Types of Data Warehouse Models

Enterprise Warehouse

An enterprise warehouse collects all of the information about subjects spanning the entire organization. It supports
corporate-wide data integration, usually from one or more operational systems or external data providers, and it is
cross-functional in scope. It generally contains detailed information as well as summarized information and can range
in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, or parallel
architecture platforms. It requires extensive business modeling and may take years to develop and build.
Data Mart

A data mart includes a subset of corporate-wide data that is of value to a specific collection of users. The scope is
confined to particular selected subjects. For example, a marketing data mart may restrict its subjects to the customer,
items, and sales. The data contained in the data marts tend to be summarized.

Data Marts is divided into two parts:

Independent Data Mart: An independent data mart is sourced from data captured from one or more operational
systems or external data providers, or from data generated locally within a particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.

Virtual Warehouses

A virtual data warehouse is a set of views over the operational databases. For efficient query processing, only
some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess
capacity on the operational database servers.

Data Warehouse Design

A data warehouse is a single data repository where records from multiple data sources are integrated for online
business analytical processing (OLAP). This implies that a data warehouse needs to meet the requirements of all the
business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy, and hence
error-prone process. Furthermore, business analytical functions change over time, which results in changes in the
requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and the design process is
continuous.

Data warehouse design takes a different approach from view materialization in industry. It sees data warehouses
as database systems with particular needs, such as answering management-related queries. The target of the design
becomes how the records from multiple data sources should be extracted, transformed, and loaded (ETL) to be
organized in a database as the data warehouse.

There are two approaches

1. "top-down" approach

2. "bottom-up" approach

Top-down Design Approach

In the "Top-Down" design approach, a data warehouse is described as a subject-oriented, time-variant, non-volatile
and integrated data repository for the entire enterprise data from different sources are validated, reformatted and
saved in a normalized (up to 3NF) database as the data warehouse. The data warehouse stores "atomic" information,
the data at the lowest level of granularity, from where dimensional data marts can be built by selecting the data
required for specific business subjects or particular departments. An approach is a data-driven approach as the
information is gathered and integrated first and then business requirements by subjects for building data marts are
formulated. The advantage of this method is which it supports a single integrated data source. Thus data marts built
from it will have consistency when they overlap.

Advantages of top-down design

Data Marts are loaded from the data warehouse.

Developing new data mart from the data warehouse is very easy.

Disadvantages of top-down design


This technique is inflexible to changing departmental needs.

The cost of implementing the project is high.

Bottom-Up Design Approach

In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data specifical architecture for
query and analysis," term the star schema. In this approach, a data mart is created first to necessary reporting and
analytical capabilities for particular business processes (or subjects). Thus it is needed to be a business-driven
approach in contrast to Inmon's data-driven approach.

Data marts include the lowest grain data and, if needed, aggregated data too. Instead of a normalized database for
the data warehouse, a denormalized dimensional database is adapted to meet the data delivery requirements of
data warehouses. Using this method, to use the set of data marts as the enterprise data warehouse, data marts
should be built with conformed dimensions in mind, defining that ordinary objects are represented the same in
different data marts. The conformed dimensions connected the data marts to form a data warehouse, which is
generally called a virtual data warehouse.

The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data
warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data warehouse.
Also, the risk of failure is even less. This method is inherently incremental. This method allows the project team to
learn and grow.
Advantages of bottom-up design

Documents can be generated quickly.

The data warehouse can be extended to accommodate new business units.

It is just a matter of developing new data marts and then integrating them with the other data marts.

Disadvantages of bottom-up design

The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Top-down: Breaks the vast problem into smaller subproblems. Bottom-up: Solves the essential low-level problems and integrates them into a higher one.

Top-down: Inherently architected, not a union of several data marts. Bottom-up: Inherently incremental; essential data marts can be scheduled first.

Top-down: Single, central storage of information about the content. Bottom-up: Departmental information is stored.

Top-down: Centralized rules and control. Bottom-up: Departmental rules and control.

Top-down: It includes redundant information. Bottom-up: Redundancy can be removed.

Top-down: It may see quick results if implemented with iterations. Bottom-up: Less risk of failure, favorable return on investment, and proof of techniques.

Data Warehouse Implementation

There are various steps in implementing a data warehouse, which are as follows:

1. Requirements analysis and capacity planning: The first step in data warehousing involves defining enterprise
needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This
step will involve consulting senior management as well as the various stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by
integrating the servers, the storage systems, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may
involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves
designing the physical data warehouse organization, data placement, data partitioning, deciding on access
methods, and indexing.

5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves
identifying and connecting the sources using a gateway, ODBC drivers, or another wrapper.

6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and
implementing the ETL phase may involve identifying a suitable ETL tool vendor and purchasing and implementing the
tools. This may include customizing the tool to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed,
perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the
warehouse given the schema and view definitions.

8. User applications: For the data warehouse to be useful, there must be end-user applications. This step involves
designing and implementing the applications required by the end-users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client
applications tested, the warehouse system and the applications may be rolled out for the user community to use.


Implementation Guidelines

1. Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart
be created with one particular project in mind; once it is implemented, several other sections of the enterprise may
also want to implement similar systems. An enterprise data warehouse can then be implemented in an iterative
manner, allowing all data marts to extract information from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable
research into the expected costs and benefits of the project. Data warehousing projects require input from many units
in an enterprise and therefore need to be driven by someone who is capable of interacting with people across the
enterprise and can actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully supported by senior management. Given
the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls
for a sustained commitment from senior management.

4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded
into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with the corporate strategy and business goals. The
purpose of the project must be defined before the beginning of the project.

6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan
for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such
understanding, rumors about expenditure and benefits can become the only sources of information, undermining the
project.

7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be
successful, the users must be trained to use the warehouse and to understand its capabilities.


8. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and
when required. Like any system, a data warehouse will need to change as the needs of the enterprise change.

9. Joint management: The project must be managed by both IT and business professionals in the enterprise. To ensure
proper communication with the stakeholders and that the project is targeted at assisting the enterprise's business,
business professionals must be involved in the project along with technical professionals.

What is Meta Data?

Metadata is data about the data or documentation about the information which is required by the users. In data
warehousing, metadata is one of the essential aspects.

Metadata includes the following:

1. The location and descriptions of warehouse systems and components.

2. Names, definitions, structures, and content of data-warehouse and end-users views.

3. Identification of authoritative data sources.

4. Integration and transformation rules used to populate data.

5. Integration and transformation rules used to deliver information to end-user analytical tools.

6. Subscription information for information delivery to analysis subscribers.

7. Metrics used to analyze warehouses usage and performance.

8. Security authorizations, access control list, etc.

Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users to
understand the content and find the data they need.

Several examples of metadata are:

1. A library catalog may be considered metadata. The catalog metadata consists of several predefined
components representing specific attributes of a resource, and each component can have one or more values.
These components could be the name of the author, the name of the document, the publisher's name, the
publication date, and the categories to which it belongs.

2. The table of contents and the index in a book may be treated as metadata for the book.

3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the person's
weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data value 80.

4. Another example of metadata is data about the tables and figures in a report like this book. A table (which
is a dataset) has a name (e.g., the table title), and the column names of the table may be treated as
metadata. The figures also have titles or names.

Why is metadata necessary in a data warehouse?

o First, it acts as the glue that links all parts of the data warehouse.

o Next, it provides information about the contents and structures to the developers.

o Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.

Metadata is like a nerve center. Various processes during the building and administering of the data warehouse
generate parts of the data warehouse metadata, and other processes use parts of the metadata generated by the
first. In the data warehouse, metadata assumes a key position and enables communication among the various
processes. It acts as the nerve center of the data warehouse.


Figure shows the location of metadata within the data warehouse.

Types of Metadata

Metadata in a data warehouse fall into three major parts:

o Operational Metadata
o Extraction and Transformation Metadata

o End-User Metadata

Operational Metadata

As we know, data for the data warehouse comes from various operational systems of the enterprise. These source
systems contain different data structures. The data elements selected for the data warehouse have various field
lengths and data types.

In selecting information from the source systems for the data warehouse, we split records, combine parts of records
from different source files, and deal with multiple coding schemes and field lengths. When we deliver information to
the end-users, we must be able to tie it back to the source data sets. Operational metadata contains all of this
information about the operational data sources.

Extraction and Transformation Metadata

Extraction and transformation metadata includes data about the extraction of data from the source systems, namely
the extraction frequencies, extraction methods, and business rules for the data extraction. This category of metadata
also contains information about all the data transformations that take place in the data staging area.

End-User Metadata

The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find data in the
data warehouse. The end-user metadata allows the end-users to use their own business terminology and look for
information in the ways in which they usually think of the business.

Metadata Interchange Initiative

The metadata interchange initiative was launched to bring industry vendors and users together to address a variety of
difficult problems and issues concerning the exchange, sharing, and management of metadata. The goal of the
metadata interchange standard is to define an extensible mechanism that will allow vendors to exchange standard
metadata as well as carry along "proprietary" metadata. The founding members agreed on the following initial goals:

1. Creating a vendor-independent, industry-defined and maintained standard access mechanism and
application programming interface (API) for metadata.

2. Enabling users to control and manage the access and manipulation of metadata in their unique environments
through the use of interchange-standards-compliant tools.

3. Allowing users to build tools that meet their needs and to adjust the configurations of those tools accordingly.

4. Allowing individual tools to satisfy their metadata requirements freely and efficiently within the context of an
interchange model.

5. Describing a simple, clean implementation infrastructure that will facilitate compliance and speed up
adoption by minimizing the amount of modification required.

6. Creating a procedure and process not only for establishing and maintaining the interchange standard
specification but also for updating and extending it over time.

Metadata Interchange Standard Framework

The interchange standard metadata model implementation assumes that the metadata itself may be stored in a
storage format of any type: ASCII files, relational tables, fixed or customized formats, and so on.

The framework is based on an approach that translates an access request into the standard interchange format.

Several approaches have been proposed within the metadata interchange coalition:

o Procedural Approach

o ASCII Batch Approach

o Hybrid Approach

In the procedural approach, the communication with the API is built into the tool. It enables the highest degree of
flexibility.

The ASCII batch approach instead relies on an ASCII file format that contains descriptions of the various metadata
items and the standardized access requirements that make up the interchange standard metadata model.

The hybrid approach follows a data-driven model.

Components of Metadata Interchange Standard Frameworks

1) Standard Metadata Model: It refers to the ASCII file format, which is used to represent metadata that is being
exchanged.

2) The standard access framework that describes the minimum number of API functions.

3) Tool profile, which is provided by each tool vendor.

4) The user configuration is a file explaining the legal interchange paths for metadata in the user's environment.

Metadata Repository

The metadata itself is housed in and controlled by the metadata repository. Metadata repository management
software can be used to map the source data to the target database, integrate and transform the data, generate
code for the data transformation, and move the data to the warehouse.

Benefits of Metadata Repository


1. It provides a set of tools for enterprise-wide metadata management.

2. It eliminates and reduces inconsistency, redundancy, and underutilization.

3. It improves organization control, simplifies management, and accounting of information assets.

4. It increases coordination, understanding, identification, and utilization of information assets.

5. It enforces CASE development standards with the ability to share and reuse metadata.

6. It leverages investment in legacy systems and utilizes existing applications.

7. It provides a relational model for heterogeneous RDBMS to share information.

8. It provides a useful data administration tool to manage corporate information assets with the data dictionary.

9. It increases reliability, control, and flexibility of the application development process.

What is Data Mart?

A data mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data
subject, which may be distributed to support business needs. Data marts are analytical data stores designed to
focus on particular business functions for a specific community within an organization. Data marts are often derived
from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data
warehouse is created from the union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather, store, access, and
analyze data. Data marts can be used by smaller businesses to utilize the data they have accumulated, since they are
less expensive than implementing a full data warehouse.

Reasons for creating a data mart

o Creates collective data by a group of users

o Easy access to frequently needed data

o Ease of creation

o Improves end-user response time

o Lower cost than implementing a complete data warehouse


o Potential clients are more clearly defined than in a comprehensive data warehouse

o It contains only essential business data and is less cluttered.

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts

o Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset or a physical subset of a larger data warehouse. According to this
technique, the data marts are treated as subsets of a data warehouse. In this technique, a data warehouse is created
first, from which further data marts can be created. These data marts are dependent on the data warehouse and
extract the essential records from it. Since the data warehouse creates the data marts, there is no need for data mart
integration. It is also known as a top-down approach.

Independent Data Marts

The second approach uses independent data marts (IDM). Here, independent data marts are created first, and then a
data warehouse is designed using these multiple independent data marts. In this approach, as all the data marts are
designed independently, the integration of the data marts is required. It is also termed a bottom-up approach, as the
data marts are integrated to develop the data warehouse.

Other than these two categories, one more type exists, called "Hybrid Data Marts."

Hybrid Data Marts

It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations,
especially when ad hoc integration is needed, such as after a new group or product is added to the organization.

Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design the schema, construct the physical storage, populate
the data mart with data from source systems, access it to make informed decisions, and manage it over time. So, the
steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from initiating the request
for a data mart through gathering data about the requirements and developing the logical and physical design of the
data mart.

It involves the following tasks:

1. Gathering the business and technical requirements

2. Identifying data sources

3. Selecting the appropriate subset of data

4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data mart to provide fast
and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces associated with the data mart.

2. Creating the schema objects, such as tables and indexes, described in the design step.

3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right
format and level of detail, and moving it into the data mart (a sketch follows the task list).

It involves the following tasks:

1. Mapping data sources to target data sources

2. Extracting data

3. Cleansing and transforming the information.

4. Loading data into the data mart

5. Creating and storing metadata
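The populating tasks might look roughly like the sketch below; the source fields, the cleansing rules, and the mart tables are hypothetical, and sqlite3 stands in for the data mart database.

```python
import sqlite3

# 1-2. Map and extract: rows pulled from a hypothetical source system.
source = [
    {"cust": " Alice ", "region": "north", "sales": "120.5"},
    {"cust": "Bob", "region": None, "sales": "80"},
]

# 3. Cleanse and transform: trim names, default missing regions, cast amounts.
clean = [
    (r["cust"].strip(), (r["region"] or "UNKNOWN").upper(), float(r["sales"]))
    for r in source
]

# 4. Load into the data mart.
mart = sqlite3.connect("sales_mart.db")
mart.execute("CREATE TABLE IF NOT EXISTS mart_sales (customer TEXT, region TEXT, sales REAL)")
mart.executemany("INSERT INTO mart_sales VALUES (?, ?, ?)", clean)

# 5. Create and store metadata describing the load.
mart.execute("CREATE TABLE IF NOT EXISTS load_metadata (loaded_at TEXT, row_count INTEGER)")
mart.execute("INSERT INTO load_metadata VALUES (datetime('now'), ?)", (len(clean),))
mart.commit()
```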

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs and
publishing them.

It involves the following tasks:


1. Setting up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database
operations and object names into business terms, so that the end-clients can interact with the data
mart using words that relate to the business functions (see the sketch after this list).

2. Setting up and managing database structures, such as summarized tables, that help queries submitted through the
front-end tools execute rapidly and efficiently.
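A meta layer of the kind described in the first task can be as simple as a mapping from business terms to physical column names. The sketch below is purely illustrative; the business terms and table names are hypothetical.

```python
# Hypothetical meta layer: business terms mapped to physical table.column names.
META_LAYER = {
    "Customer Name": "mart_sales.customer",
    "Sales Region": "mart_sales.region",
    "Revenue": "mart_sales.sales",
}

def build_query(business_columns):
    """Translate business terms into a SQL query against the data mart."""
    physical = [META_LAYER[c] for c in business_columns]
    return f"SELECT {', '.join(physical)} FROM mart_sales"

print(build_query(["Customer Name", "Revenue"]))
# SELECT mart_sales.customer, mart_sales.sales FROM mart_sales
```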

Managing

This step involves managing the data mart over its lifetime. In this step, management functions are performed such as:

1. Providing secure access to the data.

2. Managing the growth of the data.

3. Optimizing the system for better performance.

4. Ensuring the availability of data even in the event of system failures.

Difference between Data Warehouse and Data Mart

Data Warehouse: A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. Data Mart: A data mart is only a subtype of a data warehouse; it is architected to meet the requirements of a specific user group.

Data Warehouse: It may hold multiple subject areas. Data Mart: It holds only one subject area, for example Finance or Sales.

Data Warehouse: It holds very detailed information. Data Mart: It may hold more summarized data.

Data Warehouse: It works to integrate all data sources. Data Mart: It concentrates on integrating data from a given subject area or set of source systems.

Data Warehouse: In data warehousing, a fact constellation schema is used. Data Mart: In a data mart, star and snowflake schemas are used.

Data Warehouse: It is a centralized system. Data Mart: It is a decentralized system.

Data Warehouse: Data warehousing is data-oriented. Data Mart: A data mart is project-oriented.

Data Warehouse Delivery Process

Now we discuss the delivery process of the data warehouse. The main steps used in the data warehouse delivery
process are as follows:

IT Strategy: A data warehouse project must have an IT strategy for procuring and retaining funding.

Business Case Analysis: After the IT strategy has been designed, the next step is the business case. It is essential to
understand the level of investment that can be justified and to recognize the projected business benefits which
should be derived from using the data warehouse.

Education & Prototyping: The company will experiment with the ideas of data analysis and educate itself on the
value of the data warehouse. This is valuable and should be encouraged if this is the company's first exposure to the
benefits of DSS data. Prototyping can advance this education and works best with small working models. Prototyping
requires business requirements, a technical blueprint, and structures.

Business Requirements: These include:

The logical model for the data within the data warehouse.

The source systems that provide this data (mapping rules).

The business rules to be applied to the information.

The query profiles for the immediate requirement.

Technical Blueprint: This stage arranges the architecture of the warehouse. The technical blueprint of the delivery
process produces an architecture plan that satisfies long-term requirements. It lays out the server and data mart
architecture and the essential components of the database design.

Building the Vision: This is the phase where the first production deliverable is produced. This stage will probably
create significant infrastructure elements for extracting and loading information, but will limit them to the extraction
and load of selected information sources.

History Load: The next step is the one where the remainder of the required history is loaded into the data warehouse.
This means that no new entities are added to the data warehouse, but additional physical tables would probably be
created to store the increased record volumes.

AD-Hoc Query: In this step, we configure an ad-hoc query tool to operate against the data warehouse.

These end-customer access tools are capable of automatically generating the database query that answers any
question posed by the user.

Automation: The automation phase is where many of the operational management processes are fully automated
within the DWH. These would include:

Extracting & loading the data from a variety of sources systems

Transforming the information into a form suitable for analysis

Backing up, restoring & archiving data

Generating aggregations from predefined definitions within the Data Warehouse.

Monitoring query profiles & determining the appropriate aggregates to maintain system performance.

Extending Scope: In this phase, the scope of the data warehouse is extended to address a new set of business
requirements. This involves loading additional data sources into the warehouse, i.e. the introduction of new data marts.

Requirement Evolution: This is the last step of the delivery process of a data warehouse. As we all know, requirements
are not static and evolve continuously. As the business requirements change, they must be reflected in the system.

Concept Hierarchy

A concept hierarchy is a directed acyclic graph of concepts, where a unique name identifies each of the concepts.

An arc from concept a to concept b denotes that a is a more general concept than b. We can tag text with concepts.

Each text report is tagged with a set of concepts that corresponds to its content.

Tagging a report with a concept implicitly entails tagging it with all the ancestors of that concept in the hierarchy. It is
therefore desirable that a report be tagged with the lowest concept possible.

The method to automatically tag a report to the hierarchy is a top-down approach. An evaluation function determines
whether a report currently tagged to a node can also be tagged to any of its child nodes.

If so, then the tag moves down the hierarchy until it cannot be pushed any further.

The outcome of this step is a hierarchy of reports where, at each node, there is a set of reports sharing a common
concept related to the node. The hierarchy of reports resulting from the tagging step is useful for many text mining
processes.

It is assumed that the hierarchy of concepts is given a priori. We can even obtain such a hierarchy of documents
without a concept hierarchy, by using any hierarchical clustering algorithm, which results in such a hierarchy.

A concept hierarchy defines a sequence of mappings from a set of particular, low-level concepts to more general,
higher-level concepts.

In a data warehouse, it is usually used to express different levels of granularity of an attribute from one of the
dimension tables.

Concept hierarchies are crucial for the formulation of useful OLAP queries. The hierarchies allow the user to
summarize the data at various levels.
For example, using the location hierarchy, the user can retrieve data which summarizes sales for each location, for all
the areas in a given state, or even a given country without the necessity of reorganizing the data.
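The location example can be sketched as a pair of mappings from lower-level to higher-level concepts; the cities, states, and sales figures below are hypothetical.

```python
# Hypothetical concept hierarchy: city -> state -> country.
city_to_state = {"Chicago": "Illinois", "Springfield": "Illinois", "Austin": "Texas"}
state_to_country = {"Illinois": "USA", "Texas": "USA"}

sales_by_city = {"Chicago": 120.0, "Springfield": 45.0, "Austin": 90.0}

# Summarize sales at the state level without reorganizing the detail data.
sales_by_state = {}
for city, amount in sales_by_city.items():
    state = city_to_state[city]
    sales_by_state[state] = sales_by_state.get(state, 0.0) + amount

# Roll up one more level, to the country.
sales_by_country = {}
for state, amount in sales_by_state.items():
    country = state_to_country[state]
    sales_by_country[country] = sales_by_country.get(country, 0.0) + amount

print(sales_by_state)    # {'Illinois': 165.0, 'Texas': 90.0}
print(sales_by_country)  # {'USA': 255.0}
```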

What is Star Schema?

A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions.
A fact is an event that is counted or measured, such as a sale or log in. A dimension includes reference data about the
fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the
simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of the schema
resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table,
and the points of the star are the dimension tables.

Fact Tables

A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of
the fact table is generally a composite key made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts at the same level of
aggregation.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not
have hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are
part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values; they
are generally descriptive, textual values. Dimension tables are usually much smaller than fact tables.

Fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities),
clients, products, times, and channels.
Characteristics of Star Schema

The star schema is well suited to data warehouse database design because of the following features:

o It creates a denormalized database that can quickly provide query responses.

o It provides a flexible design that can be changed easily or added to throughout the development cycle, and
as the database grows.

o It provides a parallel in design to how end-users typically think of and use the data.

o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema,
customers can quickly analyze large, multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:

Query Performance

Because a star schema database has a small number of tables and clear join paths, queries run faster than they do
against OLTP systems. Small single-table queries, frequently against a dimension table, are almost instantaneous. Large
join queries that involve multiple tables take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central fact table. When two
dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables.
This design feature enforces accurate and consistent query results.

Load performance and administration

Structural simplicity also reduces the time required to load large batches of records into a star schema database. By
describing facts and dimensions and separating them into different tables, the impact of a load operation is reduced.
Dimension tables can be populated once and occasionally refreshed. New facts can be added regularly and selectively
by appending records to the fact table.

Built-in referential integrity


A star schema has referential integrity built in when information is loaded. Referential integrity is enforced because
each row in a dimension table has a unique primary key, and all keys in the fact table are legitimate foreign keys
drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be
given the correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins
are meaningful to the end-user because they represent the fundamental relationships between parts of the
underlying business. Customers can also browse dimension table attributes before constructing a query.

Disadvantage of Star Schema

There are some situations that cannot be handled by star schemas; for example, the relationship between a user and a
bank account cannot be described as a star schema, since the relationship between them is many to many.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected to it for
time, branch, item, and geographic locations.

The TIME table has a column for each day, month, quarter, and year. The ITEM table has columns for each item_Key,
item_name, brand, type, supplier_type. The BRANCH table has columns for each branch_key, branch_name,
branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.

In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data, three columns for
BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table is significantly reduced. When we
need to change an item, we need only make a single change in the dimension table, instead of making many changes
in the fact table.
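A hedged sketch of this SALES star schema follows, executed here through sqlite3; the column lists are abbreviated, and the two measure columns on the fact table are an assumption added for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Fact table: foreign keys to every dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold   INTEGER,   -- hypothetical measure
        dollars_sold REAL       -- hypothetical measure
    );
""")

# A typical star-join query: total sales per brand and country.
query = """
    SELECT i.brand, l.country, SUM(f.dollars_sold) AS total_sales
    FROM sales_fact f
    JOIN item_dim i     ON f.item_key = i.item_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY i.brand, l.country
"""
for row in con.execute(query):
    print(row)
```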

We can create even more complex star schemas by normalizing a dimension table into several tables. The normalized
dimension table is called a Snowflake.

What is Snowflake Schema?

A snowflake schema is equivalent to the star schema. "A schema is known as a snowflake if one or more dimension
tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star explodes into more points. It
is called snowflake schema because the diagram of snowflake schema resembles a snowflake. Snowflaking is a
method of normalizing the dimension tables in a STAR schemas. When we normalize all the dimension tables
entirely, the resultant structure resembles a snowflake with the fact table in the middle.

Snowflaking is used to develop the performance of specific queries. The schema is diagramed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into a
snowflake pattern.

The snowflake schema consists of one fact table which is linked to many dimension tables, which can in turn be linked to
other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized
to the third normal form. Each dimension table represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake
schema can have any number of dimensions, and each dimension can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product, Line, and
Family dimension tables. The Market dimension has two dimension tables with Store as the primary dimension table,
and Location as the outrigger dimension table. The product dimension has three dimension tables with Product as
the primary dimension table, and the Line and Family table are the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more disk space than a
more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality
into separate dimension tables that relate to the core dimension table by using foreign keys. Snowflaking for the sole
purpose of minimizing disk space is not recommended, because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy; dimension tables are decomposed into
multiple dimension tables.

The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes quantity, price,
and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The star schema for sales, as shown above, contains only five tables, whereas the normalized version now extends
to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each original
dimension table are moved out to form separate tables. These new tables are connected back to the original
dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable
for many-to-many and one-to-many relationships between dimension levels.

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is an improvement in the performance of some queries due to minimized
disk storage requirements and joins against smaller lookup tables.

2. It provides greater scalability in the interrelationship between dimension levels and components.

3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the
increased number of lookup tables.

2. Queries are more complex and hence more difficult to understand.

3. More tables mean more joins, and therefore longer query execution times.

Difference between Star and Snowflake Schemas

Star Schema

o In a star schema, the fact table will be at the center and is connected to the dimension tables.

o The tables are completely in a denormalized structure.

o SQL query performance is good as fewer joins are involved.

o Data redundancy is high and occupies more disk space.

Snowflake Schema
o A snowflake schema is an extension of the star schema where the dimension tables are connected to one or
more other dimension tables.

o The tables are partially denormalized in structure.

o The performance of SQL queries is somewhat lower when compared to a star schema, as more joins are
involved.

o Data redundancy is low and occupies less disk space when compared to star schema.

Let's look at the differences between the star and snowflake schemas.


Basis for Comparison | Star Schema | Snowflake Schema
Ease of maintenance/change | It has redundant data and is hence less easy to maintain/change | No redundancy, so it is easier to maintain and change
Ease of use | Less complex queries and simple to understand | More complex queries and therefore less easy to understand
Parent table | A dimension table will not have any parent table | A dimension table will have one or more parent tables
Query performance | Fewer foreign keys and hence shorter query execution time | More foreign keys and thus longer query execution time
Normalization | It has denormalized tables | It has normalized tables
Type of data warehouse | Good for data marts with simple relationships (one-to-one or one-to-many) | Good for a data warehouse core, to simplify complex relationships (many-to-many)
Joins | Fewer joins | Higher number of joins
Dimension table | It contains only a single dimension table for each dimension | It may have more than one dimension table for each dimension
Hierarchies | Hierarchies for the dimensions are stored in the dimension table itself | Hierarchies are broken into separate tables; these help to drill down from the topmost to the lowermost level
When to use | When the dimension tables contain a smaller number of rows, we can go for a star schema | When the dimension tables store a huge number of rows with redundant information and space is an issue, we can choose a snowflake schema to save space
Data warehouse system | Works best in any data warehouse/data mart | Better for a small data warehouse/data mart

What Is a Data Lake?


A data lake is a storage system that keeps large amounts of raw data in its original form. It can store different types of
data: structured, semi-structured, and unstructured. Unlike a data warehouse, which organizes and processes data
before it is stored, a data lake leaves the data raw; it must be cleaned, joined, and possibly aggregated later to make it
useful, which requires processing power to manage and analyze it.

Key features of data lakes include:

• Storing data in its original format

• Supporting all data types

• Using a schema-on-read approach

• High scalability and flexibility

• Allowing advanced analytics and machine learning

Why Use a Data Lake?

Using a data lake provides several advantages, especially when used alongside a traditional data warehouse (DW). Some
of the benefits include:

• Quick Data Storage: Data can be stored quickly without any setup, allowing skilled users like data analysts
and data scientists to access it faster. This quick access helps them generate reports and train machine
learning models more efficiently.

• Cost Savings: Data lakes usually offer cheaper computing options compared to data warehouses.

• Efficient Investigation: If users need source data, it can be quickly copied to the data lake for a quick review
before creating a structure in the data warehouse.

• High Performance: Multiple computing options can work on data simultaneously, which improves
performance.

• Flexibility: Data lakes allow for more complex data modifications using different methods, unlike the
restrictions of SQL in a data warehouse.

• No Maintenance Windows: Data lakes provide continuous 24/7 access to the data warehouse, minimizing
conflicts between users and heavy data processing tasks.


Key Components of a Data Lake

These components work together to help store and manage data effectively. Each layer is important for getting data
in, keeping it safe, processing it, and making it accessible, so users can easily gain insights and make informed
decisions.

• Data Ingestion Layer: This is how data enters the lake. Data can come from different sources, like databases,
applications, or sensors. It includes tools for batch ingestion, real-time streaming, and change data capture.

• Storage Layer: This is the "lake" where data is stored. It's usually built on systems like Hadoop Distributed File
System (HDFS) or cloud storage like Amazon S3.
• Metadata Management Layer: This layer keeps track of important details about the data in the lake, such as
where it comes from, its format, and how it relates to other data. It helps users find and understand the data
better.

• Data Processing Layer: This layer cleans, transforms, and analyzes the data. Common tools like Apache Spark
and Flink are used here to support both batch and real-time processing.

• Data Access Layer: This layer allows users and applications to retrieve data from the lake. It includes SQL
query engines, data visualization tools, and APIs, facilitating efficient data access.

• Security: This layer protects data privacy and makes sure the organization follows regulations. It includes
access control to manage who can view the data, encryption to protect sensitive information, and auditing
features to monitor data usage.

• Data Governance Layer: This layer focuses on managing data quality and security. It includes tools to keep
data accurate, manage metadata, and control who can access the data, keeping everything organized and
following rules.

• Data Workflow and Monitoring Layer: This layer manages the flow of data and checks system performance.
It makes sure all processes run smoothly and helps quickly find and fix any issues.

Bottom-Up Approach in Data Lake Architecture

The bottom-up approach in data lakes allows users to start working with data quickly and easily, without needing a
lot of initial planning. This method is great for looking at data when you're unsure what questions to ask. Here's how
it works.

• Exploring Data: Users can start by looking through the data without specific questions in mind. This
exploration helps them find valuable insights they might not notice otherwise.

• Predictive Analytics: Once patterns are identified, data scientists can use machine learning to analyze
historical data and predict future events.

• Prescriptive Analytics: Going a step further, this approach suggests actions based on those predictions. For
example, it can recommend the best delivery routes in logistics or ways to reduce risks.

• Wider Applications: Data lakes were first used mainly for predictive and prescriptive analytics, but now they
are valuable for many types of analysis. This makes them useful for organizations in various fields.

• Data Modeling: If users find useful data during their exploration, they can later organize and transfer it to a
relational data warehouse for easier access. Data modeling helps clarify how the data is related and how it
should be arranged.

The bottom-up approach allows users to interact with data more freely, leading to fresh insights and improved
decision-making.

Multiple Data Lakes

Creating just one large data lake for all your data might seem like the best approach, making it easier to find and
combine information. However, there are several reasons why having multiple separate data lakes can be beneficial.

Advantages of Multiple Data Lakes

• Organizational Needs: Different teams may need their own data lakes for specific projects, helping them
manage their data better.

• Compliance and Security: Rules often require keeping sensitive data separate. Multiple data lakes can help
ensure that confidential information stays safe and follows regulations.

• Cloud Management: Having several data lakes can help you stay within cloud storage limits. Each lake can
have its own rules, making it easier to follow company guidelines and track costs.
• Performance and Availability: Placing data lakes closer to you can make access faster. If one lake has
problems, you can quickly switch to another lake without losing access to data.

• Data Retention Management: Different data lakes can have their own rules for how long to keep data,
ensuring you meet legal requirements while using storage efficiently.

Disadvantages of Multiple Data Lakes

While there are clear benefits, managing multiple data lakes can be more complicated and costly. It may require extra
resources and skills. Moving data between lakes can also be difficult, especially if they are located far apart, which
can slow down access to information needed for reports.

How is data lake architecture different from traditional storage systems?

Data Lake architecture is different from traditional storage systems in several ways. Data lakes can hold raw data in
various formats, while traditional systems need data to be structured first. This means they can store all types of
data, including structured, semi-structured, and unstructured, making them more flexible. Data lakes are also easier
to scale, allowing for the management of large amounts of data without high costs. Users can access and analyze
data quickly without needing much preparation. Overall, data lakes provide more flexibility and efficiency for today's
data needs.

Best Practices for Data Lake Design

Designing a data lake effectively is important for its success. Here are some key practices to keep in mind.

• Plan Carefully: Take time to identify all the data sources you currently use and might use in the future.
Understand the type, size, and speed of the data. A good design now can save you from expensive changes
later.

• Organize into Layers: Divide your data lake into several layers to improve data quality and manageability.
Each layer has a specific role, moving from raw data to polished information:

o Raw Layer: Keeps unprocessed data in its original form and stores historical records.

o Conformed Layer: Aligns all data formats (like changing to Parquet) for consistency.

o Cleansed Layer: Improves data by cleaning and combining it into usable datasets.

o Presentation Layer: Applies business logic to prepare data for analysis, making it easy to understand.

o Sandbox Layer (optional): A space for data scientists to experiment and analyze data freely.

• Create a Folder Structure: Set up a clear folder structure for each layer. This organization makes it easy for
users to find data and improves security and performance.

• Focus on Governance: Implement data governance practices to maintain data quality and make sure
everything follows the rules. This helps prevent a "data swamp", where data becomes disorganized and hard
to manage.

• Use Versatile Applications: Data lakes can handle many types of analysis. They started with predictive and
prescriptive analytics but now support various analyses across different industries.

• Facilitate Data Modeling: When users find useful data, they can organize it and later move it to a relational
data warehouse. Data modeling helps show how data is related and organized, making it easier to access and
use.

Real-world Use Cases of Data Lakes

Data lakes help businesses in different ways.

• Customer 360: Bringing together all customer data to better understand their needs.

• IoT Analytics: Analyzing data from connected devices to improve products and services.
• Risk Analysis: Using past data to identify and manage potential risks.

• Personalization: Customizing products or services to fit what each customer prefers.

Future Trends in Data Lakes

Here are some new directions in data management.

• Data Mesh: A decentralized approach to managing data across the organization.

• Automated Data Quality: Using technology to automatically detect and fix data issues.

• Real-time Analytics: Providing instant insights from live data streams.

• Multi-cloud Data Lakes: Storing data across different cloud services for greater flexibility and reliability.

Difference between Data Lake and Data Warehouse


The following table highlights all the key differences between data lake and data warehouse

Key | Data Lake | Data Warehouse
Basic | A data lake is a very big storage repository used to store raw, unstructured data (machine-to-machine data, logs flowing through in real time). | A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
Normalized | Data is not in normalized form | A data warehouse has a denormalized schema
Schema creation | Schema is created after the data is loaded | Schema is created before the data is loaded
ELT/ETL | It uses the ELT process | It uses the ETL process
Uses | It is ideal for those who want in-depth analysis | It is good for operational users

What is Data Lake?

In the fast-paced world of data science, managing and harnessing vast amounts of raw data is crucial for deriving
meaningful insights. One technology that has revolutionized this process is the concept of Data Lakes. A Data Lake
serves as a centralized repository that can store massive volumes of raw data until it is needed for analysis.

In this article, let's delve into the key points that shed light on how data lakes efficiently manage and store raw
data for later use, data lake architecture, and the challenges of data lakes.

What is a Data Lake?

A Data Lake is a storage system that can store structured and unstructured data at any scale. It differs from traditional
databases by allowing data to be stored in its raw, unprocessed form.

1. Structuring Raw Data: Unlike traditional databases that require structured data, Data Lakes accommodate
raw and diverse data formats, including text, images, videos, and more. This flexibility is vital as it enables
organizations to store data in its original state, preserving its integrity and context.

2. Scalability and Cost-Efficiency: Data Lakes can scale horizontally, accommodating massive amounts of data
from various sources. The use of scalable and cost-effective storage solutions, such as cloud storage, makes it
feasible to store large volumes of raw data without incurring exorbitant costs.

3. Integration with Data Processing Tools: Data Lakes integrate seamlessly with data processing tools,
facilitating the transformation of raw data into a usable format for analysis. Popular tools like Apache
Spark or Apache Hadoop can process data within the Data Lake, ensuring that insights can be derived
without the need to transfer data between systems.

4. Metadata Management: Metadata plays a crucial role in Data Lakes, providing information about the data's
structure, source, and quality. Metadata management ensures that users can easily discover, understand, and
trust the data within the Data Lake.

Different data processing tools

Apache Spark

• Overview: Open-source, distributed computing system for fast and versatile large-scale data processing.

• Key Features: In-memory processing, multi-language support (Scala, Python, Java), compatibility with diverse
data sources.

Apache Hadoop

• Overview: Framework for distributed storage and processing of large datasets using a simple programming
model.

• Key Features: Scalability, fault-tolerance, Hadoop Distributed File System (HDFS) for storage.

Apache Flink

• Overview: Stream processing framework for big data analytics with a focus on low-latency and high-
throughput.

• Key Features: Event time processing, exactly-once semantics, support for batch processing.

TensorFlow

• Overview: Open-source machine learning framework developed by Google.

• Key Features: Ideal for deep learning applications, supports neural network models, extensive tools for
model development.

Apache Storm
• Overview: Real-time stream processing system for handling data in motion.

• Key Features: Scalability, fault-tolerance, integration with various data sources.

Data Lake Architecture

A data lake is a centralized repository that allows organizations to store all their structured and unstructured data at any
scale. Unlike traditional data storage systems, a data lake enables the storage of raw, granular data without the need
for a predefined schema. The architecture of a data lake is designed to handle massive volumes of data from various
sources and allows for flexible processing and analysis.

Data-Lake Architecture

Essential Elements of a Data Lake and Analytics Solution

1. Storage Layer: The core of a data lake is its storage layer, which can accommodate structured, semi-
structured, and unstructured data. It is typically built on scalable and distributed file systems or object
storage solutions.

2. Ingestion Layer: This layer involves mechanisms for collecting and loading data into the data lake. Various
tools and technologies, such as ETL (Extract, Transform, Load) processes, streaming data pipelines, and
connectors, are used for efficient data ingestion.

3. Metadata Store: Metadata management is crucial for a data lake. A metadata store keeps track of
information about the data stored in the lake, including its origin, structure, lineage, and usage.

4. Security and Governance: As data lakes hold diverse and sensitive information, robust security measures and
governance policies are essential. Access controls, encryption, and auditing mechanisms help ensure data
integrity and compliance with regulations.

5. Processing and Analytics Layer: This layer involves tools and frameworks for processing and analyzing the
data stored in the lake. Technologies like Apache Spark, Apache Flink, and machine learning frameworks can
be integrated for diverse analytics workloads.

6. Data Catalog: A data catalog provides a searchable inventory of available data assets within the data lake.

Data Warehouse vs. Data Lake

Data Warehouse: Data warehouses are designed for processing and analyzing structured data. They follow a schema-
on-write approach, meaning data must be structured before being ingested. Data warehouses are optimized for
complex queries and reporting, making them suitable for business intelligence and decision support.

Data Lake: Data lakes, on the other hand, support structured and unstructured data in its raw form. They follow a
schema-on-read approach, allowing users to apply the schema at the time of analysis. Data lakes are more suitable
for handling large volumes of diverse data types and are well-suited for exploratory and advanced analytics.
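
To make the schema-on-read idea concrete, here is a minimal sketch using Apache Spark's Java API, one of the processing tools mentioned above. The data lake path and the eventType column are hypothetical, used only for illustration; the schema is inferred when the raw files are read, not enforced when they were written.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaOnReadExample {
    public static void main(String[] args) {
        // Start a local Spark session; in a real data lake this would run on a cluster.
        SparkSession spark = SparkSession.builder()
                .appName("SchemaOnReadExample")
                .master("local[*]")
                .getOrCreate();

        // Read raw JSON event files straight from the lake. The schema is inferred at
        // read time (schema-on-read). The path is an assumed example location.
        Dataset<Row> events = spark.read().json("s3a://example-data-lake/raw/events/");

        events.printSchema();                       // inspect the structure that was inferred
        events.groupBy("eventType").count().show(); // quick exploratory aggregation (assumes an eventType field)

        spark.stop();
    }
}

The same files could later be cleaned and loaded into a warehouse table once a useful structure has been discovered, which is exactly the exploratory workflow described in this section.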

Challenges of Data Lakes

1. Data Quality: Ensuring data quality in a data lake can be challenging, as it stores raw and unprocessed data.
Without proper governance, the lake may become a "data swamp" with inconsistent and unreliable
information.

2. Security Concerns: As data lakes accumulate a vast amount of sensitive data, ensuring robust security
measures is crucial to prevent unauthorized access and data breaches.

3. Metadata Management: Managing metadata and maintaining a comprehensive data catalog can be
complex, making it difficult for users to discover and understand the available data.

4. Integration Complexity: Integrating data from diverse sources and ensuring compatibility can be challenging,
especially when dealing with varied data formats and structures.
5. Skill Requirements: Implementing and managing a data lake requires specialized skills in big data
technologies, which might pose challenges for organizations lacking the necessary expertise.

Values of Data Lakes

• Data Exploration and Discovery: Data lakes enable users to store diverse types of raw and unstructured data
in their native formats. This allows more flexible and comprehensive storage of data.

• Scalability: Data lakes provide scalable storage solutions, allowing organizations to handle massive volumes of
data.

• Cost-Efficiency: Data lakes often use cost-effective storage solutions, such as object storage, which is suitable
for storing large volumes of raw data.

• Flexibility and Agility: Data lakes allow a schema-on-read approach, which means the data is not rigidly
structured upon ingestion.

• Advanced Analytics: Data lakes serve as a foundation for advanced analytics, including machine learning,
artificial intelligence, and predictive analysis.

What is Big Data

Data that is very large in size is called Big Data. Normally we work on data of size MB (Word documents, Excel sheets) or
at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data. It is stated that almost 90%
of today's data has been generated in the past 3 years.

Sources of Big Data

This data comes from many sources, such as:

o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day
basis, as they have billions of users worldwide.

o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users'
buying trends can be traced.

o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and
processed to forecast the weather.

o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly, and for this they store the data of their millions of users.

o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

3V's of Big Data

1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every
2 years.

2. Variety: Nowadays data is not stored only in rows and columns. Data is structured as well as unstructured. Log
files and CCTV footage are unstructured data; data that can be saved in tables is structured data, like the
transaction data of a bank.

3. Volume: The amount of data we deal with is very large, on the order of petabytes.

Use case

An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who
have spent the most in the previous year. Moreover, it wants to find the buying trends of these customers so that the
company can suggest more items relevant to them.
Issues

Huge amount of unstructured data which needs to be stored, processed and analyzed.

Solution

Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity
hardware to form clusters and store data in a distributed fashion. It works on the "write once, read many times" principle.

Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output.

Analyze: Pig and Hive can be used to analyze the data.

Cost: Hadoop is open source, so cost is no longer an issue.
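
To make the processing step concrete, here is a minimal sketch of the map and reduce logic for totalling each customer's spending, written against the Hadoop MapReduce API covered later in this tutorial. It assumes each log line has the hypothetical form customerId,amount; the class names are made up for illustration, and picking the top 10 from the resulting totals would be a small follow-up step.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomerSpend {

    // Map: emit (customerId, amount) for every transaction line.
    public static class SpendMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // assumed line format: customerId,amount
            context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // Reduce: sum all amounts for one customer; the top 10 totals can then be selected.
    public static class SpendReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(customer, new DoubleWritable(total));
        }
    }
}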

What is Hadoop

Hadoop is an open source framework from Apache that is used to store, process, and analyze data that is very huge
in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline
processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled
up just by adding nodes to the cluster.

Modules of Hadoop

HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it.
It states that files will be broken into blocks and stored on nodes across the distributed architecture.

Yarn: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.

Map Reduce: This is a framework which helps Java programs perform parallel computation on data using key-value
pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The
output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.

Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File
System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task
Tracker, NameNode, and DataNode whereas the slave node includes DataNode and TaskTracker.

Hadoop Architecture

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the distributed file system used by Hadoop. It has a master/slave
architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that
perform the role of slaves.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to
develop HDFS. So any machine that supports Java language can easily run the NameNode and DataNode software.
NameNode

It is a single master server that exists in the HDFS cluster.

As it is a single node, it may become a single point of failure.

It manages the file system namespace by executing operations such as opening, renaming, and closing files.

It simplifies the architecture of the system.

DataNode

The HDFS cluster contains multiple DataNodes.

Each DataNode contains multiple data blocks.

These data blocks are used to store data.

It is the responsibility of the DataNode to serve read and write requests from the file system's clients.

It performs block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker

The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode.

In response, NameNode provides metadata to Job Tracker.

Task Tracker

It works as a slave node for the Job Tracker.

It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a
Mapper.

MapReduce Layer

The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. In
response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the TaskTracker fails or
time out. In such a case, that part of the job is rescheduled.

Advantages of Hadoop

Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that
process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of
data in minutes and petabytes in hours.

Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.

Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective
compared to a traditional relational database management system.

Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down
or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is
replicated three times, but the replication factor is configurable.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper
published by Google.
History of Hadoop

Let's focus on the history of Hadoop in the following steps:

In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web
crawler software project.

While working on Apache Nutch, they were dealing with big data. Storing that data would have required a great deal of
money, which became a serious constraint for the project. This problem became one of the important reasons for the
emergence of Hadoop.

In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system
developed to provide efficient access to data.

In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large
clusters.

In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File
System). This file system also included MapReduce.

In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a
new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was
released in the same year.

Doug Cutting named his project Hadoop after his son's toy elephant.

In 2007, Yahoo ran two clusters of 1000 machines.

In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds.

In 2013, Hadoop 2.2 was released.

In 2017, Hadoop 3.0 was released.

Year | Event
2003 | Google released the paper, Google File System (GFS).
2004 | Google released a white paper on MapReduce.
2006 | Hadoop introduced. Hadoop 0.1.0 released. Yahoo deploys 300 machines and within this year reaches 600 machines.
2007 | Yahoo runs 2 clusters of 1000 machines. Hadoop includes HBase.
2008 | YARN JIRA opened. Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds. Yahoo clusters loaded with 10 terabytes per day. Cloudera was founded as a Hadoop distributor.
2009 | Yahoo runs 17 clusters of 24,000 machines. Hadoop becomes capable enough to sort a petabyte. MapReduce and HDFS become separate subprojects.
2010 | Hadoop added support for Kerberos. Hadoop operates 4,000 nodes with 40 petabytes. Apache Hive and Pig released.
2011 | Apache ZooKeeper released. Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012 | Apache Hadoop 1.0 version released.
2013 | Apache Hadoop 2.2 version released.
2014 | Apache Hadoop 2.6 version released.
2015 | Apache Hadoop 2.7 version released.
2017 | Apache Hadoop 3.0 version released.
2018 | Apache Hadoop 3.1 version released.

What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and
replicated to ensure durability against failures and high availability to parallel applications.

It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.

Where to use HDFS

Very Large Files: Files should be hundreds of megabytes, gigabytes, or more in size.

Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record.
HDFS is built on a write-once, read-many-times pattern.

Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

Low Latency data access: Applications that require very little time to access the first data should not use HDFS, as it
gives importance to the whole data set rather than the time to fetch the first record.

Lots of Small Files: The name node holds the metadata of files in memory, and if the files are small in size, this takes a
lot of the name node's memory, which is not feasible.

Multiple Writes: It should not be used when we have to write to the same file multiple times.

HDFS Concepts

Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this
is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a local file
system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e. a 5 MB file stored
in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks.

Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and
manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata information being file
permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this
information is handled by a single machine. The file system operations like opening, closing, and renaming are
executed by it.

Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name
node periodically with the list of blocks they are storing. The data node, being commodity hardware, also does the
work of block creation, deletion, and replication as instructed by the name node.

[Figure: HDFS DataNode and NameNode]

[Figure: HDFS read operation]

[Figure: HDFS write operation]

Since all the metadata is stored in name node, it is very important. If it fails the file system can not be used as there
would be no way of knowing how to reconstruct the files from blocks present in data node. To overcome this, the
concept of secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic
checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime
and loss of data.
Starting HDFS

The HDFS should be formatted initially and then started in the distributed mode. Commands are given below.

To Format $ hadoop namenode -format

To Start $ start-dfs.sh

HDFS Basic File Operations

Putting data to HDFS from local file system

First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

Display the contents of the HDFS folder

$ hadoop fs -ls /user/test

Copying data from HDFS to local file system

$ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt

Compare the files and see that both are same

$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting

hadoop fs -rmr <arg>

Example:

hadoop fs -rmr /user/sonoo/
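
The same basic operations can also be performed programmatically. Below is a minimal sketch using Hadoop's Java FileSystem API; the paths mirror the shell examples above, and the configuration is assumed to pick up a running cluster from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOperations {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the Hadoop configuration files on the classpath (assumed to point at the cluster).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create the HDFS folder (equivalent of: hadoop fs -mkdir /user/test)
        fs.mkdirs(new Path("/user/test"));

        // Copy a local file into HDFS (equivalent of: hadoop fs -copyFromLocal ...)
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"), new Path("/user/test/data.txt"));

        // List the folder contents (equivalent of: hadoop fs -ls /user/test)
        for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
            System.out.println(status.getPath());
        }

        // Copy the file back to the local file system (equivalent of: hadoop fs -copyToLocal ...)
        fs.copyToLocalFile(new Path("/user/test/data.txt"), new Path("/usr/bin/data_copy.txt"));

        fs.close();
    }
}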

HDFS Other commands

The below is used in the commands

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.


"<file>" means any filename.

"<src>" and "<dest>" are path names in a directed operation.

"<localSrc>" and "<localDest>" are paths as above, but on the local file system

put <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

copyFromLocal <localSrc><dest>

Identical to -put


moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the
local copy on success.

get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.

cat <filename>

Displays the contents of filename on stdout.

moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the
target over time)

touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is
already size 0.

test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.

stat [format] <path>

Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o),
replication (%r), and modification date (%y, %Y).

What is YARN

Yet Another Resource Negotiator (YARN) takes Hadoop beyond MapReduce-only processing and lets other
applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so
MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster
utilization.

Components Of YARN

o Client: For submitting MapReduce jobs.

o Resource Manager: To manage the use of resources across the cluster

o Node Manager: For launching and monitoring the compute containers on machines in the cluster.

o Map Reduce Application Master: Coordinates and monitors the tasks running the MapReduce job. The application master and the
MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node
managers.

JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling
resources and tracking progress. Hadoop 2.0 has the ResourceManager and NodeManager to
overcome the shortfalls of the JobTracker and TaskTracker.

Benefits of YARN

o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for
10,000 nodes and 100,000 tasks.

o Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots,
thus increasing utilization.

o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading
MapReduce more manageable.


MapReduce Tutorial

MapReduce tutorial provides basic and advanced concepts of MapReduce. Our MapReduce tutorial is designed for
beginners and professionals.

Our MapReduce tutorial includes all topics of MapReduce such as Data Flow in MapReduce, Map Reduce API, Word
Count Example, Character Count Example, etc.

What is MapReduce?

MapReduce is a data processing tool used to process data in parallel in a distributed form. It was
developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters,"
published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase. In the Mapper, the
input is given in the form of a key-value pair. The output of the Mapper is fed to the reducer as input. The reducer
runs only after the Mapper is over. The reducer too takes input in key-value format, and the output of reducer is the
final output.

Steps in Map Reduce

o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in
this case.

o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts
on these list of <key, value> pairs and sends out unique keys and a list of values associated with this unique
key <key, list(values)>.

o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of
values for each unique key, and the final output <key, value> is stored/displayed.

Sort and Shuffle


The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task is complete, the
results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from
each Mapper <k2,v2>, we collect all the values for each unique key k2. This output from the shuffle phase in the form
of <k2, list(v2)> is sent as input to reducer phase.

Usage of MapReduce

o It can be used in various applications such as document clustering, distributed sorting, and web link-graph
reversal.

o It can be used for distributed pattern-based searching.

o We can also use MapReduce in machine learning.

o It was used by Google to regenerate Google's index of the World Wide Web.

o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile
environment.

Data Flow In MapReduce


MapReduce is used to compute huge amounts of data. To handle the upcoming data in a parallel and distributed
form, the data has to flow through various phases.

Phases of MapReduce data flow

Input reader
The input reader reads the upcoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB).
Each data block is associated with a Map function.

Once the input reader reads the data, it generates the corresponding key-value pairs. The input files reside in HDFS.

Note - The input data can be in any form.


Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The
map input and output types may be different from each other.

Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value
and returns the index of the reducer.
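
For illustration, the sketch below mirrors the behaviour of Hadoop's default HashPartitioner: the reducer index is derived from the hash of the key. The Text/IntWritable type parameters and the class name are just example choices, not part of this tutorial's text.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashStylePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Non-negative hash of the key, mapped onto the available reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom partitioner like this can be registered on the job when records with related keys must be routed to the same reducer.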

Shuffling and Sorting


The data is shuffled between/within nodes so that it moves out of the map phase and gets ready to be processed by the reduce
function. Sometimes, the shuffling of data can take a lot of computation time.

The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison
function and arranged in sorted form.

Reduce function
The Reduce function is invoked for each unique key. The keys are already arranged in sorted order. The Reduce function
iterates over the values associated with each key and generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the
Reduce output to stable storage.

MapReduce API

In this section, we focus on MapReduce APIs. Here, we learn about the classes and methods used in MapReduce
programming.

MapReduce Mapper Class

In MapReduce, the role of the Mapper class is to map the input key-value pairs to a set of intermediate key-value
pairs. It transforms the input records into intermediate records.

These intermediate records are associated with a given output key and passed to the Reducer for the final output.

Methods of Mapper Class

void cleanup(Context context) | This method is called only once at the end of the task.
void map(KEYIN key, VALUEIN value, Context context) | This method is called once for each key-value pair in the input split.
void run(Context context) | This method can be overridden to control the execution of the Mapper.
void setup(Context context) | This method is called only once at the beginning of the task.
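
As a concrete Mapper, here is the standard WordCount mapper sketch (it is a common illustration, not code taken from this tutorial). It assumes a TextInputFormat input, where keys are byte offsets and values are lines of text, and it emits (word, 1) pairs.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}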

MapReduce Reducer Class

In MapReduce, the role of the Reducer class is to reduce the set of intermediate values. Its implementations can
access the Configuration for the job via the JobContext.getConfiguration() method.

Methods of Reducer Class

void cleanup(Context context) | This method is called only once at the end of the task.
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) | This method is called once for each key.
void run(Context context) | This method can be used to control the tasks of the Reducer.
void setup(Context context) | This method is called only once at the beginning of the task.
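
A matching Reducer for the word-count mapper sketched above might look like this (again a standard WordCount sketch rather than code from this tutorial). It receives each word together with the list of counts emitted for it and writes out the total.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted by the mappers for this word.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}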

MapReduce Job Class

The Job class is used to configure the job and submit it. It also controls the execution and queries the state. Once the
job is submitted, the set methods throw IllegalStateException.
Methods of Job Class

Methods | Description
Counters getCounters() | This method is used to get the counters for the job.
long getFinishTime() | This method is used to get the finish time for the job.
Job getInstance() | This method is used to generate a new Job without any cluster.
Job getInstance(Configuration conf) | This method is used to generate a new Job without any cluster, with the provided configuration.
Job getInstance(Configuration conf, String jobName) | This method is used to generate a new Job without any cluster, with the provided configuration and job name.
String getJobFile() | This method is used to get the path of the submitted job configuration.
String getJobName() | This method is used to get the user-specified job name.
JobPriority getPriority() | This method is used to get the scheduling priority of the job.
void setJarByClass(Class<?> c) | This method is used to set the job's jar by providing a class contained in that jar.
void setJobName(String name) | This method is used to set the user-specified job name.
void setMapOutputKeyClass(Class<?> class) | This method is used to set the key class for the map output data.
void setMapOutputValueClass(Class<?> class) | This method is used to set the value class for the map output data.
void setMapperClass(Class<? extends Mapper> class) | This method is used to set the Mapper for the job.
void setNumReduceTasks(int tasks) | This method is used to set the number of reduce tasks for the job.
void setReducerClass(Class<? extends Reducer> class) | This method is used to set the Reducer for the job.
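
To show these Job methods in use, here is a minimal driver sketch that wires together the TokenizerMapper and IntSumReducer sketched in the previous sections. The input and output HDFS paths are taken from the command line; this is a standard WordCount-style driver rather than code from this tutorial.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Create and configure the job using the Job class methods described above.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths are passed as command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}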
