Unit3 Notes

Data warehouses serve as centralized repositories for storing and analyzing data from various sources to support informed decision-making and business intelligence activities. They are characterized by being subject-oriented, integrated, non-volatile, and time-variant, differing from traditional databases in their focus on extensive analytical queries rather than transactional purposes. Data marts, which are subsets of data warehouses, provide targeted data for specific business functions and can be implemented through dependent or independent approaches.


UNIT 3

Data warehouses serve as a central repository for storing and analyzing information to support better-informed decisions. An organization's data warehouse receives data from a variety of sources, typically on a regular basis, including transactional systems, relational databases, and other sources.

A data warehouse is a centralized storage system that allows data to be stored, analyzed, and interpreted in order to facilitate better decision-making. Transactional systems, relational databases, and other sources feed data into the warehouse on a regular basis.

A data warehouse is a type of data management system that facilitates and supports business intelligence (BI) activities, specifically analysis. Data warehouses are primarily designed to facilitate queries and analyses and usually contain large amounts of historical data.

A data warehouse can be defined as a collection of organizational data and information extracted from operational sources and external data sources. The data is periodically pulled from various internal applications like sales, marketing, and finance; customer-interface applications; as well as external partner systems. This data is then made available for decision-makers to access and analyze.

Key Characteristics of Data Warehouse

The main characteristics of a data warehouse are as follows:

 Subject-Oriented

A data warehouse is subject-oriented because it provides information organized around subjects rather than around the ongoing operations of a business. Such subjects may be sales, promotion, inventory, etc. For example, if you want to analyze your company’s sales data, you need to build a data warehouse that concentrates on sales. Such a warehouse would answer valuable questions like ‘who was your best customer last year?’ or ‘who is likely to be your best customer in the coming year?’

 Integrated

A data warehouse is developed by integrating data from varied sources into a consistent format. The data must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming, format, and coding. This facilitates effective data analysis.

 Non-Volatile

Data once entered into a data warehouse must remain unchanged. All
data is read-only. Previous data is not erased when current data is
entered. This helps you to analyze what has happened and when.
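
The non-volatile, append-only behaviour can be sketched in a few lines of Python (an illustrative toy model, not a real warehouse API; the customer records and dates are made up). Loads only ever append time-stamped rows, so every earlier version of a record stays available for analysis:

```python
# Toy append-only store: rows are only ever added, never updated or erased.
warehouse = []

def load_snapshot(rows, as_of):
    """Append rows stamped with a load date; never mutate existing records."""
    for row in rows:
        warehouse.append({**row, "as_of": as_of})

load_snapshot([{"customer": "A", "balance": 100}], as_of="2024-01-31")
load_snapshot([{"customer": "A", "balance": 120}], as_of="2024-02-29")

# Both versions coexist, so we can analyze what happened and when.
history = [r["balance"] for r in warehouse if r["customer"] == "A"]
```

Because old rows are never overwritten, `history` preserves the full sequence of values over time rather than only the latest state.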

 Time-Variant

The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly. Time variance in a data warehouse is exhibited, for example, in the primary key, which must contain an element of time such as the day, week, or month.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same thing. The main difference is that in a database, data is collected for transactional purposes, whereas in a data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed by large analytical queries.

A data warehouse is an example of an OLAP system, an online database query-answering system. OLTP, by contrast, is an online database-modifying system; an ATM is a typical example.

A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, which may be distributed to support business needs. Data marts are analytical data stores designed to focus on particular business functions for a specific community within an organization. Data marts are usually derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created from the union of organizational data marts.

The fundamental use of a data mart is in Business Intelligence (BI) applications. BI is used to gather, store, access, and analyze data. Data marts can be used by smaller businesses to utilize the data they have accumulated, since a data mart is less expensive to implement than a full data warehouse.

Reasons for creating a data mart

o Provides a collective view of data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential users are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts


There are mainly two approaches to designing data marts. These
approaches are

o Dependent Data Marts


o Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical or physical subset of a larger data warehouse. According to this technique, the data marts are treated as subsets of a data warehouse: first a data warehouse is created, from which various data marts can then be created. These data marts depend on the data warehouse and extract the data they need from it. Because the data warehouse creates the data marts, there is no need for a separate data mart integration step. This is also known as a top-down approach.

Independent Data Marts


The second approach is independent data marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed from these multiple independent data marts. Because all the data marts are designed independently, integration of the data marts is required. This is also termed a bottom-up approach, as the data marts are integrated to develop the data warehouse.

Other than these two categories, one more type exists that is called
"Hybrid Data Marts."

Hybrid Data Marts


A hybrid data mart allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.
Steps in Implementing a Data Mart
The significant steps in implementing a data mart are to design the
schema, construct the physical storage, populate the data mart with data
from source systems, access it to make informed decisions and manage it
over time. So, the steps are:

Designing
The design step is the first in the data mart process. This phase covers all
of the functions from initiating the request for a data mart through
gathering data about the requirements and developing the logical and
physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart

Difference between Data Warehouse and Data Mart

Data Warehouse | Data Mart

A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.

It may hold multiple subject areas. | It holds only one subject area, for example, Finance or Sales.

It holds very detailed information. | It may hold more summarized data.

It works to integrate all data sources. | It concentrates on integrating data from a given subject area or set of source systems.

In data warehousing, the fact constellation schema is used. | In a data mart, star and snowflake schemas are used.

It is a centralized system. | It is a decentralized system.

Data warehousing is data-oriented. | A data mart is project-oriented.

OLAP

OLAP (online analytical processing) is software for performing multidimensional analysis at high speeds on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by its users.

OLAP implements multidimensional analysis of business information and supports complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision-making.

Characteristics of OLAP
The FASMI test summarizes the characteristics of OLAP systems; the term is derived from the first letters of the five characteristics:

Fast
The system should deliver most responses to the user within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

Analysis
The system should be able to cope with any business logic and statistical analysis that is relevant for the application and the user, while keeping it easy enough for the target user. Although some pre-programming may be needed, the user must be able to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program. Products that do not allow adequate end-user-oriented calculation flexibility fail this test.

Share
The system should implement all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the key requirement. An OLAP system must provide a multidimensional conceptual view of the data, including full support for hierarchies, as this is certainly the most logical way to analyze businesses and organizations.

Information
The system should be able to hold all the data needed by the applications.
Data sparsity should be handled in an efficient manner.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. This supports slice-and-dice operations.
2. Multi-user support: Since OLAP systems are shared, an OLAP implementation should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-ends. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP distinguishes between zero values and missing values so that aggregates are computed correctly.
7. An OLAP system should ignore missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for users.
9. OLAP allows users to drill down for greater detail or roll up to aggregate metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
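
The roll-up and drill-down operations from point 9 can be sketched in plain Python (a toy illustration; the sales records and the city-to-region geography hierarchy are made up). Aggregating city-level facts up to the region level is a roll-up, while querying at the city level again is a drill-down:

```python
from collections import defaultdict

# Hypothetical sales facts at the city level of a geography hierarchy.
sales = [
    {"region": "North", "city": "Oslo",   "amount": 100},
    {"region": "North", "city": "Bergen", "amount": 50},
    {"region": "South", "city": "Rome",   "amount": 80},
]

def aggregate(facts, level):
    """Sum the amount measure at the requested level of the hierarchy."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[level]] += fact["amount"]
    return dict(totals)

by_region = aggregate(sales, "region")  # roll-up: city -> region
by_city = aggregate(sales, "city")      # drill-down: back to the finer grain
```

A real OLAP engine precomputes or indexes these aggregates across many dimensions; the point here is only that roll-up and drill-down are movements up and down one dimension's hierarchy.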

Benefits of OLAP
OLAP holds several benefits for businesses:

1. OLAP helps managers in decision-making through the multidimensional views of data that it efficiently provides, thus increasing their productivity.
2. OLAP applications are self-sufficient owing to the inherent flexibility they provide over organized databases.
3. It facilitates simulation of business models and problems through extensive management of analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a reduction in the application backlog, faster data retrieval, and a reduction in query load.

Motivations for using OLAP


1) Understanding and improving sales: For enterprises that have many products and use a number of channels for selling them, OLAP can help in finding the most suitable products and the most popular channels. In some cases, it may even be possible to find the most profitable customers. For example, consider the telecommunication industry with only one product, communication minutes: a large amount of data results if a company wants to analyze the sales of the product for every hour of the day (24 values), the difference between weekdays and weekends (2 values), and calls split across 50 regions.

2) Understanding and decreasing the costs of doing business: Improving sales is one way of improving a business; the other is to analyze costs and control them as much as possible without affecting sales. OLAP can assist in analyzing the costs related to sales. In some cases, it may also be possible to identify expenditures that produce a high return on investment (ROI). For example, recruiting a top salesperson may involve high costs, but the revenue generated by the salesperson may justify the investment.
OLTP (On-Line Transaction Processing) is characterized by a large number of short online transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional databases is the entity model (usually 3NF).

OLAP (On-Line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data, stored in multidimensional schemas (usually a star schema).

Difference | OLTP | OLAP

Data source | Operational data; OLTP systems are the original data sources. | Consolidated data; OLAP data comes from the various OLTP databases.

Use | Controlling and running basic business tasks. | Planning, problem-solving, and supporting business decisions.

Queries | Standard and straightforward queries. | Complex queries.

Speed of processing | Fast. | Complex queries can take a long time to process.

Backup and recovery | Frequent complete backups along with incremental backups. | No regular backups; instead, OLAP data is reloaded from the OLTP databases as a recovery method.

Process | Online transactional system. | Online analysis and data retrieval process.

Method used | Uses a traditional DBMS. | Uses a data warehouse.

Quality of data | Disorganised data. | Detailed organisation of data.

Nature of audience | Customer-oriented process. | Market-oriented process.

Database design | Application-oriented design. | Subject-oriented design.

Types of users | Clerks, online shoppers, etc. | Data knowledge workers such as managers and CEOs.

Productivity | Enhances the productivity of the user. | Enhances the productivity of business analysts.

Updates | The user starts the updates, which are short and fast. | Data is refreshed regularly with long, scheduled batch jobs.
What are the Benefits of OLTP and OLAP methods?
Advantages of OLTP:

 Day-to-day transactions in an organisation are easily regulated.
 Increases the organisation’s customer base as it simplifies individual processes.

Advantages of OLAP:

 Businesses can use a single multipurpose platform for planning,


budgeting, forecasting, and analysing.
 Information and calculations are very consistent in OLAP.
 Adequate security measures are taken to protect confidential data.

What are the Drawbacks of OLTP and OLAP methods?


Disadvantages of OLTP:

 Hardware failures in the system can severely impact online


transactions.
 These systems can become complicated as multiple users can access
and modify data at the same time.

Disadvantages of OLAP:

 Traditional tools in this system need complicated modelling


procedures. Therefore, maintenance is dependent on IT professionals.
 Collaboration between different departments might not always be
possible.
Star Schema:

Star schema is a type of multidimensional model used for data warehouses. A star schema contains fact tables and dimension tables, and uses fewer foreign-key joins. The schema forms a star shape, with a fact table at the center and dimension tables around it.

Snowflake Schema:

Snowflake schema is also a type of multidimensional model used for data warehouses. A snowflake schema contains fact tables, dimension tables, and sub-dimension tables. The schema forms a snowflake shape with fact tables, dimension tables, and sub-dimension tables.
S.NO | Star Schema | Snowflake Schema

1. | A star schema contains fact tables and dimension tables. | A snowflake schema contains fact tables, dimension tables, and sub-dimension tables.

2. | Star schema is a top-down model. | Snowflake schema is a bottom-up model.

3. | Star schema uses more space. | It uses less space.

4. | It takes less time to execute queries. | It takes more time than the star schema to execute queries.

5. | In the star schema, normalization is not used. | Here, both normalization and denormalization are used.

6. | Its design is very simple. | Its design is complex.

7. | The query complexity of the star schema is low. | The query complexity of the snowflake schema is higher than that of the star schema.

8. | It is very simple to understand. | It is more difficult to understand.

9. | It has fewer foreign keys. | It has more foreign keys.

10. | It has high data redundancy. | It has low data redundancy.
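
The extra join that the snowflake schema introduces can be sketched in Python (a toy model with made-up product data, using dictionaries in place of tables). In the star schema the product dimension is denormalized, so one lookup suffices; in the snowflake schema the category attribute is normalized out into a sub-dimension, requiring a second lookup:

```python
# Star schema: one denormalized product dimension (category stored inline).
dim_product_star = {1: {"name": "Pen", "category": "Stationery"}}

# Snowflake schema: the category attribute is normalized into a sub-dimension.
dim_category = {10: {"category": "Stationery"}}
dim_product_snow = {1: {"name": "Pen", "category_id": 10}}

fact_sales = [{"product_id": 1, "amount": 5}]

def category_star(fact):
    # Star: a single join from fact to the product dimension.
    return dim_product_star[fact["product_id"]]["category"]

def category_snow(fact):
    # Snowflake: join fact -> product, then product -> category sub-dimension.
    product = dim_product_snow[fact["product_id"]]
    return dim_category[product["category_id"]]["category"]
```

Both return the same answer; the trade-off is that the snowflake form removes the redundant category text from every product row at the cost of an extra join per query.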

ETL Process
ETL, which stands for extract, transform, and load, is the process data
engineers use to extract data from different sources, transform the
data into a usable and trusted resource, and load that data into the
systems end-users can access and use downstream to solve business
problems.

How Does ETL Work?


Extract
The first step of this process is extracting data from the source systems, which are usually heterogeneous: business systems, APIs, sensor data, marketing tools, transaction databases, and others. Some of these data types are likely to be the structured outputs of widely used systems, while others are semi-structured, such as JSON server logs. There are three data extraction methods:

1. Partial extraction – The easiest way to obtain the data is when the source system notifies you whenever a record has been changed.
2. Partial extraction (with update notification) – Not all systems can provide a notification when an update has taken place; however, they can point to the records that have been changed and provide an extract of just those records.
3. Full extraction – Some systems cannot identify which data has been changed at all. In this case, a full extract is the only way to get the data out of the system. This method requires keeping a copy of the last extract in the same format so you can identify the changes that have been made.
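
The comparison step of a full extraction can be sketched in Python (an illustrative helper; the records are made up and keyed by an id). By diffing the current full extract against the saved copy of the previous one, each record can be classified as inserted, updated, or deleted:

```python
def detect_changes(previous, current):
    """Compare two full extracts (dicts keyed by record id) and
    return the ids that were inserted, updated, or deleted."""
    inserted = [k for k in current if k not in previous]
    deleted = [k for k in previous if k not in current]
    updated = [k for k in current
               if k in previous and current[k] != previous[k]]
    return inserted, updated, deleted

# Previous extract (saved copy) vs. the current full extract.
prev = {1: "Alice", 2: "Bob"}
curr = {1: "Alice", 2: "Bobby", 3: "Carol"}

inserted, updated, deleted = detect_changes(prev, curr)
```

Only the classified changes then need to be carried forward into the transform step, rather than the entire extract.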

Transform
The second step consists of transforming the raw data that has been extracted from the sources into a format that can be used by different applications. In this stage, data gets cleansed, mapped, and transformed, often to a specific schema, so it meets operational needs. This process entails several types of transformations that ensure the quality and integrity of the data. Data is not usually loaded directly into the target data source; instead, it is common to upload it into a staging database first. This step ensures a quick rollback in case something does not go as planned. During this stage, you also have the possibility to generate audit reports for regulatory compliance, or to diagnose and repair any data issues.
Load
Finally, the load function is the process of writing converted data
from a staging area to a target database, which may or may not
have previously existed. Depending on the requirements of the
application, this process may be either quite simple or intricate.
Each of these steps can be done with ETL tools or custom code.
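
The three steps above can be sketched as a minimal ETL pipeline in Python (an illustrative toy; the source rows and the target schema are made up). Extract produces raw rows, transform cleanses names and casts amounts, and load writes the result into the target store:

```python
def extract():
    # Hypothetical raw rows, e.g. pulled from an API or transactional DB.
    return [{"name": " alice ", "amount": "10"},
            {"name": "BOB", "amount": "5"}]

def transform(rows):
    # Cleanse and map to the target schema: trim and normalize the name,
    # cast the amount from string to integer.
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, target):
    # Write the converted rows into the target store (a plain list here).
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

A production pipeline would add a staging area between transform and load, error handling, and auditing, but the extract → transform → load flow is the same.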
