Data Warehouse

CONTENTS

• Database and Data Warehousing


• History of data warehousing
• Evolution in organization use of data warehouses
• Data Warehouse Architecture
• Benefits of data warehousing
• Strategic uses of data warehousing
• Disadvantages of data warehouses
• Data mart
• Data mining
• Data mining for decision support
• Text mining
• OLAP
• Data warehousing integration
• Business intelligence
Database and Data Warehousing….

• The Difference…
– A DWH constitutes the entire information base for all time.
– A database constitutes real-time information.
– A DWH supports data mining and business intelligence.
– A database is used to run the business.
– A DWH shows how to run the business.
A producer wants to know….

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Data, Data everywhere
yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly
documented

• I can’t use the data I found


– results are unexpected
– data needs to be transformed from
one form to another
What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
Data Warehousing -- a process

• It is a relational or multidimensional database management system designed to support management decision making.
• A data warehouse is a copy of transaction data specifically structured for querying and reporting.
• It is a technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.
Data warehousing is …
• Subject Oriented: Data that gives information about a particular subject
instead of about a company's ongoing operations.
• Integrated: Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
• Time-variant: All data in the data warehouse is identified with a particular
time period.
• Non-volatile: Data is stable in a data warehouse. More data is added but data
is never removed. This enables management to gain a consistent picture of the
business.
• Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database.
• Common accessing systems of data warehousing include queries, analysis
and reporting.
• Because data warehousing creates one database in the end, the number of
sources can be anything you want it to be, provided that the system can
handle the volume, of course.
• The final result, however, is homogeneous data, which can be more easily
manipulated.
History of data warehousing
• The concept of data warehousing dates back to the late 1980s
when IBM researchers Barry Devlin and Paul Murphy
developed the "business data warehouse".
• 1960s - General Mills and Dartmouth College, in a joint
research project, develop the terms dimensions and facts.
• 1970s - ACNielsen and IRI provide dimensional data marts for
retail sales.
• 1983 – Teradata introduces a database management system
specifically designed for decision support.
• 1988 - Barry Devlin and Paul Murphy publish the article An
architecture for a business and information systems in IBM
Systems Journal where they introduce the term "business data
warehouse".
History Leading to Data Warehousing
•Improvement in database technologies,
especially relational DBMSs
•Advances in computer hardware, including mass
storage and parallel architectures
•Emergence of end-user computing with powerful
interfaces and tools
•Advances in middleware, enabling
heterogeneous database connectivity
•Recognition of difference between operational
and informational systems

Need for Data Warehousing
•Integrated, company-wide view of high-quality information (from
disparate databases)

•Separation of operational and informational systems and data (for improved performance)
Issues with Company-Wide View
Inconsistent key structures
Synonyms
Free-form vs. structured fields
Inconsistent data values
Missing data

See figure for example

Figure: Examples of heterogeneous data
Organizational Trends Motivating Data
Warehouses
•No single system of records
•Multiple systems not synchronized
•Organizational need to analyze activities
in a balanced way
•Customer relationship management
•Supplier relationship management

Separating Operational and
Informational Systems
Operational system – a system that is used to run a
business in real time, based on current data; also
called a system of record

Informational system – a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
Data Warehouse Architectures
•Independent Data Mart
•Dependent Data Mart and Operational
Data Store
•Logical Data Mart and Real-Time Data
Warehouse
•Three-Layer architecture

All involve some form of extract, transform and load (ETL)

Figure 9-2: Independent data mart data warehousing architecture. Data marts are mini-warehouses, limited in scope; a separate ETL process feeds each independent data mart, and data access is more complex because of the multiple data marts.
Figure 9-3: Dependent data mart with operational data store: a three-level architecture. The ODS provides an option for obtaining current data; a single ETL process feeds the enterprise data warehouse (EDW), dependent data marts are loaded from the EDW, and data access is simpler.
Figure 9-4: Logical data mart and real-time warehouse architecture. The ODS and the data warehouse are one and the same; ETL is near real-time, and data marts are not separate databases but logical views of the data warehouse, so new data marts are easier to create.
Figure 9-5: Three-layer data architecture for a data warehouse.
Data Characteristics: Status vs. Event Data
Figure 9-6: Example of a DBMS log entry, showing status and event records. An event is a database action (create/update/delete) that results from a transaction.
Data Characteristics: Status vs. Event Data
Figure 9-7: Transient operational data. With transient data, changes to existing records are written over previous records, thus destroying the previous data content.
Figure 9-8: Periodic warehouse data. Periodic data are never physically altered or deleted once they have been added to the store.
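To make the transient-versus-periodic distinction concrete, here is a minimal sketch using pandas; the customer, balances, and dates are invented for illustration. The operational table simply overwrites the old value, while the warehouse table keeps a new snapshot row for every change.

```python
import pandas as pd

# Transient operational data: an update overwrites the previous value,
# so the old balance is lost.
operational = pd.DataFrame({"customer_id": [101], "balance": [500.0]})
operational.loc[operational["customer_id"] == 101, "balance"] = 350.0

# Periodic warehouse data: each change is appended as a new timestamped
# snapshot row, so history is preserved and never physically altered.
snapshots = [
    pd.DataFrame({"customer_id": [101], "balance": [500.0], "as_of": ["2024-01-01"]}),
    pd.DataFrame({"customer_id": [101], "balance": [350.0], "as_of": ["2024-02-01"]}),
]
warehouse = pd.concat(snapshots, ignore_index=True)

print(operational)  # one row: only the latest balance survives
print(warehouse)    # two rows: the full history, one row per snapshot date
```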
Other Data Warehouse Changes
New descriptive attributes
New business activity attributes
New classes of descriptive attributes
Descriptive attributes become more refined
Descriptive data are related to one another
New source of data

OLTP – Online Transaction Processing
• Special data organization, access methods and
implementation methods are needed to support data
warehouse queries (typically multidimensional
queries)
• OLTP systems are tuned for known transactions and
workloads while workload is not known a priori in a
data warehouse
– e.g., average amount spent on phone calls between
9AM-5PM in Pune during the month of December
OLTP vs Data Warehouse
• OLTP: application oriented; Warehouse (DSS): subject oriented
• OLTP: used to run the business; Warehouse: used to analyze the business
• OLTP: detailed data; Warehouse: summarized and refined data
• OLTP: current, up-to-date data; Warehouse: snapshot data
• OLTP: isolated data; Warehouse: integrated data
• OLTP: clerical user; Warehouse: knowledge user (manager)
• OLTP: few records accessed at a time (tens); Warehouse: large volumes accessed at a time (millions)
• OLTP: read/update access; Warehouse: mostly read (batch update)
• OLTP: no data redundancy; Warehouse: redundancy present
• OLTP: database size 100 MB – 100 GB; Warehouse: database size 100 GB – a few terabytes
• OLTP: transaction throughput is the performance metric; Warehouse: query throughput is the performance metric
• OLTP: thousands of users; Warehouse: hundreds of users
• OLTP: managed in entirety; Warehouse: managed by subsets
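The contrast can be illustrated with a small sketch using sqlite3 from the Python standard library; the table and values are hypothetical. An OLTP-style statement touches one current record by key, while a warehouse-style query scans and aggregates many historical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, region TEXT, amount REAL, sale_date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "North", 120.0, "2024-01-05"),
     (2, "South", 80.0, "2024-01-07"),
     (3, "North", 200.0, "2024-02-02")],
)

# OLTP-style access: read or update a single current record by key.
conn.execute("UPDATE sales SET amount = 130.0 WHERE sale_id = 1")

# Warehouse-style access: read-mostly aggregation over volumes of history.
for region, total in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```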
To summarize ...
• OLTP Systems are
used to “run” a business

• The Data Warehouse helps


to “optimize” the business
Evolution in organizational use of data warehouses
Organizations generally start off with relatively simple use of data
warehousing. Over time, more sophisticated use of data warehousing evolves.
The following general stages of use of the data warehouse can be
distinguished:
• Off line Operational Database
–Data warehouses in this initial stage are developed by simply copying the
data off an operational system to another server where the processing load
of reporting against the copied data does not impact the operational
system's performance.
• Off line Data Warehouse
–Data warehouses at this stage are updated from data in the operational
systems on a regular basis and the data warehouse data is stored in a data
structure designed to facilitate reporting.
• Real Time Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction (e.g. an order or a delivery or a booking.)
• Integrated Data Warehouse
–Data warehouses at this stage are updated every time an operational
system performs a transaction. The data warehouses then generate
transactions that are passed back into the operational systems.
Data Warehouse Architecture

(Diagram: data flows from multiple sources through an integration layer into the warehouse and its metadata, and is accessed by clients through a query and analysis layer.)
• Data is selected from various sources and then integrated and stored in a single, common format.
• Data warehouses contain current detailed data, historical detailed data,
lightly and highly summarized data, and metadata.
• Current and historical data are voluminous because they are stored at the
highest level of detail.
• Lightly and highly summarized data are necessary to save processing time
when users request them and are readily accessible.
Metadata are “data about data”. They are important for designing,
constructing, retrieving, and controlling the warehouse data.

Technical metadata include where the data come from, how the data were
changed, how the data are organized, how the data are stored, who owns
the data, who is responsible for the data and how to contact them, who
can access the data , and the date of last update.

Business metadata include what data are available, where the data are, what
the data mean, how to access the data, predefined reports and queries,
and how current the data are.
Business advantages
• It provides business users with a “customer-centric” view of the company’s heterogeneous data by helping to integrate data from sales, service, manufacturing and distribution, and other customer-related business systems.
• It provides added value to the company’s customers by allowing them to access better information when data warehousing is coupled with internet technology.
• It consolidates data about individual customers and provides a repository of all customer contacts for segmentation modeling, customer retention planning, and cross sales analysis.
• It removes barriers among functional areas by offering a way to reconcile views from multiple areas, thus providing a look at activities that cross functional lines.
• It reports on trends across multidivisional, multinational operating units, including trends or relationships in areas such as merchandising, production planning etc.
Strategic uses of data warehousing
• Airline (operations; marketing): crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions.
• Banking (product development; operations; marketing): customer service, trend analysis, product and service promotions, reduction of IS expenses.
• Credit card (product development; marketing): customer service, new information services, fraud detection.
• Health care (operations): reduction of operational expenses.
• Investment and insurance (product development; operations; marketing): risk management, market movements analysis, customer tendencies analysis, portfolio management.
• Retail chain (distribution; marketing): trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel.
• Telecommunications (product development; operations; marketing): new product and service promotions, reduction of IS budget, profitability analysis.
• Personal care (distribution; marketing): distribution decisions, product promotions, sales decisions, pricing policy.
• Public sector (operations): intelligence gathering.
Disadvantages of data warehouses
• Data warehouses are not the optimal environment for
unstructured data.
• Because data must be extracted, transformed and loaded into the
warehouse, there is an element of latency in data warehouse
data.
• Over their life, data warehouses can have high costs.
Maintenance costs are high.
• Data warehouses can get outdated relatively quickly. There is a
cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality may be
developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in the
operational systems and vice versa.
Data Marts
• A data mart is a scaled down version of a data warehouse that focuses on a
particular subject area.
• A data mart is a subset of an organizational data store, usually oriented to a
specific purpose or major data subject, that may be distributed to support
business needs.
• Data marts are analytical data stores designed to focus on specific business
functions for a specific community within an organization.
• Usually designed to support the unique business requirements of a specified
department or business process
• Implemented as the first step in proving the usefulness of the technologies to
solve business problems

Reasons for creating a data mart


• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data warehouse
From the Data Warehouse to Data Marts

(Diagram: the organizationally structured data warehouse holds normalized, detailed data with more history; departmentally and individually structured data marts sit above it with less structure and less history, moving from data toward information.)
Characteristics of the Departmental Data Mart
• Small
• Flexible
• Customized by department
• OLAP
• Its source is the departmentally structured data warehouse
Data Mining
• Data Mining is the process of extracting information from the
company's various databases and re-organizing it for purposes
other than what the databases were originally intended for.
• It provides a means of extracting previously unknown, predictive
information from the base of accessible data in data warehouses.
• Data mining process is different for different organizations
depending upon the nature of the data and organization.
• Data mining tools use sophisticated, automated algorithms to
discover hidden patterns, correlations, and relationships among
organizational data.
• Data mining tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
• For example, in targeted marketing, data mining can use data on past promotional mailings to identify the targets most likely to maximize the return on the company’s investment in future mailings.
Functions
• Classification: infers the defining characteristics of a certain group.
• Clustering: identifies groups of items that share a particular characteristic.
• Association: identifies relationships between events that occur at one time.
• Sequencing: similar to association, except that the relationship exists over a period of time.
• Forecasting: estimates future values based on patterns within large sets of data.
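As an illustration of the clustering and classification functions above, here is a minimal sketch using scikit-learn; the customer features, values, and promotion labels are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer features: [annual_spend, visits_per_month]
X = np.array([[200, 1], [250, 2], [1200, 8], [1100, 9], [300, 1], [1300, 10]])

# Clustering: group customers that share similar behaviour.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", clusters)

# Classification: infer the defining characteristics of a known group
# (here, whether a customer responded to a past promotion).
responded = np.array([0, 0, 1, 1, 0, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, responded)
print("predicted response for a new customer:", model.predict([[1000, 7]]))
```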
Characteristics
• Data mining tools are needed to extract the buried information “ore”.
• The “miner” is often an end user, empowered by “data drills” and other
power query tools to ask ad hoc questions and get answers quickly, with
little or no programming skill.
• The data mining environment usually has a client/server architecture.
• Because of the large amounts of data, it is sometimes necessary to use
parallel processing for data mining.
• Data mining tools are easily combined with spreadsheets and other end
user software development tools, enabling the mined data to be analyzed
and processed quickly and easily.
• Data mining yields five types of information: associations, sequences, classifications, clusters and forecasting.
• “Striking it rich” often involves finding unexpected, valuable results.
Common data mining applications
• Market segmentation: identifies the common characteristics of customers who buy the same products from the company.
• Customer churn: predicts which customers are likely to leave your company and go to a competitor.
• Fraud detection: identifies which transactions are most likely to be fraudulent.
• Direct marketing: identifies which prospects should be included in a mailing list to obtain the highest response rate.
• Market basket analysis: understands what products or services are commonly purchased together.
• Trend analysis: reveals the difference between a typical customer this month versus last month.
• Science: simulates nuclear explosions; visualizes quantum physics.
• Entertainment: models customer flows in theme parks; analyzes safety of amusement park rides.
• Insurance and health care: predicts which customers will buy new policies; identifies behavior patterns that increase insurance risk; spots fraudulent claims.
• Manufacturing: optimizes product design, balancing manufacturability and safety; improves shop-floor scheduling and machine utilization.
• Medicine: ranks successful therapies for different illnesses; predicts drug efficacy; discovers new drugs and treatments.
• Oil and gas: analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery.
• Retailing: discerns buying-behavior patterns; predicts how customers will respond to marketing campaigns.
Data Mining works with the Data Warehouse
• Data warehousing provides the enterprise with a memory.
• Data mining provides the enterprise with intelligence.
Data mining for decision support
Two capabilities provide new business opportunities:

• Automated prediction of trends and behavior: for example, targeted marketing.

• Automated discovery of previously unknown patterns: for example, detecting fraudulent credit card transactions and identifying anomalous data representing data-entry keying errors.
Data mining tools
IT tools and techniques used by data miners include:
• Neural computing: a machine learning approach by which historical data can be examined for patterns.
• Intelligent agents: a promising approach to retrieving information from the internet or from intranet-based databases.
• Association analysis: an approach that uses a specialized set of algorithms that sort through large data sets and express statistical rules among items.
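A hedged sketch of the association-analysis idea: counting how often pairs of items appear together in the same transaction with plain Python. The baskets below are invented, and real tools use more sophisticated algorithms (such as Apriori) rather than raw pair counting.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

# Count co-occurring item pairs across all transactions.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs that appear together in at least half of the transactions.
min_support = len(baskets) / 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, "support =", count / len(baskets))
```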
Text mining
• Text mining is the application of data mining to unstructured or less structured text files.
• It operates with less structured information.
• It is frequently focused on document format rather than document content.
Text mining helps in….
• Find the “hidden” content of documents, including additional
useful relationships
• Relate documents across previously unnoticed divisions (e.g.:
discover that customers in two different product divisions
have the same characteristics)
• Group documents by common themes (e.g.: identify all the
customers of an insurance firm who have similar complaints
and cancel their policies)
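The grouping-by-theme idea can be sketched with scikit-learn's TF-IDF vectorizer and k-means clustering; the documents below are invented stand-ins for, say, customer complaint texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical, lightly structured documents (e.g. complaint texts).
documents = [
    "delivery was late and the package was damaged",
    "late delivery, box arrived damaged",
    "billing error on my monthly invoice",
    "I was charged twice on the invoice",
]

# Represent each document by TF-IDF term weights, then group by theme.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(documents, labels):
    print(label, doc)
```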
To summarize ...
• OLTP Systems are
used to “run” a business

• The Data Warehouse


helps to “optimize” the
business
OLAP
• Online Analytical Processing: a term coined by E. F. Codd in a 1994 paper commissioned by Arbor Software.
• Generally synonymous with earlier terms such as decision support, business intelligence, and executive information systems.
• OLAP = multidimensional database.
OLAP
• Online analytical processing refers to such end
user activities as DSS modelling using
spreadsheets and graphics that are done
online.
• OLAP involves many different data items in
complex relationships.
• Objective of OLAP is to analyze complex
relationships and look for patterns, trends and
exceptions.
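A small sketch of the multidimensional idea behind OLAP, using a pandas pivot table over invented sales data; slicing totals by product and region is the kind of "cube" view an OLAP tool serves interactively.

```python
import pandas as pd

# Hypothetical fact data with two dimensions (product, region) and one measure.
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A", "B"],
    "region":  ["East", "West", "East", "West", "East", "East"],
    "amount":  [100, 150, 200, 120, 90, 60],
})

# Roll the measure up along both dimensions, like a simple OLAP cube slice.
cube = sales.pivot_table(values="amount", index="product",
                         columns="region", aggfunc="sum", fill_value=0)
print(cube)
```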
OLAP Is FASMI
• Fast
• Analysis
• Shared
• Multidimensional
• Information
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools (e.g., brio.com, cognos.com, microstrategy.com), and it is possible to access an OLAP database from the web.
Data warehousing integration

(Diagram: data from source databases is organized and stored in the data warehouse; from there it is used directly, visualized, and analyzed (analytical processing, data mining) to generate knowledge, which, together with purchased knowledge, is stored in the organizational knowledge base and used by end users for decision making and other tasks such as CRM, DSS, and EIS.)
• Businesses run on information and the knowledge of
how to put that information to use.

• Knowledge is not readily available, it is continuously


constructed from data and/or information, in a
process that may not be simple or easy.

• The transformation of data into knowledge may be


accomplished in several ways

 Data collection from various sources stored in simple databases
• Data can be processed, organized, and stored in a data warehouse and then analyzed (e.g., by using analytical processing) by end users for decision support.
• Some of the data are converted to information prior to storage in the data
warehouse, and some of the data and/or information can be analyzed to
generate knowledge. For example, by using data mining, a process that
looks for unknown relationships and patterns in the data, knowledge
regarding the impact of advertising on a specific group of customers can be
generated.
• This generated knowledge is stored in an organizational knowledge base, a
repository of accumulated corporate knowledge and of purchased
knowledge.
• The knowledge in the knowledge base can be used to support
less experienced end users, or to support complex decision
making.

Both the data and the information, at various times during the process, and the
knowledge derived at the end of the process, may need to be presented to
users.
Data Warehouse for Decision Support
• Putting Information technology to help the knowledge worker
make faster and better decisions
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business
and make judgments
Business intelligence and data warehousing
Business Intelligence
• One ultimate use of the data gathered and processed in the
data life cycle is for business intelligence.
• Business intelligence generally involves the creation or use of
a data warehouse and/or data mart for storage of data, and
the use of front-end analytical tools such as Oracle’s Sales
Analyzer and Financial Analyzer or MicroStrategy’s Web.
• Such tools can be employed by end users to access data, ask
queries, request ad hoc (special) reports, examine scenarios,
create CRM activities, devise pricing strategies, and much
more.
How business intelligence works?
• The process starts with raw data which are usually kept in
corporate data bases. For example, a national retail chain
that sells everything from grills and patio furniture to plastic
utensils had data about inventory, customer information, data
about past promotions, and sales numbers in various
databases.
• Though all this information may be scattered across multiple
systems, and may seem unrelated, business intelligence
software can bring it together. This is done by using a
data warehouse.
• In the data warehouse (or mart) tables can be linked, and
data cubes are formed. For instance, inventory
information is linked to sales numbers and customer
databases, allowing for deep analysis of information.
• Using the business intelligence software the user can ask
queries, request ad-hoc reports, or conduct any other
analysis.
• For example, deep analysis can be carried out by
performing multilayer queries. Because all the databases
are linked, one can search for what products a store has too
much of, determine which of these products commonly sell
with popular items, based on previous sales. After
planning a promotion to move the excess stock along with
the popular products (by bundling them together, for
example), one can dig deeper to see where this promotion
would be most popular (and most profitable). The results of
the request can be reports, predictions, alerts, and/or
graphical presentations. These can be disseminated
to decision makers to help them in their decision-making
tasks.
More advanced applications of business
intelligence include outputs such as
• financial modeling
• budgeting
• resource allocation
• and competitive intelligence.
ELT and ETL
ELT
Extract, load, transform (ELT) is an alternative to 
extract, transform, load (ETL) used with data lake implementations. In
contrast to ETL, in ELT models the data is not transformed on entry to
the data lake, but stored in its original raw format. This enables faster
loading times. However, ELT requires sufficient processing power
within the data processing engine to carry out the transformation on
demand, to return the results in a timely manner. Since the data is not
processed on entry to the data lake, the query and schema do not
need to be defined a priori (although often the schema will be
available during load since many data sources are extracts from
databases or similar structured data systems and hence have an
associated schema). ELT is a data pipeline model.
The five critical differences of ETL vs ELT:

1. ETL is the Extract, Transform, and Load process for data. ELT is Extract,
Load, and Transform process for data.
2. In ETL, data moves from the data source to staging into the data
warehouse.
3. ELT leverages the data warehouse to do basic transformations. There is
no need for data staging.
4. ETL can help with data privacy and compliance by cleaning sensitive and
secure data even before loading into the data warehouse.
5. ETL can perform sophisticated data transformations and can be more
cost-effective than ELT. 

ETL vs ELT is easy to explain, but understanding the big picture—i.e., the


potential advantages of ETL vs. ELT—requires a deeper knowledge of how
ETL works with data warehouses, and how ELT works with data lakes.
Overview of ETL and ELT
The ETL and ELT are necessary in data science because information sources—whether
they use a structured SQL database or an unstructured NoSQL database—will rarely
use the same or compatible formats. Therefore, you have to clean, enrich, and
transform your data sources before integrating them into an analyzable whole. That
way, your business intelligence platform (like Looker, Chartio, Tableau, or QuickSight)
can understand the data to derive insights.
Regardless of whether it's ETL or ELT, the data transformation/integration process
involves the following three steps:
Extract: Extraction refers to pulling the source data from the original database or data
source. With ETL, the data goes into a temporary staging area. With ELT, it goes
immediately into a data lake storage system.
Transform: Transformation refers to the process of changing the structure of the
information, so it integrates with the target data system and the rest of the data in that
system.
Load: Loading refers to the process of depositing the information into a data storage
system.
As we’ve already established, ETL and ELT perform these steps in a different order. So
the question is: Should you transform your data before or after loading it into the data
repository? To answer that, you need to understand ETL and ELT separately.
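As a minimal, hypothetical illustration of the three steps in ETL order, here is a sketch in Python: rows are extracted from a source, transformed in a staging DataFrame, and then loaded into a warehouse table. The table names, columns, and the use of SQLite as both source and target are illustrative assumptions, not part of any specific product.

```python
import sqlite3
import pandas as pd

# Extract: pull source data (an in-memory SQLite table stands in for a real system).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, " alice ", 1250), (2, "BOB", 980)])
staging = pd.read_sql_query("SELECT * FROM orders", source)

# Transform: clean and reshape in the staging area before loading.
staging["customer"] = staging["customer"].str.strip().str.title()
staging["amount"] = staging["amount_cents"] / 100.0
staging = staging.drop(columns=["amount_cents"])

# Load: deposit the transformed data into the target warehouse table.
warehouse = sqlite3.connect(":memory:")
staging.to_sql("fact_orders", warehouse, index=False)
print(pd.read_sql_query("SELECT * FROM fact_orders", warehouse))
```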
ETL Process in Detail
1. Online Analytical Processing (OLAP) data warehouses—whether they are cloud-
based or onsite—need to work with relational SQL-based data structures. Therefore,
any data you load into your OLAP data warehouse must transform into a relational
format before the data warehouse can ingest it. As a part of this data transformation
process, data mapping may also be necessary to combine multiple data sources
based on correlating information (so your business intelligence platform can analyze
the information as a single, integrated unit).
2. That’s why data warehouses require ETL: the transformations must
happen before the loading. Here are some details to understand about ETL:
3. A continuous, ongoing process with a well-defined workflow: ETL first extracts data
from homogeneous or heterogeneous data sources. Next, it deposits the data into a
staging area. From there, the data goes through a cleansing process, gets enriched
and transformed, and is finally stored in a data warehouse.
4. ETL used to require detailed planning, supervision, and coding by data engineers and
developers: The old-school methods of hand-coding ETL transformations in data
warehousing took an enormous amount of time. Even after designing the process, it
took time for the data to go through each stage when updating the data warehouse
with new information.
5. Modern ETL solutions are easier and faster: Modern ETL, especially for cloud-based
data warehouses and cloud-based SaaS platforms, happens a lot faster. By using a 
cloud-based ETL solution, like Xplenty, users can instantly extract, transform, and
load their data from diverse sources without having programming expertise .
The Biggest Advantages of ETL
1. One of the biggest advantages of ETL over ELT relates to the pre-structured nature of the
OLAP data warehouse. After structuring/transforming the data, ETL allows for speedier, more
efficient, more stable data analysis. In contrast, ELT isn't ideal when the task requires speedy
analysis.
2. Another significant advantage for ETL over ELT relates to compliance. Often companies
regulated by GDPR, HIPAA, or CCPA need to remove, mask, or encrypt specific data fields to
protect the privacy of their clients. This could involve transforming emails to just the domain
or removing the last part of an IP address. ETL provides a more secure way to perform these
transformations because it transforms the data before putting it into the data warehouse.
3. In contrast, ELT requires you to upload the sensitive data first. That causes it to show up in
logs where SysAdmins can access it. Also, using ELT to transform data could inadvertently
violate the EU's GDPR compliance standards if non-compliant data leaves the EU when
uploading it to a data lake. Ultimately, ETL reduces the risk of compliance violations because
non-compliant data will never accidentally find its way into a data warehouse or reports. 
4. Finally, as a data integration/transformation process, ETL has existed for over two decades,
which means that there are many well-developed ETL tools and platforms available to assist
with data extraction, transformation, and loading needs. Also, data engineers skilled and
experienced at setting up ETL pipelines are easy to find.
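To illustrate the compliance advantage (point 2 above), here is a hedged sketch of masking a sensitive field during the transform step, before anything reaches the warehouse. The field names are hypothetical and the hashing choice is purely illustrative, not a compliance recommendation.

```python
import hashlib
import pandas as pd

# Hypothetical extracted records containing personal data.
staging = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["alice@example.com", "bob@example.com"],
    "amount": [19.99, 42.50],
})

# Transform before load: replace the raw email with a one-way hash so the
# warehouse never stores the sensitive value itself.
staging["email_hash"] = staging["email"].apply(
    lambda e: hashlib.sha256(e.encode("utf-8")).hexdigest()
)
staging = staging.drop(columns=["email"])
print(staging)
```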
ELT Process
ELT
ELT stands for "Extract, Load, and Transform." In this process, data gets leveraged via a data warehouse in order
to do basic transformations. That means there's no need for data staging. ELT uses cloud-based data
warehousing solutions for all different types of data - including structured, unstructured, semi-structured, and
even raw data types.
The ELT process also works hand-in-hand with data lakes. "Data Lakes" are special kinds of data stores that—
unlike OLAP data warehouses—accept any kind of structured or unstructured data. Data lakes don't require you
to transform your data before loading it. You can immediately load any type of raw information into a data lake,
no matter the format or lack thereof.
Data transformation is still necessary before analyzing the data with a business intelligence platform. However,
data cleansing, enrichment, and transformation occur after loading the data into the data lake. Here are some
details to understand about ELT and data lakes:
•A new technology made possible by high-speed, cloud-based servers: ELT is a relatively new technology,
made possible because of modern, cloud-based server technologies. Cloud-based data warehouses offer near-
endless storage capabilities and scalable processing power. For example, platforms like Amazon Redshift and
Google BigQuery make ELT pipelines possible because of their incredible processing capabilities.
•Ingest anything and everything as the data becomes available: ELT paired with a data lake lets you ingest an
ever-expanding pool of raw data immediately, as it becomes available. There's no requirement to transform the
data into a special format before saving it in the data lake.
•Transforms only the data you need: ELT transforms only the data required for a particular analysis. Although it
can slow down the process of analyzing the data, it offers more flexibility—because you can transform the data
in different ways on the fly to produce different types of metrics, forecasts, and reports. Conversely, with ETL,
the entire ETL pipeline—and the structure of the data in the OLAP warehouse—may require modification if the
previously-decided structure doesn't allow for a new type of analysis.
•ELT is less reliable than ETL: It’s important to note that the tools and systems of ELT are still evolving, so
they're not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL
provides more accurate insights when dealing with massive pools of data. Also, ELT developers who know how
to use ELT technology are more difficult to find than ETL developers.
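By contrast with the ETL sketch earlier, here is an ELT-ordered sketch: raw records are loaded into the target system as-is, and the transformation is expressed later as a query, only when an analysis needs it. SQLite stands in for a cloud warehouse or lake here, and the example assumes its JSON1 functions (json_extract) are available, which recent Python builds typically include.

```python
import sqlite3
import json

target = sqlite3.connect(":memory:")

# Extract + Load: store the raw, untransformed records immediately.
target.execute("CREATE TABLE raw_events (payload TEXT)")
raw = [
    {"user": "alice", "action": "purchase", "amount": "19.99"},
    {"user": "bob", "action": "view", "amount": None},
]
target.executemany("INSERT INTO raw_events VALUES (?)",
                   [(json.dumps(r),) for r in raw])

# Transform on demand: shape only the data a particular analysis needs.
query = """
    SELECT json_extract(payload, '$.user')                  AS user,
           CAST(json_extract(payload, '$.amount') AS REAL)  AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.action') = 'purchase'
"""
print(target.execute(query).fetchall())
```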
Advantages of ELT
The primary advantage of ELT over ETL relates to flexibility and ease of storing new,
unstructured data. With ELT, you can save any type of information—even if you don’t
have the time or ability to transform and structure it first—providing immediate access
to all of your information whenever you want it. Furthermore, you don’t have to
develop complex ETL processes before data ingestion, which saves developers and BI
analysts time when dealing with new information.
Here are some other benefits of ELT:
BENEFIT #1: High Speed
When it comes to data availability, ELT is the faster option. ELT allows for all of the data
to go into the system immediately, and from there, users can determine the exact data
they need to both transform and analyze.
BENEFIT #2: Low-Maintenance
With ELT, users generally won't have to have a "high-touch" maintenance plan. Since
ELT is cloud-based, it utilizes automated solutions instead of relying on the user to
initiate manual updates.
BENEFIT #3: Quicker Loading
Because the transformation step doesn't occur until after the data has entered the
warehouse, it cuts down on the time it takes to load the data into its final location.
There's no need to wait for the data to be cleansed or otherwise modified, and it only
needs to go into the target system once.
Contd…
Best Ways to Use ELT
As outlined in this article, ETL vs. ELT is an ongoing debate. So, in what circumstances
might you consider using ELT instead of ETL? Here are some of them:
USE CASE #1:
A company with massive amounts of data. ELT works best with huge quantities of data,
both structured and unstructured. As long as the target system is cloud-based, you will
likely be able to process those huge amounts of data more quickly with an ELT solution.
USE CASE #2:
An organization with the resources to handle the processing power needed. With ETL,
the majority of the processing takes place while the data is still in pipeline before it
gets to your warehouse. ELT does its work once the data has already arrived in the data
lake. Depending on what needs to be done to the data to suit your purposes, smaller
companies may not have the financial flexibility to develop or explore the extensive
technology needed to get the full benefits of a data lake.
USE CASE #3: 
A company that needs all its data in one place as soon as possible. When the
transformations take place at the end of the process, ELT prioritizes the speed of
transfer over almost everything else, which means that all data - good, bad, and
otherwise - ends up in the data lake for later transformation.
ETL vs ELT

• Adoption of the technology and availability of tools and experts:
ETL is a well-developed process used for over 20 years, and ETL experts are readily available.
ELT is a new technology, so it can be difficult to locate experts and more challenging to develop an ELT pipeline compared to an ETL pipeline.

• Availability of data in the system:
ETL only transforms and loads the data that you decide is necessary when creating the data warehouse and ETL process; therefore, only this information will be available.
ELT can load all data immediately, and users can determine later which data to transform and analyze.

• Can you add calculations?
With ETL, calculations will either replace existing columns, or you can append the dataset to push the calculation result to the target data system.
ELT adds calculated columns directly to the existing dataset.

• Compatible with data lakes?
ETL is not normally a solution for data lakes; it transforms data for integration with a structured relational data warehouse system.
ELT offers a pipeline for data lakes to ingest unstructured data, then transforms the data on an as-needed basis for analysis.

• Compliance:
ETL can redact and remove sensitive information before putting it into the data warehouse or cloud server, which makes it easier to satisfy GDPR, HIPAA, and CCPA compliance standards; it also protects data from hacks and inadvertent exposure.
ELT requires you to upload the data before redacting/removing sensitive information, which could violate GDPR, HIPAA, and CCPA standards; sensitive information will be more vulnerable to hacks and inadvertent exposure, and you could also violate some compliance standards if the cloud server is in another country.

• Data size vs. complexity of transformations:
ETL is best suited for dealing with smaller data sets that require complex transformations.
ELT is best when dealing with massive amounts of structured and unstructured data.

• Data warehousing support:
ETL works with cloud-based and onsite data warehouses; it requires a relational or structured data format.
ELT works with cloud-based data warehousing solutions to support structured, unstructured, semi-structured, and raw data types.

• Hardware requirements:
Cloud-based ETL platforms (like Xplenty) don't require special hardware; legacy, onsite ETL processes have extensive and costly hardware requirements, but they are not as popular today.
ELT processes are cloud-based and don't require special hardware.

• How are aggregations different?
With ETL, aggregation becomes more complicated as the dataset increases in size.
With ELT, as long as you have a powerful, cloud-based target data system, you can quickly process massive amounts of data.

• Implementation complexity:
ETL experts are easy to procure when building an ETL pipeline, and highly evolved ETL tools are available to facilitate this process.
As a new technology, the tools to implement an ELT solution are still evolving; moreover, experts with the requisite ELT knowledge and skills can be difficult to find.

• Maintenance requirement:
Automated, cloud-based ETL solutions, like Xplenty, require little maintenance; however, an onsite ETL solution that uses a physical server will require frequent maintenance.
ELT is cloud-based and generally incorporates automated solutions, so very little maintenance is required.

• Order of the extract, transform, load process:
With ETL, data transformations happen immediately after extraction within a staging area; after transformation, the data is loaded into the data warehouse.
With ELT, data is extracted, then loaded into the target data system first; only later is some of the data transformed on an "as-needed" basis for analytical purposes.

• Costs:
Cloud-based SaaS ETL platforms that bill with a pay-per-session pricing model (such as Xplenty) offer flexible plans that start at approximately $100 and go up from there, depending on usage requirements; meanwhile, an enterprise-level onsite ETL solution like Informatica could cost over $1 million a year.
Cloud-based SaaS ELT platforms that bill with a pay-per-session pricing model offer flexible plans that start at approximately $100 and go up from there. One cost advantage of ELT is that you can load and save your data without incurring large fees, then apply transformations as needed; this can save money on initial costs if you just want to load and save information. However, financially strapped businesses may never be able to afford the processing power required to reap the full benefits of their data lake.

• Transformation process:
With ETL, transformations happen within a staging area outside the data warehouse.
With ELT, transformations happen inside the data system itself, and no staging area is required.

• Unstructured data support:
ETL can be used to structure unstructured data, but it can't be used to pass unstructured data into the target system.
ELT is a solution for uploading unstructured data into a data lake and making unstructured data available to business intelligence systems.

• Waiting time to load information:
ETL load times are longer than ELT because it's a multi-stage process: (1) data loads into the staging area, (2) transformations take place, (3) data loads into the data warehouse; once the data is loaded, analysis of the information is faster than ELT.
With ELT, data loading happens faster because there's no waiting for transformations and the data only loads one time into the target data system; however, analysis of the information is slower than ETL.

• Waiting time to perform transformations:
With ETL, data transformations take more time initially because every piece of data requires transformation before loading; also, as the size of the data system increases, transformations take longer. However, once transformed and in the system, analysis happens quickly and efficiently.
With ELT, since transformations happen after loading, on an as-needed basis, and you transform only the data you need to analyze at the time, transformations happen a lot faster; however, the need to continually transform data slows down the total time it takes for querying/analysis.
To Summarize
1. ETL stands for Extract, Transform, and Load, while ELT stands for Extract, Load, and Transform.
2. In ETL, data flows from the data source to staging to the data destination.
3. ELT lets the data destination do the transformation, eliminating the need for data staging.
4. ETL can help with data privacy and compliance, cleansing sensitive data before loading into the data destination, while ELT is a simpler option for companies with minor data needs.
Difference between Star Schema and Snowflake Schema

Star Schema:
A star schema is a type of multidimensional model used for a data warehouse. A star schema contains fact tables and dimension tables, and fewer foreign-key joins are used. The schema forms a star with the fact table at the center and the dimension tables around it.
Characteristics of Star Schema:
1. Every dimension in a star schema is represented with only one dimension table.
2. The dimension table contains the set of attributes.
3. The dimension table is joined to the fact table using a foreign key.
4. The dimension tables are not joined to each other.
5. The fact table contains keys and measures.
6. The star schema is easy to understand and provides optimal disk usage.
7. The dimension tables are not normalized. For instance, a Country_ID column would not have a separate Country lookup table, as an OLTP design would.
8. The schema is widely supported by BI tools.
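A small sketch of a star-schema query in Python with pandas: a fact table holding foreign keys and measures is joined to two dimension tables, then a measure is rolled up by dimension attributes. The table and column names are illustrative.

```python
import pandas as pd

# Dimension tables: one table per dimension, carrying descriptive attributes.
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Grill", "Furniture"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "region": ["East", "West"]})

# Fact table: foreign keys into the dimensions plus numeric measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "store_id":   [10, 20, 10, 20],
    "amount":     [500.0, 300.0, 150.0, 700.0],
})

# Star-schema query: join the fact table to its dimensions, then aggregate.
result = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["category", "region"])["amount"].sum())
print(result)
```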
Snowflake Schema
A snowflake schema is also a type of multidimensional model used for a data warehouse. A snowflake schema contains fact tables, dimension tables, as well as sub-dimension tables. The schema forms a snowflake with fact tables, dimension tables, and sub-dimension tables.
Characteristics of Snowflake Schema:
•The main benefit of the snowflake schema is that it uses smaller disk space.
•It is easier to implement when a dimension is added to the schema.
•Query performance is reduced due to the multiple tables.
•The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance effort because of the additional lookup tables.
Difference
• A star schema contains the fact tables and the dimension tables, while a snowflake schema contains the fact tables, dimension tables, as well as sub-dimension tables.
• A star schema is a top-down model, while a snowflake schema is a bottom-up model.
• A star schema uses more space, while a snowflake schema uses less space.
• A star schema takes less time for the execution of queries, while a snowflake schema takes more time.
• In a star schema, normalization is not used, while in a snowflake schema both normalization and denormalization are used.
• A star schema's design is very simple, while a snowflake schema's design is complex.
• The query complexity of a star schema is low, while the query complexity of a snowflake schema is higher.
• A star schema is very simple to understand, while a snowflake schema is difficult to understand.
• A star schema has fewer foreign keys, while a snowflake schema has more foreign keys.
• A star schema has high data redundancy, while a snowflake schema has low data redundancy.

Predictive Modeling
Predictive modeling is a process that uses data and statistics to predict
outcomes with data models. These models can be used to predict anything
from sports outcomes and TV ratings to technological advances and corporate
earnings. Predictive modeling is also often referred to as: Predictive analytics.

https://www.youtube.com/watch?v=JOArz7wggkQ
Once data has been collected, the analyst selects and trains statistical models,
using historical data. Although it may be tempting to think that big data
makes predictive models more accurate, statistical theorems show that, after
a certain point, feeding more data into a predictive analytics model 
does not improve accuracy. The old saying "All models are wrong, but some
are useful" is often mentioned in terms of relying solely on predictive models
to determine future action.
In many use cases, including weather predictions, multiple models are run
simultaneously and results are aggregated to create one final prediction. This
approach is known as ensemble modeling. As additional data becomes
available, the statistical analysis will either be validated or revised.
Applications of predictive
modeling
Predictive modeling is often associated with meteorology and weather
forecasting, but it has many applications in business.
•One of the most common uses of predictive modeling is in online advertising
and marketing. Modelers use web surfers' historical data, running it through 
algorithms to determine what kinds of products users might be interested in
and what they are likely to click on.
•Bayesian spam filters use predictive modeling to identify the probability that
a given message is spam. In fraud detection, predictive modeling is used to
identify outliers in a data set that point toward fraudulent activity. And in
customer relationship management (CRM), predictive modeling is used to
target messaging to customers who are most likely to make a purchase. Other
applications include capacity planning, change management, disaster
recovery (DR), engineering, physical and digital security management and city
planning.
Modeling methods
Analyzing representative portions of the available information -- sampling --
can help speed development time on models and enable them to be deployed
more quickly.
Once data scientists gather this sample data, they must select the right
model. Linear regressions are among the simplest types of predictive models.
Linear models essentially take two variables that are correlated -- one
independent and the other dependent -- and plot one on the x-axis and one
on the y-axis. The model applies a best fit line to the resulting data points.
Data scientists can use this to predict future occurrences of the dependent
variable.
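A minimal scikit-learn sketch of the linear-regression case described above, fitting a best-fit line to made-up (x, y) pairs and predicting a future value of the dependent variable; the spend/sales framing and numbers are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical correlated variables: advertising spend (x) vs. sales (y).
X = np.array([[10], [20], [30], [40], [50]])  # independent variable
y = np.array([120, 190, 260, 335, 400])       # dependent variable

# Fit the best-fit line, then predict a future occurrence of y.
model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales at spend 60:", model.predict([[60]])[0])
```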
Some of the most popular methods include:
•Decision trees. Decision tree algorithms take data (mined, open source,
internal) and graph it out in branches to display the possible outcomes of
various decisions. Decision trees classify and predict response variables
based on past decisions, can be used with incomplete data sets, and are
easily explainable and accessible for novice data scientists.
•Time series analysis. This is a technique for the prediction of events through
a sequence of time. You can predict future events by analyzing past trends
and extrapolating from there.
•Logistic regression. This method is a statistical analysis method that aids in
data preparation. As more data is brought in, the algorithm's ability to sort
and classify it improves and therefore predictions can be made.
•The most complex area of predictive modeling is the neural network. This
type of machine learning model independently reviews large volumes of
labeled data in search of correlations between variables in the data. It can
detect even subtle correlations that only emerge after reviewing millions of
data points. The algorithm can then make inferences about unlabeled data
files that are similar in type to the data set it trained on. Neural networks
form the basis of many of today's examples of artificial intelligence (AI),
including image recognition, smart assistants and natural language generation
(NLG).
Common algorithms for predictive modeling
Random Forest. An algorithm that combines unrelated decision trees and uses classification and regression to
organize and label vast amounts of data.
Gradient boosted model. An algorithm that uses several decision trees, similar to Random Forest, but they are
more closely related. In this, each tree corrects the flaws of the previous one and builds a more accurate
picture.
K-Means. Groups data points in a similar fashion as a clustering model and is popular with personalized retail
offers. It can create personalized offers when dealing with a large group by seeking out similarities.
Prophet. A forecasting procedure especially effective when dealing with capacity planning. This algorithm deals
with time series data and is relatively flexible.
Predictive modeling tools
Before deploying a prediction model tool, it is crucial for your organization to ask questions. You must sort out
the following: clarify who will be running the software, what the use case will be for these tools, what other
tools will your predictive analytics be interacting with, as well as the budget.
Different tools have different data literacy requirements, are effective in different use cases, are best used with
similar software and can be expensive. Once your organization has clarity on these issues, comparing tools
becomes easier.
Sisense. A business intelligence software aimed at a variety of companies that offers a range of business
analytics features. This requires minimal IT background.
Oracle Crystal Ball. A spreadsheet-based application aimed at engineers, strategic planners and scientists
across industries that can be used for predictive modeling, forecasting as well as simulation and optimization.
IBM SPSS Predictive Analytics Enterprise. A business intelligence platform that supports open source
integration and features descriptive and predictive analysis as well as data preparation.
SAS Advanced Analytics. A program that offers algorithms that identify the likelihood of future outcomes and
can be used for data mining, forecasting and econometrics.
Predictive modeling considerations
One of the most frequently overlooked challenges of predictive modeling is acquiring the amount
of data needed and sorting out the right data to use when developing algorithms. By some
estimates, data scientists spend about 80% of their time on this step. Data collection is important
but limited in usefulness if this data is not properly managed and cleaned.
Once the data has been sorted, organizations must be careful to avoid overfitting. Over-testing on
training data can result in a model that appears very accurate but has memorized the key points
in the data set rather than learned how to generalize.
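The overfitting caution can be checked with a simple holdout split, as in this hedged sketch: a model that scores far better on the training data than on unseen test data has likely memorized rather than generalized. The synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: 200 samples, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training set.
model = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))  # noticeably lower => overfitting
```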
While predictive modeling is often considered to be primarily a mathematical problem, users
must plan for the technical and organizational barriers that might prevent them from getting the
data they need. Often, systems that store useful data are not connected directly to centralized 
data warehouses. Also, some lines of business may feel that the data they manage is their asset,
and they may not share it freely with data science teams.
Another potential stumbling block for predictive modeling initiatives is making sure projects
address real business challenges. Sometimes, data scientists discover correlations that seem
interesting at the time and build algorithms to investigate the correlation further. However, just
because they find something that is statistically significant doesn't mean it presents an insight the
business can use. Predictive modeling initiatives need to have a solid foundation of business
relevance.
