DataWareHouse Notes

Uploaded by

krish2021

Data warehouse:- A system that aggregates data from multiple sources into a central repository of structured data to support
analytics (OLAP, OnLine Analytical Processing). Supports ML, AI, data mining, OLAP and reporting.

Another def:- A subject/business-oriented (customer/supplier/product/sales etc.), integrated (data collected from
multiple data sources), time-variant (data collected over a period and tied to time) and non-volatile (existing data is not changed,
new data is only appended) collection of data that supports the management decision-making process.

DWHs are provided as appliances, on-cloud, on-premises and mixed solutions by IBM, Oracle, Microsoft, Amazon, Google etc.

Data marts:- Repository specific to a domain/user group/business function (Types- independent, dependent, hybrid). Uses a
purpose-specific schema for ease of retrieval and for analytics.

Data lake:- Repository of raw data in its native form, without any preprocessing. Holds structured, semi-structured and
unstructured data. Cons- data duplication leads to excess storage and lower data quality.

Data lakehouse:- Aims for good data quality at lower storage cost, with data held under a schema. Combines the pros of both DWH
and data lake.

FACT and DIMENSION tables:-

FACT- quantitative/aggregated data (measures) of business processes; contains foreign keys to dimension tables.

DIMENSION- categorical variables used to filter and group fact data; describes business entities.
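
A minimal sketch of the fact/dimension split, using Python's built-in sqlite3 module. The table names (dim_product, dim_store, fact_sales), columns and sample rows are hypothetical, made up only to illustrate the idea.

```python
import sqlite3

# Hypothetical layout: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');
INSERT INTO dim_store VALUES (10, 'Pune'), (20, 'Delhi');
INSERT INTO fact_sales VALUES
    (1, 10, 5, 50.0), (2, 10, 2, 80.0), (3, 20, 1, 150.0), (1, 20, 3, 30.0);
""")

# The dimension supplies the grouping column; the fact table supplies the numbers.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)  # [('Kitchen', 150.0), ('Stationery', 160.0)]
```

The fact rows carry only keys and measures, so any descriptive attribute (category, city) is reached through a join on the matching dimension.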

Data is modeled into a FLAT, STAR or SNOWFLAKE schema, depending on the storage/query processing
requirement.

Why do we use these schemas, and how do they differ?

Star schemas are optimized for reads and are widely used for designing data marts (query boost), whereas snowflake
schemas are optimized for writes and are widely used for transactional data warehousing (writing/size boost).
• Normalization reduces redundancy and data size (normal forms 1NF to 5NF).
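
To make the read/size trade-off concrete, here is a small sketch of one normalization step that turns a star-style dimension into a snowflake-style pair of tables. The product data and the helper names (snowflake, lookup_department) are hypothetical and exist only for this illustration.

```python
# Hypothetical star-schema product dimension: category details repeat on every row.
star_dim_product = [
    {"product_id": 1, "name": "Pen",      "category": "Stationery", "department": "Office"},
    {"product_id": 2, "name": "Notebook", "category": "Stationery", "department": "Office"},
    {"product_id": 3, "name": "Mug",      "category": "Kitchen",    "department": "Home"},
]

def snowflake(rows):
    """Move the category attributes into their own table (one normalization step)."""
    category_ids = {}   # (category, department) -> surrogate key
    dim_category = []
    dim_product = []
    for row in rows:
        key = (row["category"], row["department"])
        if key not in category_ids:
            category_ids[key] = len(category_ids) + 1
            dim_category.append({"category_id": category_ids[key],
                                 "category": key[0], "department": key[1]})
        dim_product.append({"product_id": row["product_id"], "name": row["name"],
                            "category_id": category_ids[key]})
    return dim_product, dim_category

dim_product, dim_category = snowflake(star_dim_product)

def lookup_department(product_id):
    """Reading now needs an extra hop (a join) through dim_category."""
    product = next(p for p in dim_product if p["product_id"] == product_id)
    category = next(c for c in dim_category
                    if c["category_id"] == product["category_id"])
    return category["department"]

print(len(dim_category), lookup_department(2))  # 2 Office
```

Each category is now stored once (less redundancy, smaller size), at the price of an extra lookup on every read that needs category attributes.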

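Data cube representation:-

Slicing- one layer of the cube is cut out by fixing a single value of one dimension

Dicing- a large cube is filtered into a smaller cube by restricting values on several dimensions

Drill up and down- move up to more summarized layers or down to more detailed layers

Pivoting- rearrange (rotate) the view of the cube

Rolling up- summarize data along a dimension using aggregate functions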
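
The cube operations above can be sketched in plain Python on a tiny cube stored as a dictionary. The dimension names (year, region, product), the cell values and the function names are assumptions made for this example, not part of any particular OLAP tool.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, region, product) -> sales amount.
DIMS = ("year", "region", "product")
cube = {
    (2023, "East", "Pen"): 10, (2023, "East", "Mug"): 20,
    (2023, "West", "Pen"): 30, (2024, "East", "Pen"): 40,
    (2024, "West", "Mug"): 50,
}

def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, allowed):
    """Dice: keep cells whose value is in the allowed set for each listed dimension."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def roll_up(cube, keep):
    """Roll up: sum away every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)

def pivot(cube, order):
    """Pivot: reorder the dimensions of every cell key."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}

sliced = slice_cube(cube, "year", 2023)                          # 3 cells
diced = dice_cube(cube, {"year": {2023}, "region": {"East"}})    # 2 cells
by_year = roll_up(cube, ["year"])                                # drill up
by_year_region = roll_up(cube, ["year", "region"])               # drill down one level
pivoted = pivot(cube, ["product", "region", "year"])
print(by_year)  # {(2023,): 60, (2024,): 90}
```

In this sketch drilling up and down is just rolling up with fewer or more dimensions kept.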

1. GROUPING SETS- subtotals for every requested tuple of items
2. CUBE- subtotals/totals for every combination of the categories and for each single category
3. ROLLUP- subtotals along the hierarchy of the listed columns, plus a grand total
4. Materialized views:- snapshot of the result of an SQL query; used to replicate data (e.g. in a staging database) or to
precompute expensive queries for the DWH
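
A rough pure-Python sketch of what GROUPING SETS, ROLLUP and CUBE compute. The sample rows and the helper names (grouping_sets, rollup_sets, cube_sets) are hypothetical; a real warehouse would do this in SQL. `None` stands in for a column that has been aggregated away, which assumes the data itself contains no `None` values.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows.
rows = [
    {"region": "East", "product": "Pen", "amount": 10},
    {"region": "East", "product": "Mug", "amount": 20},
    {"region": "West", "product": "Pen", "amount": 30},
]

def grouping_sets(rows, columns, sets, measure="amount"):
    """Sum `measure` once per requested grouping set; None marks a rolled-up column."""
    totals = defaultdict(int)
    for grouping in sets:
        for row in rows:
            key = tuple(row[c] if c in grouping else None for c in columns)
            totals[key] += row[measure]
    return dict(totals)

def rollup_sets(columns):
    """ROLLUP(a, b) -> (a, b), (a), (): hierarchy prefixes down to the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE(a, b) -> (a, b), (a), (b), (): every combination of the columns."""
    return [combo for n in range(len(columns), -1, -1)
            for combo in combinations(columns, n)]

cols = ["region", "product"]
rollup_totals = grouping_sets(rows, cols, rollup_sets(cols))
cube_totals = grouping_sets(rows, cols, cube_sets(cols))

print(rollup_totals[("East", None)], rollup_totals[(None, None)])  # 30 60
print(cube_totals[(None, "Pen")])                                  # 40
```

ROLLUP yields per-region subtotals and the grand total but no per-product subtotal; CUBE adds the per-product subtotals as well.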

DWH architecture:-

Data sources (DB, data lakes, ERP, OLTPs) -> ETL processing w/o staging area -> DWH -> Data mart -> Reporting/analytical tools
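
The flow above can be sketched end to end with Python's built-in sqlite3 module. This version assumes a staging table is used; all table names (staging_orders, dwh_orders, mart_customer_sales) and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical source extract: raw OLTP order rows (id, customer, amount as text).
source_rows = [(1, " alice ", "10.50"), (2, "BOB", "7.25"), (3, "alice", "4.00")]

dwh = sqlite3.connect(":memory:")
dwh.executescript("""
CREATE TABLE staging_orders (order_id INTEGER, customer TEXT, amount TEXT);
CREATE TABLE dwh_orders     (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
""")

# Extract: land the raw rows in the staging area unchanged.
dwh.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", source_rows)

# Transform + load: clean the names, cast the amounts, write to the warehouse table.
dwh.execute("""
    INSERT INTO dwh_orders
    SELECT order_id, LOWER(TRIM(customer)), CAST(amount AS REAL)
    FROM staging_orders
""")

# Data mart: a function-specific summary derived from the warehouse.
dwh.execute("""
    CREATE TABLE mart_customer_sales AS
    SELECT customer, SUM(amount) AS total
    FROM dwh_orders
    GROUP BY customer
""")

# Reporting/analytical tools would query the mart, as this final SELECT does.
mart = dwh.execute(
    "SELECT customer, total FROM mart_customer_sales ORDER BY customer"
).fetchall()
print(mart)  # [('alice', 14.5), ('bob', 7.25)]
```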


Data quality concerns:-

• Accuracy (match b/w source and target system)
• Completeness (missing, null, invalid values)
• Consistency (datatypes, data fields, names etc.)
• Currency (up-to-date information)

Managing DQ:- Detect -> Capture -> Report -> Investigate -> Diagnose -> Correct, and then automate the workflows
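
A small sketch of the detect step, with one simple check per quality dimension listed above. The rows, field names, the 7-day freshness threshold and the function name detect_issues are all assumptions chosen for illustration; real rules depend on the business.

```python
from datetime import date, timedelta

# Hypothetical source and target rows, keyed by row id.
source = {
    1: {"email": "a@x.com", "amount": 10.0, "loaded": date(2024, 1, 10)},
    2: {"email": "b@x.com", "amount": 20.0, "loaded": date(2024, 1, 10)},
    3: {"email": "c@x.com", "amount": 30.0, "loaded": date(2024, 1, 10)},
}
target = {
    1: {"email": "a@x.com", "amount": 10.0, "loaded": date(2024, 1, 10)},
    2: {"email": None,      "amount": "20", "loaded": date(2023, 12, 1)},
    3: {"email": "c@x.com", "amount": 35.0, "loaded": date(2024, 1, 10)},
}

def detect_issues(source, target, today, max_age_days=7):
    """Return (row id, quality dimension, detail) tuples for later reporting."""
    issues = []
    for key, src in source.items():
        tgt = target.get(key)
        if tgt is None:
            issues.append((key, "completeness", "row missing in target"))
            continue
        for field, value in tgt.items():
            if value is None:
                issues.append((key, "completeness", f"{field} is null"))
            elif type(value) is not type(src[field]):
                issues.append((key, "consistency", f"{field} type differs"))
            elif field != "loaded" and value != src[field]:
                issues.append((key, "accuracy", f"{field} differs from source"))
        loaded = tgt.get("loaded")
        if loaded is not None and today - loaded > timedelta(days=max_age_days):
            issues.append((key, "currency", "row is stale"))
    return issues

issues = detect_issues(source, target, today=date(2024, 1, 12))
for issue in issues:
    print(issue)
```

The resulting list is what the capture and report steps would store and surface; investigation, diagnosis and correction remain manual until the workflow is automated.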

1.
Question 1
What do we call a normalized version of the star schema?
1 / 1 point
Product schema
Normalized schema
Parent dimension
Snowflake schema
Correct
Correct, the normalized version of the star schema is called a snowflake schema, due to its multiple layers of
branching which resembles a snowflake pattern.
2.
Question 2
Considering a general architectural model for an Enterprise Data Warehouse, which of these components is holding
data and developing workflows?
1 / 1 point
Enterprise data warehouse repository
Staging and sandbox areas
Data sources
Data marts
Correct
Correct, staging and sandbox areas are the components that hold data and are used to develop workflows.
3.
Question 3
Materialized Views can be set up to have different refresh options, such as: (Select 1 answer).
1 / 1 point
Populated
Never, upon request, and immediately
Automatically
Manually refresh
Correct
Materialized Views can be set up to have different refresh options, such as “never” (they are only populated when
created, which is useful if the data seldom changes), “upon request” (manually refresh, for example, after changes
to the data have been made, or scheduled refresh, for example, after daily data loads), and “immediately”
(automatically refresh after every statement).
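
The three refresh options can be imitated in a short sketch. SQLite has no native materialized views, so this example emulates one with an ordinary table rebuilt from a stored query; the table names, the `mode` argument and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('East', 10.0), ('West', 20.0);
""")

VIEW_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

def refresh(conn):
    """Rebuild the snapshot table from the stored query."""
    conn.execute("DROP TABLE IF EXISTS mv_sales_by_region")
    conn.execute(f"CREATE TABLE mv_sales_by_region AS {VIEW_QUERY}")

def insert_sale(conn, region, amount, mode):
    """Write to the base table; refresh right away only in 'immediately' mode."""
    conn.execute("INSERT INTO sales VALUES (?, ?)", (region, amount))
    if mode == "immediately":
        refresh(conn)

def read_view(conn):
    """Read the snapshot, not the base table."""
    return conn.execute(
        "SELECT region, total FROM mv_sales_by_region ORDER BY region"
    ).fetchall()

refresh(conn)                                    # populated once, at creation
insert_sale(conn, "East", 5.0, mode="upon request")
stale = read_view(conn)                          # still the old totals
refresh(conn)                                    # manual or scheduled refresh
fresh = read_view(conn)
insert_sale(conn, "West", 1.0, mode="immediately")
auto = read_view(conn)                           # refreshed by the write itself

print(stale)  # [('East', 10.0), ('West', 20.0)]
print(fresh)  # [('East', 15.0), ('West', 20.0)]
print(auto)   # [('East', 15.0), ('West', 21.0)]
```

The "never" option corresponds to calling refresh only once at creation, so reads keep returning the first snapshot.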
4.
Question 4
Accumulating snapshot fact tables are used to __________.
0 / 1 point
extract data
process events
load data
record events
Incorrect
Incorrect, please review the Facts and Dimensional Modeling video.
5.
Question 5
In what location is data from source systems extracted to?
1 / 1 point
Target systems
Operating system
Staging area
Business intelligence platform
Correct
Correct, a staging area is a separate location where data from source systems is extracted to.
6.
Question 6
Materialized views can be used to __________.
1 / 1 point
safely work without affecting source database
automatically save query results
replicate data
synchronize updates
Correct
Correct, they can be used to replicate data, for example to be used in a staging database.

• 2 design approaches of DWH:- Top-down (SRC -> DWH -> DM) and Bottom-up (SRC -> DM -> DWH)
