
Data Warehouse Design Decisions
August 2015

Colleen Barnitz
Director, IT Development
MVT Services
Colleen Barnitz

 Over 20 years in IT
 Has worked with SQL Server since version 6.5
 Developer and architect on data warehouse projects during that time
 Leads a team of developers providing applications and business intelligence solutions for the enterprise
 Co-founded and is active in the Las Cruces & El Paso SQL Server User Group

 Twitter: @ColleenBarnitz
 Email: [email protected]

It’s An Exciting Time to be in the Data Business…

Agenda

IT and Data Services

Traditional Data Warehouse


• Schema
• ETL
• OLAP

Big Data
• Hadoop

Streaming Data
• Event Processing

Integration – one big happy data family

Data Services

• Businesses need to adapt quickly to changing conditions

• IT’s role is to make data resources available for decision making

• Data for both analysis and applications

• Make it pain-free for users

• Be prepared to be Agile

What is a Data Warehouse?

 Central repository for integrated data from one or more sources

 Concept originated in the late 1980s

 Stores current and historical data

 Typically built on a relational database

 Schema-driven

 Optimized for query and analysis

Built for the Business

 A source of approved data for your users and applications

 User friendly – cryptic table and column names are renamed

 Spares users from researching which data is relevant and which isn’t

 Contains approved business rule logic

 Prevents querying directly against line-of-business systems

Still valid after all these years…

• For sub-terabyte data:

• Set up an RDBMS on a server or in a cloud environment

• Implement your data model

• Load your data

• Hook up your front-end BI tools

The Data Warehouse Project

 Project Management basics

 Get senior management sponsorship

 Define and prioritize specific business needs to be addressed

 Define data architecture to be used

 Define the technology stack

 Iterate to add increasing value

 Sustain the data warehouse with governance

 Promote and train data warehouse users

Data Modeling

Choices:
 Kimball – Dimensional Modeling – Star Schemas

 Inmon – model in 3rd normal form first, then build the data warehouse

 Data Vault – Hub, Links, and Satellites

 Anchor Modeling – a variation on the Data Vault

Kimball Methodology
Enterprise Data Warehouse Bus Architecture

Dimensional Modeling

 Gather business requirements and source data realities

 Select the business process

 Declare the grain

 Identify the dimensions

 Identify the facts

 Get Business buy-in

Star Schema

 Fact table(s) linked to their associated dimension tables


 Linked by Primary / Foreign key relationships

 OLAP cubes can be derived from the star schema (a query sketch follows below)
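
As a concrete (and purely illustrative) sketch of querying a star schema, here is a minimal Python example using pyodbc. The connection string, the FactSales fact table, and the DimDate/DimProduct dimensions are all hypothetical names, not from this deck:

```python
import pyodbc

# Placeholder connection string -- adjust driver/server/database for your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=SalesDW;Trusted_Connection=yes;"
)

# Classic star-schema query: fact table joined to dimensions on
# primary/foreign keys, grouped by dimension attributes.
SQL = """
SELECT d.CalendarYear, p.ProductCategory, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON f.DateKey    = d.DateKey      -- FK -> PK
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey   -- FK -> PK
GROUP BY d.CalendarYear, p.ProductCategory
ORDER BY d.CalendarYear;
"""

for row in conn.cursor().execute(SQL):
    print(row.CalendarYear, row.ProductCategory, row.TotalSales)
```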

Dimension Tables

 Provide WHO, WHAT, WHERE, WHEN, WHY, and HOW context

 Descriptive attributes used for filtering, grouping

 Take time to design – critical for the end user’s BI experience

 Different flavors:
 Slowly Changing Dimensions – Types 0 through 7 (a Type 2 sketch follows below)
 Junk Dimensions
 Bridge Tables
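
A minimal in-memory sketch of Type 2 handling, assuming a hypothetical customer dimension with CustomerKey, CustomerID, City, EffectiveDate, ExpiryDate, and IsCurrent columns (all invented for illustration): when a tracked attribute changes, the current row is expired and a new version is inserted.

```python
from datetime import date

# One current row in the hypothetical dimension.
dim_customer = [
    {"CustomerKey": 1, "CustomerID": "C001", "City": "El Paso",
     "EffectiveDate": date(2014, 1, 1), "ExpiryDate": None, "IsCurrent": True},
]

def apply_scd2(dim, incoming, today=None):
    today = today or date.today()
    current = next((r for r in dim
                    if r["CustomerID"] == incoming["CustomerID"] and r["IsCurrent"]),
                   None)
    if current and current["City"] == incoming["City"]:
        return  # no change -- nothing to version
    if current:  # expire the old version, preserving history
        current["ExpiryDate"] = today
        current["IsCurrent"] = False
    dim.append({"CustomerKey": max((r["CustomerKey"] for r in dim), default=0) + 1,
                "CustomerID": incoming["CustomerID"], "City": incoming["City"],
                "EffectiveDate": today, "ExpiryDate": None, "IsCurrent": True})

apply_scd2(dim_customer, {"CustomerID": "C001", "City": "Las Cruces"})
# dim_customer now holds the expired El Paso row plus a current Las Cruces row.
```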

Conformed Dimensions

 Design common dimensions to be used for multiple fact tables

 Support drill-across and data integration

 Reuse shortens development time

Fact Tables

 Measurements from the business process

 Almost always numeric

 Only facts consistent with the grain

 A physically observable event – e.g. a sales transaction

 Kinds:
 Periodic
 Accumulating
 Temporal

Data Vault Modeling

 Unified Decomposition

 Decompose the data into:


 Business Keys
 Relationships
 Attributes

 Strengths:
 Agility
 Auditability
 History Tracking
 Easy to Automate (a structural sketch follows below)
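
A tiny structural sketch, assuming hypothetical hub and satellite structures for a customer; every row carries the load date and record source the next slide calls out. All names here are invented:

```python
from datetime import datetime, timezone

# Hub rows: (hub_key, business_key, load_date, record_source)
# Satellite rows: (hub_key, city, load_date, record_source)
hub_customer = []
sat_customer = []

def load_customer(business_key, city, source):
    now = datetime.now(timezone.utc)
    # The hub holds only the business key...
    hub_key = next((h[0] for h in hub_customer if h[1] == business_key), None)
    if hub_key is None:
        hub_key = len(hub_customer) + 1
        hub_customer.append((hub_key, business_key, now, source))
    # ...while descriptive attributes land in the satellite, uncleansed.
    sat_customer.append((hub_key, city, now, source))

load_customer("C001", "El Paso", "ERP")
```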

Data Vault Cont’d

 Every row must carry a record source and a load date


 No distinction between good and bad data (not cleansed)
 Parallel loading – can scale without major redesign

 Easy going in – but challenging getting data out


 More tables and MORE joins

 Not intended for direct access


 Will need to be translated to a dimensional or other data mart
model for distribution (lost me here)

Anchor Modeling

 Open source database modeling technique


 Designed to address constant change
 Large change on the outside of the model – small change within

 Concepts:
 Objects
 Attributes
 Relationships

 Based on sixth normal form


 Highly decomposed

Populating the Warehouse

ETL

 Extract – pull the data from the source systems
 Transform – apply transformations and integrity checking
 Load – load the data into the warehouse

 Design from the end (the required outputs) backwards

 Business rules are coded into the transformation process

 Only the data required for the output is included in the extraction process

 Tries to design for both current and future needs (a pipeline sketch follows below)
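
A minimal sketch of the pattern, assuming a hypothetical CSV extract, an invented business rule, and placeholder table/column names:

```python
import csv

def extract(path):
    # Pull only the columns the output needs from the source export.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        amount = float(row["amount"])
        if amount < 0:          # integrity check: reject bad rows
            continue
        # Hypothetical approved business rule: net out an 8% fee.
        row["net_amount"] = round(amount * 0.92, 2)
        yield row

def load(rows, cursor):
    cursor.executemany(
        "INSERT INTO dbo.FactOrders (order_id, net_amount) VALUES (?, ?)",
        [(r["order_id"], r["net_amount"]) for r in rows],
    )

# Wiring it together (warehouse_cursor would come from a pyodbc connection):
# load(transform(extract("orders.csv")), warehouse_cursor)
```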

ELT – Extract, Load, Transform
 “We don’t need no stinking transforms” interpretation

 Extract data sources into staging – perform integrity checks


 Load source data into the warehouse (cleaned, offline copy)
 Transform – create target output

 DW has a cleaned, offline copy of the source data

 Separates the transformation process from the Extract and Load.

 Separates loading from process design

 Removes the dependency on designing from the output backwards

 All the data is brought over and is now available for future needs

 Tradeoff of using a lot more database server resources

Automation of ETL/ELT is KEY
 Brittle designs hold back changes and new development
 SSIS editing can be very frustrating

 Code Generation
 BIML / BIDS Helper
 Write your own

 A code generator uses metadata about sources and destinations


 Map the views/tables
 Map the columns
 Flag the Business keys
 Flag the Update type (dimensional modeling)

 Helper views on the source can provide simple transforms (a generator sketch follows below)
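
A minimal sketch of metadata-driven generation for an ELT load: a table/column mapping drives a generated T-SQL MERGE statement. The mapping and all object names are placeholders:

```python
MAPPING = {
    "source": "src.Customers",
    "target": "dw.DimCustomer",
    "business_keys": ["CustomerID"],
    "columns": ["CustomerID", "CustomerName", "City"],
}

def generate_merge(m):
    # Join source to target on the flagged business keys.
    on = " AND ".join(f"t.{k} = s.{k}" for k in m["business_keys"])
    # Update the non-key columns; insert everything on a miss.
    sets = ", ".join(f"t.{c} = s.{c}" for c in m["columns"]
                     if c not in m["business_keys"])
    cols = ", ".join(m["columns"])
    vals = ", ".join(f"s.{c}" for c in m["columns"])
    return (f"MERGE {m['target']} AS t USING {m['source']} AS s ON {on}\n"
            f"WHEN MATCHED THEN UPDATE SET {sets}\n"
            f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals});")

print(generate_merge(MAPPING))
```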

ELT – Automation Friendly

 Not getting bogged down by destination design decisions

 Metadata mapping is straightforward

 Source, meet Target – GO!

OLAP – Online Analytical Processing

Analysis Services (SSAS) Options

 Multi-Dimensional mode

 Works well with star schemas

 Tabular Mode

 Forgiving, adaptable to non-star schema designs

 xVelocity in-memory analytics engine (VertiPaq)

Adding Data Scientists / Uber Analysts
to the Mix

 Extreme users of data

 All about insights into revenue, profitability, customer satisfaction

 Working directly with senior management

 They need to spend their time understanding what data means and
not creating clean, integrated data sets

Providing for the Analysts – A Sandbox

 An Analyst Sandbox complements the data warehouse

 Environment for:

 Experiments

 New ideas

 Hypothesis testing

 New data sources

 Evaluating and exploring new tools

*Minimally Governed

(Bob Becker – Design Tip #174 Kimball Group)

Analytics Sandbox

CAVEATS:

 Not for core business functions

 Not for ongoing reporting or analytics

 If an experiment pans out – turn it over for implementation in the data warehouse

Traditional Data Warehouse Flow

[Diagram: sources (ERP, web services) feed a staging area with audits; the data warehouse (source/persistent layer plus star schema) then feeds OLAP, reporting, ad hoc reporting, Power Pivot, Power BI, applications, and the analytics sandbox.]

Adapting to Changing Times

Old School DW

 very structured

 heavily designed

 subject to well-defined business rules

 tightly governed by the enterprise

 synchronized with production via regularly scheduled loads

 fairly rigid – takes time to react to new data and analytics requests

4 V’s of Big Data

 Variety: new sources and types of data

 Volume: amounts of data keep increasing

 Velocity: low-latency, real-time delivery expected

 Veracity: uncertainty about the quality of data

Volume

 Volume expanding tenfold every five years

 Log Files
 Social Media
 Click Stream
 Device generated
 Remote monitoring sensors
 RFID
 Spatial and GPS coordinates

 Sometimes it’s too big to copy


 Ralph Kimball - “Whale in the swimming pool”

Velocity

 Used to be good enough to load the data nightly

 We used to only analyze history, or near-history

 New way – take action on current data

Hadoop

 Stores non-relational, unstructured data

 Handles quickly arriving data

 Distributed processing across clusters of commodity computers

 Cost effective for increased capacity and processing

 Popular with companies dealing with petabyte-scale data

 Useful for sub-petabyte data as well

Map Reduce

 How Hadoop distributes work among its nodes (a sketch follows below)

 Map
 Split the data into pieces
 Processed in parallel on individual nodes
 Stores results locally

 Reduce
 Aggregate the data
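
A single-process Python sketch of the pattern (Hadoop distributes these same phases across the cluster's nodes); the word-count example is the standard illustration, not something from these slides:

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Runs in parallel, one split per node; emits (key, value) pairs.
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    # Aggregates the mapped output by key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

splits = ["big data big", "data lake"]   # the data, split into pieces
mapped = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(mapped))              # {'big': 2, 'data': 2, 'lake': 1}
```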

HDFS

 Hadoop Distributed File System

 Where the data is stored

 Linked across all the nodes in the cluster

Azure HDInsight

 Hortonworks Hadoop hosted in Azure

 Integrates with Excel, SQL Server Analysis Services, and SQL Server
Reporting Services

 Includes implementations of:
 HBase – NoSQL database built on Hadoop
 Apache Storm – processes large streams of data fast
 Pig – simpler scripting for MapReduce
 Sqoop – data import and export (ETL)
 Etc.

Spark for Azure HDInsight

 Significantly faster than MapReduce

 Can manipulate data in-memory

 Offers the interactivity of OLAP or DW column store

 Supports
 Batch and Interactive queries
 Real-time streaming
 Machine learning
 Graph processing

 Integrated with the Power BI service (a PySpark sketch follows below)
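
A minimal PySpark sketch, assuming a hypothetical CSV of sales data in the cluster's blob storage; the path and column names are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-demo").getOrCreate()

# Hypothetical sales extract; wasb:// is the Azure blob scheme HDInsight uses.
sales = spark.read.csv("wasb:///data/sales.csv", header=True, inferSchema=True)
sales.cache()   # manipulate the data in memory across queries

(sales.groupBy("ProductCategory")
      .agg(F.sum("SalesAmount").alias("TotalSales"))
      .orderBy("TotalSales", ascending=False)
      .show())
```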

Why incorporate Streaming Data?

• Real-time decisions for:

• Manufacturing Process

• Financial trading

• Web analytics

• Operational Analytics

Hybrid Data Warehouse Environment

What’s in the Data Lake?

 Archives

 Operational data

 Journals and Audits

 Logs

 Analytics Sandbox

Relational databases aren’t standing still…

SQL Server 2016 BI Features
 Updatable nonclustered columnstore index support, with the columnar index over an in-memory or on-disk rowstore

 PolyBase – distributed queries to Hadoop, blobs, files, and relational data in SQL Server
 reaches across on-premises and cloud sources
 options to import Hadoop data or archive relational data into Hadoop

 Master data management improvements / additions

 Revolution Analytics R baked in – invoked via T-SQL queries

 SSRS – improvements / additions
 Mobile-friendly reports

Columnstore

 Data that is:

 Logically organized as a table with row and columns

 Physically stored as columns

 VertiPaq technology

 All in memory if possible

 If not, uses cached, recently used data

Columnstore Index

 Technology for storing, retrieving, and managing data using a columnar data format

 Data is:
 Compressed
 Stored in column segments

 The query optimizer considers it for use just like a regular index

 Restrictions on data types that can be used
 e.g. no decimal with precision > 18, binary, text, varchar(max), etc.

NonClustered Columnstore Index

 Introduced in SQL Server 2012

 Order of columns in the index does not matter

 The underlying table cannot be updated while the index exists

 To update – drop and rebuild the index, or swap out partitions

Clustered Columnstore Index

 New to SQL Server 2014

 Includes all columns in table

 No other indexes allowed on table

 Can update with Insert, Update, Delete operations

 Cannot use unique, primary key or foreign key constraints

 Cannot create on a table with computed or sparse columns (a creation sketch follows below)
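
A minimal creation sketch, assuming a hypothetical dbo.FactSales table and a placeholder DSN; the DDL itself is standard SQL Server 2014+ syntax:

```python
import pyodbc

# Placeholder connection -- point this at your own warehouse.
conn = pyodbc.connect("DSN=SalesDW", autocommit=True)

# A clustered columnstore index covers every column and replaces the
# rowstore, so no column list is given.
conn.execute("CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;")

# For comparison, a nonclustered columnstore index (read-only, per the
# previous slide) names the columns it covers:
# CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
#     ON dbo.FactSales (DateKey, ProductKey, SalesAmount);
```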

Columnstore Index Use Case

 Best for data warehouse workloads

 Read-only queries analyzing large sets of aggregated data

 Best for bulk load operations

 Not suited to seek operations for specific data

SQL Server 2016 SSIS (ETL)

 SSIS designer will support previous versions.

 Support for Power Query as a data source for self-service ETL

 Incremental deployment options

 Custom logging levels

 Package templates for ETL code reuse

SQL Server 2016 – SSRS (Reporting)
 FINALLY - improvements and additional features

 Mobile-friendly reports, support for browsers on multiple platforms

 Native connectors for the latest versions of Microsoft data sources – SQL Server and SSAS
 third-party data sources – Oracle Database, Oracle Essbase, SAP BW, Teradata

 ODBC and OLE DB connectors for many more…

 New report themes and styles

 New chart types

 Greater control over parameter prompts

 Dynamic parameterized report design options


SQL Server 2016 - Analysis Services
Tabular Mode
 Improvements in enterprise readiness
 Query engine optimizations
 DirectQuery
 Parallel partition processing
 Advanced modeling
 New DAX functions
 DATEDIFF
 GEOMEAN
 PERCENTILE
 PRODUCT
 XIRR
 XNPV

Data Governance

• Establish standards and processes for handling data

• Implement master data management

• Involve the business in the process

Understand the New

• Unstructured / distributed database technologies (“NoSQL”)

• Pick the best technology for the purpose

What Does your DW look like?

• Find that compromise between your available resources and what YOUR users need

• Design with outcomes in mind

• Deliver initial results quickly, then adapt and iterate

• Build systems to change, not to last

• Pick a problem and get started!

References
The Microsoft Modern Data Warehouse
http://download.microsoft.com/download/C/2/D/C2D2D5FA-768A-49AD-8957-1A434C6C8126/The_Microsoft_Modern_Data_Warehouse_White_Paper.pdf

Josh Fennessy – resource for BI and Hadoop information
http://joshuafennessy.com/

HDInsight
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/

Kimball Group – Dimensional Modeling
http://www.kimballgroup.com/

Biml
http://bimlscript.com/

Announcing Spark for Azure HDInsight
http://blogs.technet.com/b/dataplatforminsider/archive/2015/07/10/microsoft-lights-up-interactive-insights-on-big-data-with-spark-for-azure-hdinsight-and-power-bi.aspx

References cont’d

Book: Modeling the Agile Data Warehouse with Data Vault – Hans Hultgren

Data Vault Modeling
http://danlinstedt.com/

Anchor Modeling
http://www.anchormodeling.com/

Las Cruces & El Paso SQL Server User Group
Meets Second Thursday of the Month at Noon

New Horizons Training Center
1625 Hawkins Blvd
El Paso, TX

* Talk to us if you’re from Las Cruces – with enough interest we may do alternating locations!

