2024 Datawarehousing Week 1

This document outlines a course on data warehousing and business intelligence. It introduces key concepts like data warehouses, OLTP databases, and ETL. A data warehouse integrates data from multiple sources into a unified model for analysis. It is subject-oriented, integrated, time-variant, and non-volatile. The course will cover topics such as OLAP, NoSQL databases, and business intelligence applications. Attendance of lectures and completion of exercises is required to pass.

Data Warehousing and Business Intelligence
Lecturer: Jiaheng Lu
Department of Computer Science
University of Helsinki

Outline
• About the course
• Databases and the limits with OLTP databases
• What is a Data Warehouse?
• Components of a Data Warehouse
• ETL phases

13.3.2024 2
Main topics of this course

• This course covers selected topics in databases and big data, including:
• Data warehousing
• OLAP and Business intelligence
• NoSQL databases
• Multi-model databases

Prerequisite course
• Introduction to databases
• Bachelor-level
• We assume that you are familiar with:
• Formulating SQL queries
• Transactions and the ACID properties
• Hands-on experience running SQL queries against a database
Schedule of the course 2024
          Wednesday                                Friday
Week 1    Lecture 1: Data warehousing              Reading paper together
Week 2    Lecture 2: OLAP and BI (I)               Tutorial 1
Week 3    Lecture 3: OLAP and BI (II)              Tutorial 2
Week 4    Lecture 4: Big data and data warehouse   Tutorial 3
Week 5    Lecture 5: NoSQL databases               Lecture 6: Multi-model databases
Week 6    Question and Answer (QA) session         Student presentation (26.04)
Week 7    Holiday                                  Student presentation (03.05)

Attendance
• To pass this course, you must submit answers to the four exercises and give an online presentation.
• Attending the reading-paper session and answering its questions is also compulsory; attendance at all other sessions is optional.

• No final examination
Grading
Parts Points (Up to)
Four exercises 69
Reading-paper 3
Presentation 25
Feedback on Presentation 3

All exercises will be published in Moodle.


Exercise 1 is out.

Grading of this course

Score     Grade
<51       Abandoned
51-60     1
61-70     2
71-80     3
81-90     4
91-100    5

Textbook (1)
• Fundamentals of database systems
• Elmasri Ramez, Navathe Shamkant B.
• 2017 Seventh edition, Global edition.
• This book is available online through our university library
• Chapters 24, 25, 29

Textbook (2)
• Principles of Database Management: The Practical Guide to Storing,
Managing and Analyzing Big and Small Data

• Part 4: Data Warehousing, Data Governance and (Big) Data Analytics

• https://www.pdbmbook.com/
Recommended books and links
• “The data warehouse toolkit: the complete guide to dimensional modeling”. John Wiley & Sons, 2013.
→ Authors: Ralph Kimball and Margy Ross.
• “Building the data warehouse”. John Wiley & Sons, 2005.
→ Author: William H. Inmon

Useful links:

https://www.1keydata.com/datawarehousing/datawarehouse.html
Learning objectives: Week 1
• Understand the limits with OLTP databases
• Can explain the four characteristics of a data warehouse
• Know the main components of a data warehouse
• Can compare different terms, including standard DB, data warehousing, OLTP, and heterogeneous databases
• Understand ETL phases and their main issues
Learning objectives: Week 2
• Understand the star schema, snowflake schema, and fact constellations
• Can formulate OLAP operations: drilling, rolling, slicing, dicing and pivoting
• Understand MOLAP, ROLAP and HOLAP
Learning objectives: Week 3
• Advanced SQL queries for data analytics
• Understand Business intelligence (BI) and its applications
• Understand independent and dependent data marts
Learning objectives: Week 4
• Understand the difference between the Inmon and Kimball approaches to data warehousing
• Know the six V’s of big data
• Understand the Lambda and Kappa big data architectures
• Know the main products for data warehousing and BI
• Understand the difference between a real-time data warehouse and a traditional data warehouse
Learning objectives: Week 5
• Know the difference between ACID and BASE
• Understand the CAP theorem
• Understand various data models and their operations, including the relational, semi-structured, and graph data models
• Know the four types of NoSQL databases (key-value, document, wide-column, and graph) and their different data storage approaches
Learning objectives: Week 6
• Know the motivation for multi-model databases and polystores
• Understand the current approaches for multi-model data storage and querying
• Understand a unified categorical model for multi-model databases
AI tools for the course
• We follow the university-level guidelines.
• In particular, if you use a language model to produce the work you are
returning, you must report in writing which model (e.g. ChatGPT, Bing AI) you
have used and in what way.
• Failing to report the use of a language model as instructed is treated as
cheating.
• Watch an introductory video about data warehouses

• https://www.youtube.com/watch?v=AHR_7jFCMeY
Outline
• About the course
• Databases and the limits with OLTP databases
• What is a Data Warehouse?
• Components of a Data Warehouse

What is a Database?

• A database is a collection of related data.
• For example: the names, telephone numbers, and addresses of people.
• This collection of data is stored on a hard drive and managed by database software.

A database is more than a random collection of data
• A database represents some aspect of the real world (not random
data). Changes to the world are reflected in the database.

• A database is a logically coherent collection of data.

DBMS and OLTP
• A DBMS (database management system) is general-purpose software that facilitates the process of defining, constructing, manipulating, and sharing databases among users and applications.
• OLTP (Online Transaction Processing) is a category of data processing that is focused on transaction-oriented tasks. OLTP typically involves inserting, updating, and/or deleting small amounts of data in a database.
• Examples of OLTP transactions include online banking, purchasing a book online, booking an airline ticket, and order entry.

The limits with OLTP databases
• Operational (OLTP) databases are designed to record transactions from daily operations. They are optimized to update or create individual records efficiently
• Limits:
• Transactional systems were not designed for decision support analysis
• Data on transactional systems changes constantly, so OLTP databases lack historical data.
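A minimal sketch of the second limit, assuming a hypothetical `account` table in an in-memory SQLite database (the table and values are illustrative, not from the lecture): an in-place OLTP update overwrites the previous value, so the earlier balance can no longer be queried.

```python
import sqlite3

# Hypothetical OLTP table: one row per account, holding only the current balance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0)")

# A typical OLTP transaction: a small in-place update of a single record.
with conn:
    conn.execute("UPDATE account SET balance = balance - 30.0 WHERE id = 1")

# Only the current state remains; the earlier balance of 100.0 is not stored anywhere.
rows = conn.execute("SELECT id, balance FROM account").fetchall()
print(rows)
```

Answering "what was the balance last month?" would need a separate history store, which is one motivation for a data warehouse.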
What is Data Warehousing?
• Data warehousing is an architectural model designed to gather data from various sources into a single unified data model for analysis purposes.
• The term was introduced in 1990 by William Inmon
• In a data warehouse, the data is:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales.
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.

• Image link: https://handbook.magestore.com/books/data-warehouse---tutorial/page/data-warehouse-tutorial
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding
structures, attributes, etc. among different data sources
• E.g., currency, tax, etc.
• When data is moved to the warehouse, it is converted.
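As a rough illustration of the conversion step, the sketch below maps records from two hypothetical sources with different field names and currencies onto one common layout. The field names and the exchange rate are made-up assumptions chosen only to keep the arithmetic simple.

```python
# Made-up conversion rates for illustration only (not real exchange rates).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5}

def from_source_a(rec):
    # Hypothetical source A: field 'cust_name', amounts always in USD.
    return {
        "customer": rec["cust_name"].strip().title(),
        "amount_eur": rec["amt"] * RATES_TO_EUR["USD"],
    }

def from_source_b(rec):
    # Hypothetical source B: field 'CustomerName', currency given per record.
    return {
        "customer": rec["CustomerName"].strip().title(),
        "amount_eur": rec["amount"] * RATES_TO_EUR[rec["currency"]],
    }

source_a = [{"cust_name": "jane doe ", "amt": 200.0}]
source_b = [{"CustomerName": "John Smith", "amount": 50.0, "currency": "EUR"}]

# After conversion, every record uses the same names, encoding and currency.
integrated = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
print(integrated)
```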

Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems.
• OLTP database: current value data.
• Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
• Contains an element of time, explicitly or implicitly
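One possible way to make the time element explicit is to store every value together with the date from which it is valid. The sketch below assumes a hypothetical `price_history` table in SQLite; the names and data are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row carries an explicit time element (valid_from, ISO date string).
conn.execute("CREATE TABLE price_history (product TEXT, price REAL, valid_from TEXT)")
conn.executemany(
    "INSERT INTO price_history VALUES (?, ?, ?)",
    [("book", 20.0, "2020-01-01"), ("book", 25.0, "2023-01-01")],
)

def price_as_of(product, day):
    """Return the price that was valid on the given ISO date, or None."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE product = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (product, day),
    ).fetchone()
    return row[0] if row else None

print(price_as_of("book", "2021-06-30"))  # historical value
print(price_as_of("book", "2024-03-13"))  # current value
```

An OLTP table would typically keep only the latest price; here the old row is kept, so both questions can be answered.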

Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the
operational environment.
• Operational update of data does not occur in the data warehouse
environment.
• Does not require transaction processing, recovery, and
concurrency control mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual heterogeneous
sites involved, and the results are integrated into a global answer set.
• Data warehouse: high performance
• Information from heterogeneous sources is integrated in advance and stored
in warehouses for direct query and analysis. No wrapper/mediators.
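The contrast can be sketched with two toy in-memory sources. All names here (`wrap_a`, `wrap_b`, `mediator_query`, `warehouse_query`) are hypothetical and only illustrate where the integration work happens: at query time in the mediator approach, in advance in the warehouse approach.

```python
# Two hypothetical heterogeneous sources with different field names.
source_a = [{"cust": "Ann", "total": 10}, {"cust": "Bob", "total": 5}]
source_b = [{"customer_name": "Ann", "sum": 7}]

# Wrappers translate a global query into each source's own vocabulary.
def wrap_a(name):
    return [r["total"] for r in source_a if r["cust"] == name]

def wrap_b(name):
    return [r["sum"] for r in source_b if r["customer_name"] == name]

def mediator_query(name):
    # Query-driven: every source is contacted each time a query is posed.
    return sum(wrap_a(name)) + sum(wrap_b(name))

# Warehouse: integrate once, in advance, into a single store.
warehouse = {}
for r in source_a:
    warehouse[r["cust"]] = warehouse.get(r["cust"], 0) + r["total"]
for r in source_b:
    warehouse[r["customer_name"]] = warehouse.get(r["customer_name"], 0) + r["sum"]

def warehouse_query(name):
    # Direct lookup; no wrappers or mediators are involved at query time.
    return warehouse.get(name, 0)

print(mediator_query("Ann"), warehouse_query("Ann"))
```

Both functions give the same answer; the difference is when the sources are accessed.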

Heterogeneous DBMS

Figure source link: https://www.researchgate.net/figure/the-IGN-E-case-of-heterogeneous-databases_fig8_226823712
Data Warehousing
• Not a product, it is a process
• Combination of hardware and software
• Can often be set up as one VLDB (Very Large Database) or a collection
of subject areas called Data Marts.

Image link: https://corporatefinanceinstitute.com/resources/knowledge/other/data-warehousing/

• Answer some questions online:
• https://pollev.com/jiahenglu471
Components of a Data Warehouse

Components:
• Hardware
• Database Management System
• Front-End Access Tools and other tools
Components of a Data Warehouse - Hardware

• Power - number of processors, memory, I/O bandwidth
• Availability - redundant equipment
• Disk storage - speed and enough storage for the loaded data set
• Backup solution - automated, allowing incremental backups and archiving of older data
Components of a Data Warehouse - DBMS

• Physical storage capacity of the DBMS
• Loading, indexing, and processing speed
• Ability to handle your data needs
• Operational integrity, reliability, and manageability
Components of a Data Warehouse - Front End & Other Tools
• Query Tools (SQL & GUI based)
• Report Writers
• Metadata Repositories
• OLAP (Online Analytical Processing)
• Data Mining Products
Metadata Repositories

Metadata is Data about Data. Users and developers often need a way to
find information on the data they use.
Information can include:
• Source System(s) of the Data, contact information
• Related tables or subject areas
• Programs or Processes which use the data
• Population rules (Update or Insert and how often)
• Status of the Data Warehouse’s processing and condition
• ……
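A metadata repository entry could be modelled roughly as below. This is only a sketch: the `TableMetadata` class, its field names and the example values are assumptions that mirror the bullet points above, not a standard metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Hypothetical metadata record describing one warehouse table."""
    table: str
    source_systems: list            # where the data comes from
    contact: str                    # who to ask about the source
    related_tables: list = field(default_factory=list)
    used_by: list = field(default_factory=list)   # programs or processes using the data
    population_rule: str = ""       # update or insert, and how often
    last_load_status: str = "unknown"

# The repository itself: a simple lookup from table name to its metadata.
repository = {}

def register(meta):
    repository[meta.table] = meta

register(TableMetadata(
    table="sales_fact",
    source_systems=["order_entry"],
    contact="dw-team@example.org",
    related_tables=["customer_dim"],
    used_by=["monthly_sales_report"],
    population_rule="insert, nightly",
    last_load_status="ok",
))

print(repository["sales_fact"].source_systems)
```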
Data warehouse metadata

Source: https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_912
Data Mining

• Analyzes large amounts of data (usually contained in a Data Warehouse) and looks for trends in the data
• Technology, enhanced with machine learning techniques, now allows us to do this better than in the past.
Key Data Mining Techniques
• Clustering.
• Association.
• Classification.
• Machine Learning.
• Prediction.
• Deep Neural Networks, etc.
OLTP vs. Data Warehousing
• Organized by transaction vs. organized by subject
• More users vs. fewer users
• Accesses a few records vs. scans entire tables
• Smaller database vs. larger database
• Continuous updates vs. periodic updates
Data Warehouse vs. standard DB
Standard DB                     Data Warehouse
• Mostly updates                Mostly reads
• Many small transactions       Queries are long and complex
• MB - GB of data               GB - TB of data
• Current snapshot              History
• Index/hash on primary keys    Lots of scans
• Raw data                      Summarized, reconciled data
• Thousands of users            Hundreds of users (e.g., decision-makers, analysts)
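The two access patterns can be sketched on one small table, assuming a hypothetical `orders` table in SQLite: a standard DB workload typically touches one row through its primary key, while a warehouse-style query scans and summarizes many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, year, amount) VALUES (?, ?, ?)",
    [("Ann", 2022, 10.0), ("Bob", 2022, 20.0), ("Ann", 2023, 30.0), ("Bob", 2023, 40.0)],
)

# Standard DB / OLTP style: fetch a single record via the primary key index.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (3,)
).fetchone()

# Warehouse style: scan the whole table and summarize it by year.
yearly = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()

print(one_order)
print(yearly)
```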
• Read slides from page 46 to 55, and answer questions.
ETL phases

Three steps:

1. Extraction Phase: Get the data

2. Transformation Phase: Make it useful

3. Loading Phase: Save it to the warehouse
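The three phases can be sketched as three small functions. This is a toy pipeline under stated assumptions: the source is a CSV string, the warehouse is an in-memory SQLite table named `sales`, and all names and data are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system.
RAW = "customer,amount\nann,10.5\nBOB,not_a_number\nann,4.5\n"

def extract(text):
    # Extraction: get the data out of the source as a list of dicts.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transformation: drop unparsable amounts and unify name formatting.
    clean = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # skip invalid records
        clean.append((r["customer"].strip().title(), amount))
    return clean

def load(conn, rows):
    # Loading: save the cleaned rows to the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
total = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
).fetchall()
print(total)
```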


ETL (1)

Extraction Phase:
• Source systems export data via files, or populate the warehouse directly when the databases can “talk” to each other
• The data is transferred to the Data Warehouse server and put into some sort of staging area
Issues:
• The warehouse uses a relational data model or a multi-dimensional data model (e.g., data cube)
• The sources, on the other hand, may use different data models:
• Relational, hierarchical, graph
• How do we get the data out?
• We will discuss this together with multi-model databases in this course.
ETL (2)

Transformation Phase:
• Takes data and turns it into a form that is suitable for insertion into
the warehouse
• Combines related data
• Removes redundancies
• Use common codes (Commercial Customer)
• Clean spelling mistakes
• Consistency (e.g. PA, Pa, Penna, Pennsylvania)
• Formatting (e.g. addresses)
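The consistency item can be illustrated with a small lookup table that maps the spelling variants from the slide onto one common code. The mapping and the `normalize_state` helper are hypothetical; a real warehouse would maintain a much larger reference table.

```python
# Hypothetical lookup table: known spelling variants mapped to one common code.
STATE_CODES = {"pa": "PA", "penna": "PA", "pennsylvania": "PA"}

def normalize_state(value):
    """Map a free-text state value to its common code, if a variant is known."""
    key = value.strip().rstrip(".").lower()
    # Unknown values are passed through (trimmed) rather than guessed.
    return STATE_CODES.get(key, value.strip())

for raw in ["PA", "Pa", "Penna", "Pennsylvania"]:
    print(raw, "->", normalize_state(raw))
```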
ETL (3)

Loading Phase:
• Places the cleaned data into the DBMS in its final, usable form
• Compare data from source systems and the Data Warehouse
• Document the load information for the users
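Comparing source and warehouse data after a load could look like the sketch below, which checks row counts and totals and returns a small report that can be documented for users. The `reconcile` function and the integer amounts (e.g. cents) are illustrative assumptions.

```python
def reconcile(source_rows, warehouse_rows):
    """Compare row counts and amount totals between source and warehouse."""
    report = {
        "source_count": len(source_rows),
        "warehouse_count": len(warehouse_rows),
        "source_total": sum(amount for _, amount in source_rows),
        "warehouse_total": sum(amount for _, amount in warehouse_rows),
    }
    # The load is considered consistent only if both checks agree.
    report["ok"] = (
        report["source_count"] == report["warehouse_count"]
        and report["source_total"] == report["warehouse_total"]
    )
    return report

# Hypothetical (id, amount) rows; amounts are integers to avoid rounding issues.
source = [(1, 1000), (2, 250)]
complete_load = [(1, 1000), (2, 250)]
partial_load = [(1, 1000)]

print(reconcile(source, complete_load))
print(reconcile(source, partial_load))
```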
Example ETL Process
[Figure: ETL data-flow diagram. Inputs: item records, customer records. Steps shown: split date-time, filter invalid, join, filter non-match, group by customer. Outputs: invoice line items, customer balance. Rejected rows: invalid dates/times, invalid items, invalid customers.]

• This is an example for e-commerce loading


Data Monitors
• Goal: Detect changes of interest and propagate to users
• How?
• Triggers
• Compare query results
• Compare snapshots/dumps
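The snapshot-comparison option can be sketched as a diff between two dictionaries keyed by primary key. The `diff_snapshots` helper and the sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Return (inserted, deleted, updated) between two snapshots keyed by primary key."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated

# Hypothetical snapshots of a customer table taken at two points in time.
yesterday = {1: "Ann", 2: "Bob", 3: "Cid"}
today = {1: "Ann", 2: "Robert", 4: "Dee"}

inserted, deleted, updated = diff_snapshots(yesterday, today)
print(inserted, deleted, updated)
```

Only the detected changes then need to be propagated to the warehouse, at the cost of keeping and comparing full snapshots.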
Data Integration
• Receive data (changes) from multiple wrappers and integrate into
warehouse
• Rule-based
• Actions
• Resolve inconsistencies
• Eliminate duplicates
• Summarize data
• etc.
Data Cleansing
• Find (& remove) duplicate tuples
• e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
• Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
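Finding near-duplicates such as "Jane Doe" vs. "Jane Q. Doe" can be approximated with string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.8 threshold is an arbitrary assumption, and the function only flags candidate pairs for review rather than removing anything.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    # Case-insensitive similarity ratio between 0.0 and 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(names, threshold=0.8):
    """Return pairs of names whose similarity reaches the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]

names = ["Jane Doe", "Jane Q. Doe", "John Smith"]
candidates = find_duplicates(names)
print(candidates)
```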
Data cleansing example:

• Example link: https://quantdare.com/data-cleansing-and-transformation/

• Answer some questions online:
• https://pollev.com/jiahenglu471
Learning objectives of this lecture
• Understand the limits with OLTP databases
• Can explain the four characteristics of a Data Warehouse
• Know the main components of a Data Warehouse
• Understand ETL phases and their main issues
• Can compare different terms, including standard DB, data warehousing, OLTP, and heterogeneous databases
Paper reading on Friday (3 points)
• An Overview of Data Warehousing and OLAP Technology
• Please read the paper and the questions before attending the session. Students will be split into groups for discussion and then come back together to answer the assigned questions.
• Please submit your answers to the assigned questions in Moodle to receive 3 points.
Homework
• Read textbook 1, Chapter 29

• Read textbook 2, Chapter 17

• Work on Exercise 1.
