Loading Facts
Lesson 1: Introduction
Using the Learning Sandbox Environment
Data Warehousing
Lesson 2: A Data Warehouse
Facts and Dimensions
Facts
Dimensions
The Dimensional Model
Selecting Facts and Dimensions
Star Schema
Lesson 3: Implementing the Dimensional Model, Part I
Creating the Date Dimension
Slowly Changing Dimensions
Type 0 SCD
Type 1 SCD
Type 2 SCD
Type 3 SCD
Type 4 SCD
Creating the Customer Dimension
Snowflake Schemas
Lesson 4: Implementing the Dimensional Model, Part II
Creating the Movie Dimension
Creating the Store Dimension
Creating Facts
Sales
CustomerCount
RentalCount
Lesson 5: Extract, Transform, Load (ETL)
What is ETL?
Logging and Auditing
Getting Data into the Warehouse
dimDate
dimCustomer
dimMovie
dimStore
Lesson 6: Tools for ETL
ETL--Past, Present, and Future
Getting Started with Talend Open Studio
Your First TOS Job
Lesson 7: ETL: The Date Dimension
Job Structure
Loading Data from Excel
Adding Columns to our Data Flow
Adding Data to dimDate
If you run into problems...
Lesson 8: Basic Dimension Processing
Loading dimMovie
Job Structure
Pre and Post Job
Logging
dimMovie
Performance
Lesson 9: SCD Processing
The Algorithm: Slowly Changing Dimensions
Implementing the Dimensions
dimCustomer
Does our SCD work?
dimStore
Lesson 10: Processing Facts, Part I
Orchestration
factCustomerCount
Lesson 11: Processing Facts, Part II
factSales
Lesson 12: Special Facts
Missing Keys
Debugging tELTMysqlMap
Handling Missing Keys
Aggregating
Deaggregating Data
Early Arriving Facts
These instructions will only show up here, at the beginning of this course. Although you probably won't
need these instructions again, feel free to bookmark or print them out, just in case.
In this course we'll be using Talend Open Studio, an Eclipse-based data warehousing tool. Our version of Talend
Open Studio (TOS) is nearly identical to the version you can get for free from Talend's website. We've added a few
features to enhance your learning experience. Our plug-in allows you to view the course, program your labs, submit
your projects, and receive your grades and comments, all without leaving Talend Open Studio.
We will be using a Terminal Service session, or thin client. A thin client allows you to connect to a remote terminal
server running Talend Open Studio. The terminal server is a computer that serves a desktop to other computers
via the network. You will be using this thin client to access Talend Open Studio on our Windows servers:
Your machine will send mouse and keyboard information to our server. Our server will send the resulting visual output
back to your computer. It will seem just like you are running Talend Open Studio on your own machine, but
in reality it is being run on our server and returned to your machine. We call our system the Learning Sandbox. The
Learning Sandbox is a safe place where you can write and execute your own programs without worrying about
breaking your own computer. It also gives you the ability to work from anywhere there's an internet connection. Since
all of your work is stored on our server, there are no disks or USB drives to carry around.
In a moment you will be asked to switch windows back to your student start page and to press the Enter button for
your course:
After you follow the specific instructions for your machine, return to the student startup window.
Make sure the Open with radio button is selected. If you like, you can also check the Do this
automatically for files like this from now on checkbox. Now click OK.
Note
The file names and screen shots below may differ on your own computer. Microsoft
occasionally updates the RDC program.
Start by downloading this file
Once you've downloaded the RDC200_ALL.dmg (disk image) file, you need to locate it and open it.
Next you will see the following screen. Double-click on the box and follow the install instructions.
Click "Continue" and accept the default recommendations and buttons on each screen:
Switch back to your student start page, and press the Enter button for your course. Your browser will
download an RDP file. Save this file to your desktop.
Next, double-click the RDP file to open it. You will be asked for the USERNAME and PASSWORD for the O'Reilly
School of Technology:
If you see a warning sign, just click Connect; it's just Microsoft trying to make people buy Vista.
Initial Setup
At this point you should see the Talend splash screen:
The next screen you will see is the license acknowledgment. Click Accept:
Before we can get started with TOS we need to set up a repository connection and then a project. The
repository is where TOS will keep your objects. Click on the ... button to manage the repository connection:
Enter your email address in the repository management window to create a new repository, then click OK:
Next, create your project by choosing Create a new local project from the drop-down list and click Go!:
Name your project DBA3, set its language to Java, and click Finish:
Next, choose your new project from the drop-down and click Open:
When your project loads you'll see a screen asking if you want to register with Talend. If you are interested in
keeping up to date with Talend, enter your email address. Otherwise click Cancel:
The next step may take some time to complete. Under the hood, TOS generates Java code from your design
instructions. This process requires some work to complete--TOS does a lot of set-up to make everything
work properly. While this work is taking place, you'll see the following screen:
WARNING
Do not cancel this process. If you do, it is highly likely that TOS will not work properly.
Once this process is complete, click Start Now to begin using TOS.
The next time you log in, you won't need to go through all of these steps. TOS will start, and you'll be able to
choose the DBA3 project you created already. You'll see the "generation" screen again, and when TOS
finishes that work, you will be ready to begin yours.
TOS is highly customizable - windows, palettes, and toolbars can all be moved, resized, and closed. We've
also added two reset buttons to TOS (the red leaves at the top of your screen). The first red leaf resets TOS
for the first part of the course, and the second red leaf resets TOS for the second part of the course. If you
accidentally close some aspect of TOS, or just want to get back to the beginning, click on the red leaf to reset
TOS.
If you have not done so already, reset TOS by clicking on the first red leaf:
Your student start page is on the top. Scroll down to find the DBA 3 course, and click the Enter button to view
your syllabus:
Note
Logging Out
When you log out, you need to quit TOS instead of just closing the window. If you disconnect from your session and
then reconnect using a different computer, two copies of TOS will be running, potentially overwriting each other's files.
You can prevent that problem by closing TOS from the File -> Exit menu after you're done working.
WARNING
Again, make sure you quit TOS when you are done working so your jobs are saved correctly!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
Introduction
DBA 3: Data Warehousing Lesson 1
Course Objectives
Welcome to the third course in O'Reilly School of Technology's (OST) DBA series. The content of this course has been written
under the assumption that you have worked through the first two courses in the series, and are familiar with MySQL. If you'd like
to refresh your memory, feel free to go back over the first two courses. Then get ready to take your MySQL knowledge to the
next level!
In this course, you'll learn what makes up a data warehouse and gain an understanding of the dimensional model. Upon
completion of this course, you will be able to:
Implement the dimensional model using standard ETL processes
Demonstrate understanding of dimension, SCD and fact processing
Query relational data warehouses using standard SQL commands
Develop a complete data warehouse using Talend Open Studio
From beginning to end, you will learn by doing projects using Talend Open Studio, an Eclipse-based tool for implementing data
warehouses. You'll complete projects using Talend, developing your own complete data warehouses. The projects add to your
portfolio and will contribute to certificate completion. Besides a browser and internet connection, all software is provided online
by the O'Reilly School of Technology.
Lesson Format
We'll try out lots of examples in each lesson. We'll have you write code, look at code, and edit existing code. The code
will be presented in boxes that will indicate what needs to be done to the code inside.
Whenever you see white boxes like the one below, you'll type the contents into the editor window to try the example
yourself. The CODE TO TYPE bar on top of the white box contains directions for you to follow:
CODE TO TYPE:
White boxes like this contain code for you to try out (type into a file to run).
If you have already written some of the code, new code for you to add looks like this.
If we want you to remove existing code, the code to remove will look like this.
We may run programs and do some other activities in a terminal session in the operating system or other command-line
environment. These will be shown like this:
INTERACTIVE SESSION:
The plain black text that we present in these INTERACTIVE boxes is
provided by the system (not for you to type). The commands we want you to type look like this.
Code and information presented in a gray OBSERVE box is for you to inspect and absorb. This information is often
color-coded, and followed by text explaining the code in detail:
OBSERVE:
Gray "Observe" boxes like this contain information (usually code specifics) for you to
observe.
The paragraph(s) that follow may provide additional details on information that was highlighted in the Observe box.
We'll also set especially pertinent information apart in "Note" boxes:
Note
Notes provide information that is useful, but not absolutely necessary for performing the tasks at hand.
Tip
Tips provide information that might help make the tools easier for you to use, such as shortcut keys.
WARNING
Warnings provide information that can help prevent program crashes and data loss.
Note
If you have not read the initial course instructions Getting Started with Talend Open Studio yet, please go ahead
and do that now.
Note
Depending on the width of your monitor, the text on the tabs may be truncated. Terminal 1 is the left
terminal, and Terminal 2 is the right terminal.
If you clicked on the second red leaf, the terminals will be located lower on the screen:
Change the Connection Type to SSH, set the host to cold.useractive.com, then enter your username and
password:
The first time you connect you will see a few other warnings. Click on Yes for all of them:
You'll also be saving some of your SQL queries and documentation in text files. We'll store these in a project
accessible from TOS. To add this project, click on the Navigation tab:
You could use this view to peek at the files that TOS stores "under the hood." We'll use it to hold our text files.
Right-click in the white space under the folders, and choose New -> Project:
Note
You must name your objects exactly as specified in the lesson to allow your mentor to locate your work
and help you if you need it.
To create a new text file, right-click on Documentation and choose New -> Other...:
Make sure you select Documentation as the parent folder. Name your new file dba3_lesson1_project1.txt - you
will add to this file as you complete your first project. Click on Finish:
Save your work:
Data Warehousing
At this point in your database education, you are familiar with SQL databases and their capabilities. By far, the most
popular use for databases is the storage of operational data generated through transactions.
In the previous courses we examined the database of a DVD rental store. The database was used to keep track of
customers, the DVDs in the store's inventory, and the DVDs that were currently being rented. Tables were designed
A unified and consistent view of underlying data (even data from external systems):
In this course you'll learn everything you need to know about a data warehouse - from planning to implementation. In
the next lesson we'll take a fresh look at our operational database and start planning our warehouse. See you there!
A Data Warehouse
DBA 3: Data Warehousing Lesson 2
Facts and Dimensions
In the first lesson we discussed reasons to develop and use a Data Warehouse. The bottom line was that our video
store manager wanted answers to a few good questions:
How many new customers were added this quarter?
What is the most popular rental?
How much revenue did our East side store generate compared to our West side store?
How do sales this month compare to last month or last year?
Are movies that were popular in the theater popular rentals as well?
Which customers rent the most DVDs each month and at which store?
We can rewrite some of the questions so that they share a format we can use more readily in our queries:
How many new customers did we add by quarter?
How many times were DVDs rented, by DVD and by month?
How much sales did we do, by store and by month?
How much sales did we do by month?
How many times were DVDs rented, by month and by theater popularity?
How many times were DVDs rented, by customer, month and store?
Though the questions are slightly different than they were originally, they are now structured like analytical queries, with
facts and dimensions.
Facts
Facts are numbers, and are sometimes referred to as measures. Facts relating to sales could be "Sales in
US Dollars" and "Sales in Euros." Other facts could be "Hours of Work," or "Times Rented."
Facts have a defined grain - the level of detail. For example, "Sales in US Dollars" may be daily, or even
hourly. If you have sales data on a daily grain, you cannot display sales by hour. You can, however, combine
(aggregate) daily sales to larger grains such as weekly or monthly:
Aggregates are applied to facts in order to move to a larger grain. The most common aggregate is SUM.
Other aggregates are average (AVG), count (COUNT), maximum (MAX) and minimum (MIN). Aggregates take a set of
data and return a summary of that data.
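To see a roll-up without connecting to the course server, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The daily_sales table, its columns, and its values are made up for illustration, and amounts are stored in cents to keep the arithmetic exact. It moves a daily grain to a monthly grain with SUM, COUNT, MIN, and MAX:

```python
import sqlite3

# Hypothetical daily-grain sales fact; table name and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2008-05-24", 499),
        ("2008-05-25", 99),
        ("2008-05-25", 999),
        ("2008-06-01", 299),
    ],
)

# Roll the daily grain up to a monthly grain with several aggregates at once.
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m', sale_date) AS month,
           SUM(amount_cents), COUNT(*), MIN(amount_cents), MAX(amount_cents)
    FROM daily_sales
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for row in monthly:
    print(row)
```

Going the other way is not possible: once only monthly totals are stored, the daily detail cannot be recovered from them.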
Let's experiment with some aggregates in our SQL database. Switch to SSH mode and log into your
account. In Unix mode, use the mysql command to connect to the sakila database as the anonymous user.
When prompted for a password, press Enter. Run the following command:
CODE TO TYPE:
cold1:~$ mysql -h sql sakila -u anonymous -p
If you have entered everything correctly you will see this:
OBSERVE:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 28527
Server version: 5.0.62-log Source distribution
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql>
Let's take a look at the tables, using the show tables command. Run this command:
CODE TO TYPE:
mysql> show tables;
OBSERVE:
mysql> show tables;
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
23 rows in set (0.00 sec)
The sakila database has 23 tables and views. Let's take a closer look at the table called payment. Run the
following command:
CODE TO TYPE:
mysql> describe payment;
OBSERVE:
mysql> describe payment;
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| payment_id   | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| customer_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| staff_id     | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| rental_id    | int(11)              | YES  | MUL | NULL              |                |
| amount       | decimal(5,2)         | NO   |     | NULL              |                |
| payment_date | datetime             | NO   |     | NULL              |                |
| last_update  | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+--------------+----------------------+------+-----+-------------------+----------------+
7 rows in set (0.00 sec)
Dimensions
Dimensions are used to filter, categorize, and label facts. A fact such as "Sales in US Dollars" might have
dimensions for Date, Customer, Store, and Movie. Written in English, this might translate to something like
this:
On May 25th, Ruth Martinez rented the movie "Cabin Flash" from the West side store for $9.99.
Or, broken into its components, it looks like this:
Name                  Value
Date                  May 25th
Customer              Ruth Martinez
Movie                 Cabin Flash
Store                 West
Fact
Sales in US Dollars   $9.99
The first and most important dimension used in a warehouse is the date dimension. This dimension is often
presented in a hierarchy:
Year -> Quarter -> Month -> Day
Days can "roll up" to a month. Months can "roll up" to a quarter, and quarters "roll up" to a year. Daily sales
"roll up" to monthly sales, monthly sales "roll up" to quarterly sales, and quarterly sales "roll up" to yearly
sales:
Year, Quarter, Month, and Day are not dimensions themselves. They represent levels in the Date dimension.
Dates often have multiple uses in a warehouse. For DVD rentals, dates are recorded at least twice: once
when a movie is rented and again when the movie is returned. When the same underlying date dimension is
used for both of these purposes, the dimension is known as a role-playing dimension.
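A role-playing dimension can be sketched by joining one date table to a fact table twice, once per role. The example below is only an illustration: it uses Python's built-in sqlite3 module in place of MySQL, and the dim_date and fact_rental tables, their columns, and the sample dates are all assumptions rather than the course's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified tables; names and columns are assumptions.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month INTEGER, day INTEGER);
    CREATE TABLE fact_rental (
        rental_id INTEGER PRIMARY KEY,
        rental_date_key INTEGER REFERENCES dim_date(date_key),
        return_date_key INTEGER REFERENCES dim_date(date_key)
    );
    INSERT INTO dim_date VALUES (20080525, 5, 25), (20080602, 6, 2);
    INSERT INTO fact_rental VALUES (1, 20080525, 20080602);
    """
)

# The same dimension table is joined twice under two aliases, one per role.
row = conn.execute(
    """
    SELECT rented.month, returned.month
    FROM fact_rental f
    JOIN dim_date rented   ON f.rental_date_key = rented.date_key
    JOIN dim_date returned ON f.return_date_key = returned.date_key
    """
).fetchone()
print(row)
```

Only one physical date table exists; the aliases rented and returned give it its two roles inside the query.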
In the SQL world we specify dimensions in the GROUP BY and WHERE clauses. Let's see these clauses in
action using our examples.
First let's examine the data stored in our database that corresponds to Ruth renting "Cabin Flash" from the
West store on May 25th for $9.99. For the sake of experiment, we happen to know that this data is stored with
a payment_id=491. (Just play along for now.) Run the following command:
CODE TO TYPE:
select c.first_name, c.last_name, f.title, p.amount,
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join rental r on (p.rental_id = r.rental_id)
join inventory i on (r.inventory_id=i.inventory_id)
join film f on (i.film_id=f.film_id)
join store s on (c.store_id=s.store_id)
where p.payment_id=491;
MySQL responds with our data:
OBSERVE:
mysql> select c.first_name, c.last_name, f.title, p.amount,
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join rental r on (p.rental_id = r.rental_id)
-> join inventory i on (r.inventory_id=i.inventory_id)
-> join film f on (i.film_id=f.film_id)
-> join store s on (c.store_id=s.store_id)
-> where p.payment_id=491;
+------------+-----------+-------------+--------+-------------+--------+
| first_name | last_name | title       | amount | paymentDate | region |
+------------+-----------+-------------+--------+-------------+--------+
| RUTH       | MARTINEZ  | CABIN FLASH |   9.99 | May 25th    | West   |
+------------+-----------+-------------+--------+-------------+--------+
1 row in set (0.09 sec)
Looks good! Now let's answer our first question: How much was rented on May 25th by Ruth Martinez in the
West store? Go ahead and run this command:
CODE TO TYPE:
select c.first_name, c.last_name, p.amount,
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join store s on (c.store_id=s.store_id)
where day(p.payment_date)=25 and month(p.payment_date)=5
AND c.first_name='RUTH' and c.last_name='MARTINEZ';
The database does its job and returns the requested information:
OBSERVE:
mysql> select c.first_name, c.last_name, p.amount,
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join store s on (c.store_id=s.store_id)
-> where day(p.payment_date)=25 and month(p.payment_date)=5
-> AND c.first_name='RUTH' and c.last_name='MARTINEZ';
+------------+-----------+--------+-------------+--------+
| first_name | last_name | amount | paymentDate | region |
+------------+-----------+--------+-------------+--------+
| RUTH       | MARTINEZ  |   0.99 | May 25th    | West   |
| RUTH       | MARTINEZ  |   9.99 | May 25th    | West   |
+------------+-----------+--------+-------------+--------+
2 rows in set (0.15 sec)
This is correct, but unfortunately it isn't exactly what we're after. We actually want one row of summarized data
instead of two rows of detail data. We need to aggregate the amount fact, and make sure to GROUP BY our
dimensions. Run this command:
CODE TO TYPE:
select c.first_name, c.last_name, sum(p.amount),
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join store s on (c.store_id=s.store_id)
where day(p.payment_date)=25 and month(p.payment_date)=5
AND c.first_name='RUTH' and c.last_name='MARTINEZ'
GROUP BY c.first_name, c.last_name, paymentDate, s.region;
Excellent--now we have our desired result:
OBSERVE:
mysql> select c.first_name, c.last_name, sum(p.amount),
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join store s on (c.store_id=s.store_id)
-> where day(p.payment_date)=25 and month(p.payment_date)=5
-> AND c.first_name='RUTH' and c.last_name='MARTINEZ'
-> GROUP BY c.first_name, c.last_name, paymentDate, s.region;
+------------+-----------+---------------+-------------+--------+
| first_name | last_name | sum(p.amount) | paymentDate | region |
+------------+-----------+---------------+-------------+--------+
| RUTH       | MARTINEZ  |         10.98 | May 25th    | West   |
+------------+-----------+---------------+-------------+--------+
1 row in set (0.00 sec)
We were able to answer our question using the information stored in our current tables. So if that's the case, how is a
data warehouse different than our existing database? Read on...
The Dimensional Model
So why go to the trouble of creating a warehouse when our existing database has all the information we need? It
seems like we've just invented a few new terms for our existing data.
In the last lesson we had several good reasons for creating a warehouse, remember? Data warehouses provide:
a separate system that won't interrupt business critical operational systems.
a single point of access for all analytical queries.
a unified and consistent view of underlying data (even data from external systems).
a straightforward way to analyze trends (such as monthly sales comparisons).
Our existing database can provide answers to some of our pertinent questions, but it doesn't provide any of the
features listed above. Data warehouses do.
Note
Generally, you will create a data warehouse on a separate physical machine from your business critical
databases. For development purposes it is okay to share machines.
Star Schema
Now that we've picked our facts and dimensions, it's time to organize our data. Data warehouses are typically
organized using a star schema. Facts (measures) are stored in fact tables at the center of the star, and the
dimensions surround the measures. Facts have foreign keys (using the integer data type) to each dimension
table:
These separate diagrams might suggest our facts and dimensions are stored separately, but that's not the
case. The dimensions are shared:
You may wonder why we're using separate tables for dimensions. Couldn't we just put the movie dimensions
next to the fact in the same table? Well, we could do this, but we shouldn't for one good reason: performance.
It is safe to assume that your fact table will become very large (millions or even billions of rows). Your
dimensions may be large as well, but it is unlikely they will be nearly as large as your fact tables.
Suppose you have ten million rows of fact data and ten thousand distinct movies. Then you realize someone
entered a film into your warehouse using the name "The Dude" instead of the film's actual name, "The Big
Lebowski." Updating every fact row to correct that mistake could take a very long time. Even a simple query for
"The Big Lebowski" could cause the database great pain; text is much more difficult to index and search than
integers.
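The benefit of keeping the movie title in its own dimension table can be sketched on a tiny scale. This is an illustration only, run through Python's built-in sqlite3 module rather than MySQL; the dim_movie and fact_sales tables and their rows are hypothetical, with amounts stored in cents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical miniature star: one dimension table and one fact table.
    CREATE TABLE dim_movie (movie_key INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE fact_sales (
        movie_key INTEGER REFERENCES dim_movie(movie_key),
        amount_cents INTEGER
    );
    INSERT INTO dim_movie VALUES (1, 'The Dude');
    INSERT INTO fact_sales VALUES (1, 999), (1, 99), (1, 499);
    """
)

# Correcting the title touches one dimension row, however many fact rows exist.
cur = conn.execute(
    "UPDATE dim_movie SET title = 'The Big Lebowski' WHERE title = 'The Dude'"
)
rows_changed = cur.rowcount

# Fact rows are found through the integer key, then labeled with the fixed title.
total = conn.execute(
    """
    SELECT m.title, SUM(f.amount_cents)
    FROM fact_sales f
    JOIN dim_movie m ON f.movie_key = m.movie_key
    GROUP BY m.title
    """
).fetchone()
print(rows_changed, total)
```

With ten million fact rows the update would still change a single row, because the fact table stores only the integer movie_key.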
Well, we've covered a lot in this lesson. In the next lesson we'll begin to implement our fact and dimension tables. See
you there!
Type
date
This setup is a good start, but is it the best way to help our end users? If you recall the sample questions the users
gave us, several wanted to see results by month. So how would you extract information about a month from a date
type? You could use a function like month(), but it's probably not reasonable to expect end users to use that function.
A better solution would be to pre-calculate and pre-populate the most important date attributes as required by the
users. The best way to determine what's most important is to ask the users what they need. So let's suppose we did ask
them, and used the information they supplied to come up with this structure:
Column       Type
date_key     integer
date         date
year         smallint
quarter      tinyint
month        tinyint
day          tinyint
week         tinyint
is_weekend   boolean
is_holiday   boolean
We kept the date column because it can be used to calculate attributes that didn't make it into the table. Normally we
would use an auto_increment column for the key, but here we are not going to. It's much more convenient to
make the key a coded format such as yyyyMMDD. With this format, a value of 20080101 would represent January 1st,
2008.
We included two additional columns: is_weekend and is_holiday. These would be useful if we wanted to compare
weekend sales or holiday sales to weekday sales. We keep the number of data types required for our columns to a
minimum by consulting MySQL's documentation.
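As a rough illustration of how these attributes could be pre-calculated from a single date, here is a small Python sketch. The function name date_attributes is hypothetical, the week number follows the ISO convention (an assumption; the course may number weeks differently), and is_holiday is left out because it needs a holiday calendar.

```python
from datetime import date

def date_attributes(d):
    """Derive the dimDate columns for one calendar date (is_holiday omitted)."""
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # yyyyMMDD as an integer
        "date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,   # months 1-3 -> 1, 4-6 -> 2, and so on
        "month": d.month,
        "day": d.day,
        "week": d.isocalendar()[1],          # ISO week number; an assumed convention
        "is_weekend": d.weekday() >= 5,      # Monday is 0, so Saturday = 5, Sunday = 6
    }

row = date_attributes(date(2008, 1, 1))
print(row["date_key"])
```

Because the key is just the date written as digits, sorting by date_key also sorts by date, and a key such as 20080101 can be read without a lookup.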
Let's go ahead and implement this table. (We'll populate it with data in a future lesson.) Switch to the terminal mode,
and log into your account. Once logged in, connect to your own MySQL database. Be sure to replace both occurrences
of username with your own user name. Type in the code below at the UNIX prompt:
CODE TO TYPE:
cold1:~$ mysql -h sql -p -u username username
Next, create the dimDate table. Run this co mmand:
CODE TO TYPE:
CREATE TABLE dimDate
(
date_key integer NOT NULL,
date date NOT NULL,
year smallint NOT NULL,
quarter tinyint NOT NULL,
month tinyint NOT NULL,
day tinyint NOT NULL,
week tinyint NOT NULL,
is_weekend boolean,
is_holiday boolean,
PRIMARY KEY(date_key)
);
Type 0 SCD
The most basic SCD isn't really a change at all. If you do absolutely nothing to handle a changing dimension,
that dimension is Type 0. In English, a type 0 translates to, "Don't do anything when this value changes."
Type 1 SCD
A Type 1 SCD is often the easiest way to accommodate changing dimensions. In this type, rows in the
dimension tables are updated when changes occur. Suppose Mary Smith gets married in April and changes
her last name to Jones. (She'll keep the same email address for now.)
MARY   SMITH   Email   City
MARY   JONES   Email   City
Some changes are less important than others. Name changes are not always important to business users.
For their purposes, it's irrelevant whether Mary Jones used to be known as Mary Smith. But suppose Mary
Smith moves from one city to another in July. A Type 1 customer SCD would simply update the existing row
for Mary Smith, forgetting the previous city. Now a user would be unable to see sales trends according to
customer and city, because all historical data concerning Mary prior to July would now be associated with the
new city.
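A Type 1 change amounts to an in-place UPDATE. The sketch below is a simplified illustration using Python's built-in sqlite3 module in place of MySQL. The dim_customer table and its columns are assumptions, and the city names are example values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical customer dimension with only the columns needed here.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT, city TEXT
    );
    INSERT INTO dim_customer VALUES (1, 'MARY', 'SMITH', 'Sasebo');
    """
)

# Type 1: overwrite in place; the old values are gone afterwards.
conn.execute("UPDATE dim_customer SET last_name = 'JONES' WHERE customer_key = 1")
conn.execute("UPDATE dim_customer SET city = 'Bellevue' WHERE customer_key = 1")

rows = conn.execute("SELECT * FROM dim_customer").fetchall()
print(rows)
```

Nothing in the table records that Mary was ever SMITH or ever lived in Sasebo, which is exactly the loss of history described above.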
Type 2 SCD
Type 1 isn't the best way to handle all slowly changing dimensions though. Another method to track changes
in dimensions is to create a new row in the dimension table when each change occurs, and then use begin
and end dates to specify the valid time period for a row.
The database row for Mary Smith would initially look like this:
Customer Key  First Name  Last Name  Email  City    Start Date   End Date
1             MARY        SMITH             Sasebo  01-JAN-2008  01-JAN-2099
Now suppose Mary Smith gets married in April and becomes Mary Jones. Her dimension time line would
look like this:
First Name  Last Name  Email  City    Start Date   End Date
MARY        SMITH             Sasebo  01-JAN-2008  01-APR-2008
MARY        JONES             Sasebo  01-APR-2008  01-JAN-2099
Now let's say she moves from Sasebo to Bellevue in July. Her dimension time line would look like this:
First Name  Last Name  Email  City      Start Date   End Date
MARY        SMITH             Sasebo    01-JAN-2008  01-APR-2008
MARY        JONES             Sasebo    01-APR-2008  01-JUL-2008
MARY        JONES             Bellevue  01-JUL-2008  01-JAN-2099
In each of the two tables that reflect Mary's new circumstances, there is one "current" row that has 01-JAN-2099
for an End Date.
Note
Instead of using 01-JAN-2099 for an end date, some warehouses use NULL, but usually it's
better to use a real date instead of NULL, because real dates can make better use of indexes.
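The Type 2 steps (close the current row, then add a new current row) can be sketched as follows. This is a minimal illustration, not the course's ETL job: it uses Python's built-in sqlite3 module instead of MySQL, the apply_type2_change helper and the customer_id business key are hypothetical, and dates are written as ISO strings.

```python
import sqlite3

OPEN_END = "2099-01-01"  # sentinel end date that marks the current row

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER,           -- assumed business key from the source system
        last_name TEXT, city TEXT,
        start_date TEXT, end_date TEXT
    );
    INSERT INTO dim_customer (customer_id, last_name, city, start_date, end_date)
    VALUES (1, 'SMITH', 'Sasebo', '2008-01-01', '2099-01-01');
    """
)

def apply_type2_change(conn, customer_id, change_date, **new_values):
    """Close the current row and insert a new current row carrying the changes."""
    current = conn.execute(
        "SELECT last_name, city FROM dim_customer "
        "WHERE customer_id = ? AND end_date = ?",
        (customer_id, OPEN_END),
    ).fetchone()
    merged = {"last_name": current[0], "city": current[1], **new_values}
    conn.execute(
        "UPDATE dim_customer SET end_date = ? WHERE customer_id = ? AND end_date = ?",
        (change_date, customer_id, OPEN_END),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, last_name, city, start_date, end_date) "
        "VALUES (?, ?, ?, ?, ?)",
        (customer_id, merged["last_name"], merged["city"], change_date, OPEN_END),
    )

apply_type2_change(conn, 1, "2008-04-01", last_name="JONES")
apply_type2_change(conn, 1, "2008-07-01", city="Bellevue")

history = conn.execute(
    "SELECT last_name, city, start_date, end_date FROM dim_customer ORDER BY customer_key"
).fetchall()
for row in history:
    print(row)
```

Each change produces a new surrogate customer_key, so facts loaded before July keep pointing at the Sasebo rows while later facts point at the Bellevue row.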
Type 3 SCD
Type 2 slowly changing dimensions (SCDs) allow unlimited changes, but this might be excessive for some
types of changes. For example, when a postal code is changed, even though it's a fairly minor change and
doesn't happen that often, it would still need to be tracked in the database. In this case, we would choose to
use the Type 3 SCD method.
Suppose Mary Smith in Sasebo has her postal code changed from 35200 to 35201. The change would look
like this:
First Name  Last Name  Email  City    Current Postal Code  Previous Postal Code
MARY        SMITH             Sasebo  35200

First Name  Last Name  Email  City    Current Postal Code  Previous Postal Code
MARY        SMITH             Sasebo  35201                35200
The table may or may not have an "Effective Date" column to explain when the postal code changed.
Type 3 SCDs work well for changes that happen infrequently; however, this type fails to capture multiple
changes.
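To see why Type 3 cannot capture multiple changes, consider the following small sketch. It is an illustration only: it runs on Python's built-in sqlite3 module instead of MySQL, and the table customer_type3_demo, the helper functions, and the second postal code 35202 are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_type3_demo ("
    " customer_id INTEGER PRIMARY KEY,"
    " last_name TEXT,"
    " current_postal_code TEXT,"
    " previous_postal_code TEXT)"
)
conn.execute("INSERT INTO customer_type3_demo VALUES (1, 'SMITH', '35200', NULL)")


def change_postal_code(conn, customer_id, new_code):
    """Type 3 update: move the current value to 'previous', then overwrite it."""
    # previous_postal_code is assigned first on purpose, so the statement also
    # behaves the same way on engines that evaluate SET clauses left to right.
    conn.execute(
        "UPDATE customer_type3_demo "
        "SET previous_postal_code = current_postal_code, "
        "    current_postal_code = ? "
        "WHERE customer_id = ?",
        (new_code, customer_id),
    )


def postal_codes(conn, customer_id):
    """Return (current_postal_code, previous_postal_code) for one customer."""
    return conn.execute(
        "SELECT current_postal_code, previous_postal_code "
        "FROM customer_type3_demo WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()


change_postal_code(conn, 1, "35201")
type3_after_first = postal_codes(conn, 1)
change_postal_code(conn, 1, "35202")  # hypothetical second change
type3_after_second = postal_codes(conn, 1)

print(type3_after_first)
print(type3_after_second)
```

After the second change the original value 35200 is gone, because there is only one "previous" column to hold history.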
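Type 4 SCD
A Type 4 SCD is fairly straightforward: the dimension table always contains up-to-date information, and changes
are recorded in a separate history table. This keeps the dimension table itself simple, but may cause confusion
because users must keep in mind that historical data is stored in a separate location.
For example, suppose Mary moves from Sasebo to Bellevue on July 15. The change would look like this:
Dimension table:
+------------+-----------+-------+----------+
| First Name | Last Name | Email | City     |
+------------+-----------+-------+----------+
| MARY       | SMITH     |       | Bellevue |
+------------+-----------+-------+----------+
History table:
+------------+-----------+-------+--------+-------------+
| First Name | Last Name | Email | City   | Change Date |
+------------+-----------+-------+--------+-------------+
| MARY       | SMITH     |       | Sasebo | July 15     |
+------------+-----------+-------+--------+-------------+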
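A Type 4 change can be sketched as two statements: copy the outgoing row to the history table, then overwrite the dimension row. The example below is only a sketch using Python's built-in sqlite3 module rather than MySQL. The table names customer_current_demo and customer_history_demo, the helper move_customer, and the year 2008 in the change date are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_current_demo (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    city        TEXT
);
CREATE TABLE customer_history_demo (
    customer_id INTEGER,
    first_name  TEXT,
    last_name   TEXT,
    city        TEXT,
    change_date TEXT
);
INSERT INTO customer_current_demo VALUES (1, 'MARY', 'SMITH', 'Sasebo');
""")


def move_customer(conn, customer_id, new_city, change_date):
    """Type 4 change: archive the old version, then update the dimension row."""
    # Copy the outgoing version into the history table, stamped with the change date.
    conn.execute(
        "INSERT INTO customer_history_demo "
        "SELECT customer_id, first_name, last_name, city, ? "
        "FROM customer_current_demo WHERE customer_id = ?",
        (change_date, customer_id),
    )
    # The dimension table itself only ever holds the up-to-date value.
    conn.execute(
        "UPDATE customer_current_demo SET city = ? WHERE customer_id = ?",
        (new_city, customer_id),
    )


move_customer(conn, 1, "Bellevue", "2008-07-15")  # the year is assumed

type4_current = conn.execute("SELECT * FROM customer_current_demo").fetchall()
type4_history = conn.execute("SELECT * FROM customer_history_demo").fetchall()
print(type4_current)
print(type4_history)
```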
In practice, Type 1 and Type 2 are the most widely used ways to deal with slowly changing dimensions.
Rows do not have to consist entirely of a single SCD type. For example, for many data warehouses, the time
that a customer name change takes place is not significant, and the change is posted to that record on the fly. In that
case, the name columns would be of Type 1. Customer addresses are usually more important, so those columns
would be of Type 2. It's perfectly fine to handle changes in this way.
Once we're connected, we're able to see the structure of the customer table. Run the following command against the
sakila database:
CODE TO TYPE:
describe customer;
As long as you have typed everything correctly, and are connected to the sakila database (not your personal database),
you'll see this:
OBSERVE:
mysql> describe customer;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| customer_id | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| store_id    | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| first_name  | varchar(45)          | NO   |     | NULL              |                |
| last_name   | varchar(45)          | NO   | MUL | NULL              |                |
| email       | varchar(50)          | YES  |     | NULL              |                |
| address_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| active      | tinyint(1)           | NO   |     | 1                 |                |
| create_date | datetime             | NO   |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
9 rows in set (0.00 sec)
The table has a lot of information. Observe that it contains a column called address_id. This indicates that the
address information is stored in a different table. Let's take a look at the address table. Now run the following
command against the sakila database:
CODE TO TYPE:
describe address;
You'll see these results:
OBSERVE:
mysql> describe address;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| address_id  | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| address     | varchar(50)          | NO   |     | NULL              |                |
| address2    | varchar(50)          | YES  |     | NULL              |                |
| district    | varchar(20)          | NO   |     | NULL              |                |
| city_id     | smallint(5) unsigned | NO   | MUL | NULL              |                |
| postal_code | varchar(10)          | YES  |     | NULL              |                |
| phone       | varchar(20)          | NO   |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
8 rows in set (0.00 sec)
See the column city_id? It is a foreign key to the city table. Let's take a look at that table. Run the following command
against the sakila database:
CODE TO TYPE:
describe city;
You'll see the following structure:
OBSERVE:
mysql> describe city;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| city_id     | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| city        | varchar(50)          | NO   |     | NULL              |                |
| country_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
4 rows in set (0.00 sec)
It looks like this table references yet another table, using country_id. Let's take a look at that table as well. Run
the following command against the sakila database:
CODE TO TYPE:
describe country;
CODE TO TYPE:
CREATE TABLE dimCustomer
(
  customer_key int NOT NULL AUTO_INCREMENT,
  customer_id  smallint(5) unsigned NOT NULL,
  first_name   varchar(45) NOT NULL,
  last_name    varchar(45) NOT NULL,
  email        varchar(50),
  address      varchar(50) NOT NULL,
  address2     varchar(50),
  district     varchar(20) NOT NULL,
  city         varchar(50) NOT NULL,
  country      varchar(50) NOT NULL,
  postal_code  varchar(10),
  phone        varchar(20) NOT NULL,
  active       tinyint(1) NOT NULL,
  create_date  datetime NOT NULL,
  start_date   date NOT NULL,
  end_date     date NOT NULL,
  PRIMARY KEY (customer_key)
);
Execute the query. If everything went okay, you will see this: Query OK, 0 rows affected.
Snowflake Schemas
For our customer dimension, we've taken four tables and collapsed them into one table. Why did we do this?
Simplicity.
One of the goals of a data warehouse is to create a simple structure that users can query easily. Multiple
tables mean multiple joins, and added complexity. Here we traded disk space for simplicity.
We can also work in the opposite direction, using more complex schemas when our purpose calls for that.
Addresses are a good example, because they form a hierarchy: countries have states (or regions), and states have cities. Some
business users may be interested in seeing sales data by country, whereas others may be interested in
viewing sales data by state or by city. One way to deal with this hierarchy is with a snowflake schema.
In a snowflake schema you split a dimension into one "primary" dimension table and one or more snowflake
tables. It looks like this:
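As a rough illustration of the idea, a snowflaked customer dimension might be laid out as in the following sketch. It is not the layout used in this course: it runs on Python's built-in sqlite3 module instead of MySQL, and the table names (dim_customer_demo, dim_city_demo, dim_country_demo) and sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Outermost snowflake table: one row per country.
CREATE TABLE dim_country_demo (
    country_key INTEGER PRIMARY KEY,
    country     TEXT NOT NULL
);
-- Snowflake table: each city points at its country.
CREATE TABLE dim_city_demo (
    city_key    INTEGER PRIMARY KEY,
    city        TEXT NOT NULL,
    country_key INTEGER NOT NULL REFERENCES dim_country_demo(country_key)
);
-- "Primary" dimension table: each customer points at a city.
CREATE TABLE dim_customer_demo (
    customer_key INTEGER PRIMARY KEY,
    last_name    TEXT NOT NULL,
    city_key     INTEGER NOT NULL REFERENCES dim_city_demo(city_key)
);
INSERT INTO dim_country_demo VALUES (1, 'Japan'), (2, 'United States');
INSERT INTO dim_city_demo VALUES (1, 'Sasebo', 1), (2, 'Bellevue', 2);
INSERT INTO dim_customer_demo VALUES (1, 'SMITH', 1), (2, 'JONES', 2), (3, 'GRAY', 2);
""")

# Rolling customers up to the country level now takes two joins.
customers_by_country = conn.execute("""
    SELECT co.country, COUNT(*)
    FROM dim_customer_demo c
    JOIN dim_city_demo ci ON c.city_key = ci.city_key
    JOIN dim_country_demo co ON ci.country_key = co.country_key
    GROUP BY co.country
    ORDER BY co.country
""").fetchall()
print(customers_by_country)
```

The hierarchy is stored only once, at the cost of the extra joins that the collapsed dimCustomer table avoids.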
Snowflake schemas are also an effective way to handle a different type of problem. Suppose our DVD store
starts to rent DVDs over the internet. Our store now has two types of customers: Internet customers and In
Store customers. We know very little about the In Store customers; perhaps we only know their telephone
numbers and home addresses. By comparison we know a lot about our Internet customers; we might have
their email addresses, telephone numbers, physical addresses, movie preferences, and the number of times
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
OBSERVE:
mysql> describe film;
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
| Field                | Type                                                                | Null | Key | Default           | Extra          |
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
| film_id              | smallint(5) unsigned                                                | NO   | PRI | NULL              | auto_increment |
| title                | varchar(255)                                                        | NO   | MUL | NULL              |                |
| description          | text                                                                | YES  |     | NULL              |                |
| release_year         | year(4)                                                             | YES  |     | NULL              |                |
| language_id          | tinyint(3) unsigned                                                 | NO   | MUL | NULL              |                |
| original_language_id | tinyint(3) unsigned                                                 | YES  | MUL | NULL              |                |
| rental_duration      | tinyint(3) unsigned                                                 | NO   |     | 3                 |                |
| rental_rate          | decimal(4,2)                                                        | NO   |     | 4.99              |                |
| length               | smallint(5) unsigned                                                | YES  |     | NULL              |                |
| replacement_cost     | decimal(5,2)                                                        | NO   |     | 19.99             |                |
| rating               | enum('G','PG','PG-13','R','NC-17')                                  | YES  |     | G                 |                |
| special_features     | set('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') | YES  |     | NULL              |                |
| last_update          | timestamp                                                           | NO   |     | CURRENT_TIMESTAMP |                |
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
13 rows in set (0.01 sec)
This table has two decimal columns: rental_rate and replacement_cost. These quantities might become facts
stored in our data warehouse that allow us to answer questions like, "What is our profit (amount of rental income,
minus film cost) for each movie?" But since that and similar questions are outside the scope of our current project,
we'll omit those facts from our data warehouse and free up some space.
It looks like we have two foreign keys: language_id and original_language_id. Both point to the language table.
Take a look. Run the following command against the sakila database:
CODE TO TYPE:
describe language;
Execute the line to see the structure of language.
OBSERVE:
mysql> describe language;
+-------------+---------------------+------+-----+-------------------+----------------+
| Field       | Type                | Null | Key | Default           | Extra          |
+-------------+---------------------+------+-----+-------------------+----------------+
| language_id | tinyint(3) unsigned | NO   | PRI | NULL              | auto_increment |
| name        | char(20)            | NO   |     | NULL              |                |
| last_update | timestamp           | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+---------------------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)
We'll consolidate these tables into dimMovie. Changes to movies are fairly infrequent, so we'll implement
a Type 1 slowly changing dimension. Switch to terminal mode, and log into your account. Once you're logged in,
connect to your own MySQL database. Be sure to replace both occurrences of username with your own user name. Then
type the following at the UNIX prompt:
CODE TO TYPE:
cold1:~$ mysql -h sql -p -u username username
Next, run the statement below against your personal database in order to create the dimMovie table:
CODE TO TYPE:
CREATE TABLE dimMovie
(
  movie_key         int NOT NULL AUTO_INCREMENT,
  film_id           smallint(5) unsigned NOT NULL,
  title             varchar(255) NOT NULL,
  description       text,
  release_year      year(4),
  language          varchar(20) NOT NULL,
  original_language varchar(20),
  rental_duration   tinyint(3) unsigned NOT NULL,
  length            smallint(5) unsigned NOT NULL,
  rating            varchar(5) NOT NULL,
  special_features  varchar(60) NOT NULL,
  PRIMARY KEY (movie_key)
);
OBSERVE:
mysql> describe store;
+------------------+----------------------+------+-----+-------------------+----------------+
| Field            | Type                 | Null | Key | Default           | Extra          |
+------------------+----------------------+------+-----+-------------------+----------------+
| store_id         | tinyint(3) unsigned  | NO   | PRI | NULL              | auto_increment |
| manager_staff_id | tinyint(3) unsigned  | NO   | UNI | NULL              |                |
| address_id       | smallint(5) unsigned | NO   | MUL | NULL              |                |
| last_update      | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
| region           | varchar(10)          | YES  |     | NULL              |                |
+------------------+----------------------+------+-----+-------------------+----------------+
5 rows in set (0.00 sec)
Our version of the sakila database is slightly different from the version distributed by MySQL. Our version includes a
region column. Our table also includes an address_id column. (Feel free to refer back to the previous lesson if you
want to go over the address table structure again.)
The next interesting aspect of this table is the manager_staff_id column. This column is a foreign key to staff. Let's
take a look at that table now. Run the following command against the sakila database:
CODE TO TYPE:
describe staff;
Execute the line to see the structure of staff.
OBSERVE:
mysql> describe staff;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| staff_id    | tinyint(3) unsigned  | NO   | PRI | NULL              | auto_increment |
| first_name  | varchar(45)          | NO   |     | NULL              |                |
| last_name   | varchar(45)          | NO   |     | NULL              |                |
| address_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| picture     | blob                 | YES  |     | NULL              |                |
| email       | varchar(50)          | YES  |     | NULL              |                |
| store_id    | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| active      | tinyint(1)           | NO   |     | 1                 |                |
| username    | varchar(16)          | NO   |     | NULL              |                |
| password    | varchar(40)          | YES  |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
11 rows in set (0.00 sec)
We'll merge the staff table into a single dimStore dimension, and omit many of the columns from staff, such as
picture, email, address, username, and password. Since stores may change managers, we'll make our dimension a
Type 2 SCD so we can track management changes accurately over time. That will require two additional columns:
start_date and end_date. Feel free to review the Type 2 SCD section in the third lesson if you like.
Switch terminals so that you're using your personal database. Now let's create our dimension! Run the command
below against your personal database:
CODE TO TYPE:
CREATE TABLE dimStore
(
  store_key          int NOT NULL AUTO_INCREMENT,
  store_id           smallint(5) unsigned NOT NULL,
  address            varchar(50) NOT NULL,
  address2           varchar(50),
  district           varchar(20) NOT NULL,
  city               varchar(50) NOT NULL,
  country            varchar(50) NOT NULL,
  postal_code        varchar(10),
  region             varchar(10),
  manager_first_name varchar(45) NOT NULL,
  manager_last_name  varchar(45) NOT NULL,
  start_date         date NOT NULL,
  end_date           date NOT NULL,
  PRIMARY KEY (store_key)
);
So long as you see the familiar Query OK, 0 rows affected, you're all set.
Creating Facts
Now that our dimensions have been created, we can implement our facts. Fact tables are fairly straightforward; they
contain foreign keys to the relevant dimension tables, and a single column for the fact value.
Let's get started!
Sales
Our sales data will come from the payment table in the sakila database. Let's take a look. Switch back to the
sakila database and run this command:
CODE TO TYPE:
describe payment;
Execute the line to see the structure of payment:
OBSERVE:
mysql> describe payment;
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| payment_id   | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| customer_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| staff_id     | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| rental_id    | int(11)              | YES  | MUL | NULL              |                |
| amount       | decimal(5,2)         | NO   |     | NULL              |                |
| payment_date | datetime             | NO   |     | NULL              |                |
| last_update  | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+--------------+----------------------+------+-----+-------------------+----------------+
7 rows in set (0.00 sec)
We'll pay particular attention to the amount column. It will be the basis for our factSales table.
Note
Make sure to review the sources of your facts, so you don't implement the wrong data type.
Switch back to your personal database. Let's create our fact. Run the command below against your personal
database:
CODE TO TYPE:
CREATE TABLE factSales
(
  sales_key    INT NOT NULL AUTO_INCREMENT,
  date_key     INT NOT NULL,
  customer_key INT NOT NULL,
  movie_key    INT NOT NULL,
  store_key    INT NOT NULL,
  sales_amount decimal(5,2) NOT NULL,
  FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
  FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
  FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),
  FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
  PRIMARY KEY (sales_key)
);
Once again, if everything went according to plan, you'll see Query OK, 0 rows affected.
A single row in factSales will represent the amount of sales for a specific date, for a specific customer, for a
specific movie, at a specific store.
You might think that the primary key should be a composite key across all foreign keys to the dimensions.
After all, these columns should uniquely identify a fact row, right? But the problem with that type of primary key
is that it tends to be very wide. To start, create a primary key on the surrogate key alone - sales_key. This will
give you the most flexibility when evaluating future indexing strategies.
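To get a feel for how a fact table is used, here is a small sketch that joins a fact table to one of its dimensions and totals the sales. It is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL, it leaves out the date and customer dimensions for brevity, and the *_demo table names, store cities, and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_movie_demo (movie_key INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE dim_store_demo (store_key INTEGER PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE fact_sales_demo (
    sales_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key only
    movie_key    INTEGER NOT NULL REFERENCES dim_movie_demo(movie_key),
    store_key    INTEGER NOT NULL REFERENCES dim_store_demo(store_key),
    sales_amount NUMERIC NOT NULL
);
INSERT INTO dim_movie_demo VALUES (1, 'ACADEMY DINOSAUR'), (2, 'ACE GOLDFINGER');
INSERT INTO dim_store_demo VALUES (1, 'Sasebo'), (2, 'Bellevue');
-- Two rows share the same movie and store; the surrogate key keeps them distinct.
INSERT INTO fact_sales_demo (movie_key, store_key, sales_amount)
VALUES (1, 1, 4.50), (1, 1, 2.50), (2, 2, 0.75);
""")

# A typical star-schema query: join the fact to a dimension, then aggregate.
sales_by_title = conn.execute("""
    SELECT m.title, SUM(f.sales_amount)
    FROM fact_sales_demo f
    JOIN dim_movie_demo m ON f.movie_key = m.movie_key
    GROUP BY m.title
    ORDER BY m.title
""").fetchall()
print(sales_by_title)
```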
CustomerCount
Now we'll implement our factCustomerCount. The factCustomerCount is a tally of the number of
customers who created accounts with our store. This table does not have a foreign key to dimMovie because
the number of customers isn't relative to any particular movie.
We'll examine the source for this data in a future lesson. For now, let's create the fact. Make sure you are
using your personal database. Review the following CREATE TABLE statement:
OBSERVE:
CREATE TABLE factCustomerCount
(
  customerCount_key INT NOT NULL AUTO_INCREMENT,
  date_key          INT NOT NULL,
  customer_key      INT NOT NULL,
  store_key         INT NOT NULL,
  customer_count    INT NOT NULL,
  FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
  FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
  FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
  PRIMARY KEY (customerCount_key)
);
A single row in this table represents a specific customer who created an account on a specific day, at a
specific store.
Before you execute the command, take a closer look at the customer_count measure. What values might it
have?
Since customer_key points to exactly one customer, customer_count will always have the
value of 1.
Since customer_count will always be 1, we could omit the column from the table. However, we
will leave it in our table since it will make it easier for business users to query the table.
Since factCustomerCount doesn't have any "real" facts, it is known as a factless fact. Strictly speaking, a factless
fact table has no measure columns, only foreign keys to dimensions; our customer_count column is a constant
kept for convenience. Factless facts are good at storing events.
Let's create the table. This time we'll specify a default value of 1 on customer_count. Run this command
against your personal database:
CODE TO TYPE:
CREATE TABLE factCustomerCount
(
  customerCount_key INT NOT NULL AUTO_INCREMENT,
  date_key          INT NOT NULL,
  customer_key      INT NOT NULL,
  store_key         INT NOT NULL,
  customer_count    INT NOT NULL DEFAULT 1,
  FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
  FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
  FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
  PRIMARY KEY (customerCount_key)
);
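The convenience of keeping a constant customer_count column shows up when the table is queried. The sketch below is an illustration only: it runs on Python's built-in sqlite3 module rather than MySQL, omits the foreign key constraints, and assumes a hypothetical table fact_customer_count_demo with integer date keys in YYYYMMDD form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_customer_count_demo ("
    " customer_count_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " date_key INTEGER NOT NULL,"
    " customer_key INTEGER NOT NULL,"
    " store_key INTEGER NOT NULL,"
    " customer_count INTEGER NOT NULL DEFAULT 1)"
)
# The measure is never supplied; the column default fills in 1 for every event.
conn.executemany(
    "INSERT INTO fact_customer_count_demo (date_key, customer_key, store_key) "
    "VALUES (?, ?, ?)",
    [(20080101, 11, 1), (20080101, 12, 1), (20080102, 13, 2)],
)

# SUM(customer_count) reads like any other measure and matches COUNT(*).
new_customers_per_day = conn.execute(
    "SELECT date_key, SUM(customer_count), COUNT(*) "
    "FROM fact_customer_count_demo "
    "GROUP BY date_key ORDER BY date_key"
).fetchall()
print(new_customers_per_day)
```

Business users can total customer_count the same way they total sales_amount, without having to know that each row is a single event.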
RentalCount
Our final fact is factRentalCount. It's similar to factCustomerCount in that it is also a factless fact. As
such, we'll also specify a default value for the rental_count column. (We'll populate this table in a future
lesson.) Run this command against your personal database:
CODE TO TYPE:
CREATE TABLE factRentalCount
(
  rentalCount_key INT NOT NULL AUTO_INCREMENT,
  date_key        INT NOT NULL,
  customer_key    INT NOT NULL,
  movie_key       INT NOT NULL,
  store_key       INT NOT NULL,
  rental_count    INT NOT NULL DEFAULT 1,
  FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
  FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
  FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),
  FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
  PRIMARY KEY (rentalCount_key)
);
What is ETL?
ETL is an acronym for Extract, Transform, and Load. It's the process that takes data from one or more source
systems, transforms and cleanses that data, and loads the result into the data warehouse.
You might be thinking "This sounds simple! We learned about exports and imports in the last course!"
And actually, we did learn about the E and L in the last course, but we didn't cover any transformations. An example of
a transformation would be converting codes like "D" within a source system to a word like "Deleted" within a
destination.
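A transformation like this can be as simple as a lookup. The following sketch is plain Python and only an illustration: the "D" to "Deleted" mapping comes from the example above, while the "A" code, the "Unknown" fallback, and the names STATUS_CODES and transform_status are assumptions made up for the example.

```python
# Lookup table from source-system codes to warehouse-friendly words.
# Only "D" -> "Deleted" comes from the text; "A" is a hypothetical extra code.
STATUS_CODES = {"A": "Active", "D": "Deleted"}


def transform_status(code):
    """Translate a source-system code into the word stored in the warehouse."""
    # Unrecognized codes are mapped to a visible placeholder instead of failing.
    return STATUS_CODES.get(code, "Unknown")


# Extract: rows as they might arrive from a source system (id, status code).
source_rows = [(1, "A"), (2, "D"), (3, "X")]

# Transform: replace each code with its word before loading.
transformed_rows = [(row_id, transform_status(code)) for row_id, code in source_rows]
print(transformed_rows)
```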
And we still haven't learned how to automate data loading, or handle failures automatically and gracefully. Failures
will happen, but when you're doing bulk exports and loads they're not too difficult to handle. If an export fails due
to a disk space issue, you free up some space and try again. If an import fails, you figure out what went wrong and try
again.
Failure may not be an option with ETL. Data warehouses hold a lot of data, and most of that data is processed on a
daily (or even hourly) basis and is extremely time sensitive. If you miss a day of processing, you may lose data.
The only way to handle a large volume of complex data is to have an automated ETL process.
Next we'll create the etlLog table, which will be used to log messages and statistics. Many of these columns are TOS-specific (that is, specific to Talend Open Studio; we'll explain more about Talend in the next lesson). We will see them again
when we implement logging in a later lesson. Some of this information won't be useful for every warehouse; it is up to
you to decide the amount and type of logging you need. Run the following command against your personal database:
CODE TO TYPE:
CREATE TABLE etlLog
(
run_id integer NOT NULL,
moment datetime NOT NULL,
pid varchar(20),
father_pid varchar(20),
root_pid varchar(20),
system_pid double,
project varchar(50),
job varchar(50),
job_repository_id varchar(255),
job_version varchar(255),
context varchar(50),
priority int,
origin varchar(255),
message_type varchar(255),
message varchar(255),
code int,
duration double,
count int,
reference int,
thresholds varchar(255),
key(run_id)
);
CODE TO TYPE:
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
Note
ETL processes themselves are typically broken into three parts:
1. Initial housekeeping, such as creating a "run" or clearing temp files and tables.
2. Extracting, transforming, and loading data.
3. Final housekeeping, such as ending a "run," sending email, or clearing temp files and tables.
To do the initial housekeeping we will use a stored procedure called etl_StartRun. This procedure will be used to
populate the etlRuns table and return the run_id to be used in all ETL processes. It will return the same run_id each
time it is called, until the corresponding "final housekeeping" procedure etl_EndRun is called. Run this command
against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_StartRun()
BEGIN
DECLARE current_run_id INTEGER;
SELECT max(run_id) into current_run_id
FROM etlRuns
WHERE end_time IS NULL;
IF current_run_id IS NULL THEN
BEGIN
INSERT INTO etlRuns (start_time) VALUES (now());
SELECT LAST_INSERT_ID() into current_run_id;
END;
END IF;
SELECT 'run_id' as "key", current_run_id as value;
END;
//
DELIMITER ;
With that procedure out of the way, we can think about the last part of the process: a procedure to perform final
housekeeping. Run this command against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_EndRun()
BEGIN
UPDATE etlRuns SET end_time=now() where end_time IS NULL;
END;
//
DELIMITER ;
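The way these two procedures work together can be mimicked in a short sketch. This is not the stored procedures themselves: it uses Python's built-in sqlite3 module instead of MySQL, the functions start_run and end_run are stand-ins for etl_StartRun and etl_EndRun, and the etlRuns column types are assumptions (only the column names run_id, start_time, and end_time appear in the procedures above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of etlRuns; only the column names come from the procedures.
conn.execute(
    "CREATE TABLE etlRuns ("
    " run_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " start_time TEXT NOT NULL,"
    " end_time TEXT)"
)


def start_run(conn):
    """Return the run_id of the open run, creating a new run if none is open."""
    open_run = conn.execute(
        "SELECT max(run_id) FROM etlRuns WHERE end_time IS NULL"
    ).fetchone()[0]
    if open_run is not None:
        return open_run
    cursor = conn.execute(
        "INSERT INTO etlRuns (start_time) VALUES (datetime('now'))"
    )
    return cursor.lastrowid


def end_run(conn):
    """Close every open run by stamping its end_time."""
    conn.execute(
        "UPDATE etlRuns SET end_time = datetime('now') WHERE end_time IS NULL"
    )


first_run_id = start_run(conn)
repeat_run_id = start_run(conn)  # same run is still open, so same id
end_run(conn)
next_run_id = start_run(conn)    # previous run was closed, so a new id
print(first_run_id, repeat_run_id, next_run_id)
```

Every job that calls the start procedure during one ETL cycle gets the same run_id, which is what lets log entries from different jobs be tied to a single run.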
Note
Most companies have data scattered across many different systems, databases, and files. We'll keep
things simple for this course by restricting our data sources. No matter where your data originates,
the process for getting it into the data warehouse is the same.
So, how will we document our transformation and mapping rules? We'll use the easiest and most useful
method available. This might be a Word document in some situations, or a spreadsheet in others. For this course
we'll just use plain text documents to describe our transformations.
dimDate
Our date dimension doesn't really have a source other than a calendar. So how do we come up with the
data? Programmers will commonly use one of these methods:
Create a program to populate the date table.
Create a spreadsheet with date data in it.
Copy the date dimension from an existing data warehouse.
Suppose one of the business users is handy with Excel, and has offered to create a spreadsheet for you. The
spreadsheet will already contain all of the required information, including holidays and weekends. In this
case, most of the work is done for us. We only need to load the data (which we'll do in a later lesson).
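For comparison, the first method in the list (a program that populates the date table) might look something like the sketch below. It is only an illustration in plain Python: the function build_date_rows and the column names (date_key, date, year, month, day_of_week, is_weekend) are assumptions rather than the actual layout of dimDate, and holidays are left out because they depend on the business calendar.

```python
from datetime import date, timedelta


def build_date_rows(start, end):
    """Build one dictionary per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20080101
            "date": current.isoformat(),
            "year": current.year,
            "month": current.month,
            "day_of_week": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "is_weekend": current.weekday() >= 5,  # Saturday or Sunday
        })
        current += timedelta(days=1)
    return rows


# Generate the first week of January 2008 as a small sample.
date_rows = build_date_rows(date(2008, 1, 1), date(2008, 1, 7))
for row in date_rows:
    print(row)
```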
dimCustomer
Let's take a closer look at dimCustomer. Back in lesson three we discovered that a customer record is stored
in several tables in the sakila database: customer, address, city and country. We're not planning on
doing any transformations on the data, but suppose a business user informs us that rows in the customer
table where customer_id <= 10 are actually test accounts that should be excluded from the data
warehouse.
Let's write the query we need to extract the data from the customer table. Switch to the second terminal, and
log into the sakila database. Run this command against the sakila database:
CODE TO TYPE:
SELECT
c.customer_id, c.first_name, c.last_name, c.email,
a.address, a.address2, a.district,
ci.city,
co.country,
postal_code,
a.phone,c.active, c.create_date
FROM customer c
JOIN address a on (c.address_id = a.address_id)
JOIN city ci on (a.city_id = ci.city_id)
JOIN country co on (ci.country_id = co.country_id)
WHERE customer_id > 10;
If you are connected to the sakila database and typed everything correctly, you'll see lots of results:
OBSERVE:
mysql> SELECT
-> c.customer_id, c.first_name, c.last_name, c.email,
-> a.address, a.address2, a.district,
-> ci.city,
-> co.country,
-> postal_code,
-> a.phone,c.active, c.create_date
-> FROM customer c
-> JOIN address a on (c.address_id = a.address_id)
-> JOIN city ci on (a.city_id = ci.city_id)
-> JOIN country co on (ci.country_id = co.country_id)
-> WHERE customer_id > 10;
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
| customer_id | first_name | last_name   | email                                  | address                     | address2 | district   | city     | country        | postal_code | phone        | active | create_date         |
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
|         218 | VERA       | MCCOY       | VERA.MCCOY@sakilacustomer.org          | 1168 Najafabad Parkway      |          | Kabol      | Kabul    | Afghanistan    | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 |
|         441 | MARIO      | CHEATHAM    | MARIO.CHEATHAM@sakilacustomer.org      | 1924 Shimonoseki Drive      |          | Batna      | Batna    | Algeria        | 52625       | 406784385440 |      1 | 2004-10-07 00:00:00 |
|          69 | JUDY       | GRAY        | JUDY.GRAY@sakilacustomer.org           | 1031 Daugavpils Parkway     |          | Bchar      | Bchar    | Algeria        | 59025       | 107137400143 |      1 | 2004-02-25 00:00:00 |
|         176 | JUNE       | CARROLL     | JUNE.CARROLL@sakilacustomer.org        | 757 Rustenburg Avenue       |          | Skikda     | Skikda   | Algeria        | 89668       | 506134035434 |      1 | 2004-08-11 00:00:00 |
|         320 | ANTHONY    | SCHWAB      | ANTHONY.SCHWAB@sakilacustomer.org      | 1892 Nabereznyje Telny Lane |          | Tutuila    | Tafuna   | American Samoa | 28396       | 478229987054 |      1 | 2004-07-20 00:00:00 |
|         528 | CLAUDE     | HERZOG      | CLAUDE.HERZOG@sakilacustomer.org       | 486 Ondo Parkway            |          | Benguela   | Benguela | Angola         | 35202       | 105882218332 |      1 | 2004-01-24 00:00:00 |
...lines omitted...
|         303 | WILLIAM    | SATTERFIELD | WILLIAM.SATTERFIELD@sakilacustomer.org | 687 Alessandria Parkway     |          | Sanaa      | Sanaa    | Yemen          | 57587       | 407218522294 |      1 | 2004-04-22 00:00:00 |
|         213 | GINA       | WILLIAMSON  | GINA.WILLIAMSON@sakilacustomer.org     | 1001 Miyakonojo Lane        |          | Taizz      | Taizz    | Yemen          | 67924       | 584316724815 |      1 | 2004-08-02 00:00:00 |
|         553 | MAX        | PITT        | MAX.PITT@sakilacustomer.org            | 1917 Kumbakonam Parkway     |          | Vojvodina  | Novi Sad | Yugoslavia     | 11892       | 698182547686 |      1 | 2004-02-09 00:00:00 |
|         438 | BARRY      | LOVELACE    | BARRY.LOVELACE@sakilacustomer.org      | 1836 Korla Parkway          |          | Copperbelt | Kitwe    | Zambia         | 55405       | 689681677428 |      1 | 2004-09-24 00:00:00 |
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
589 rows in set (0.04 sec)
It looks like this is a good query to use to extract customer information. Save this query - we will use it in a
future lesson.
dimMovie
The next table we will populate is dimMovie. Data for this table comes from two tables: film and
language. We will have to join on language twice, however, since the film table joins to language on
language_id and original_language_id.
Let's write the query needed to extract the data from the film table. Run this command against the
sakila database:
CODE TO TYPE:
SELECT f.film_id, f.title, f.description, f.release_year,
l.name as language, orig_lang.name as original_language,
f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id=l.language_id)
JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
Try executing the query. If you typed everything correctly, you will see the following:
OBSERVE:
mysql> SELECT f.film_id, f.title, f.description, f.release_year,
-> l.name as language, orig_lang.name as original_language,
-> f.rental_duration, f.length, f.rating, f.special_features
-> FROM film f
-> JOIN language l on (f.language_id=l.language_id)
-> JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
Empty set (0.01 sec)
What happened to the data? We don't have a WHERE clause, so that can't be the problem. But we do have two
joins. Let's write another query to find out which join is failing us. Run this command against the sakila
database:
CODE TO TYPE:
SELECT count(distinct language_id), count(distinct original_language_id)
FROM film f;
Run the query, and observe the results:
OBSERVE:
mysql> SELECT count(distinct language_id), count(distinct original_language_id)
-> FROM film f;
+-----------------------------+--------------------------------------+
| count(distinct language_id) | count(distinct original_language_id) |
+-----------------------------+--------------------------------------+
|                           1 |                                    0 |
+-----------------------------+--------------------------------------+
1 row in set (0.00 sec)
It looks like we don't have any films that have been translated: original_language_id is NULL in every row, so the
second join matches nothing. Perhaps this is a feature in progress, or an old
feature that has since been abandoned. Whatever the reason, we will need to alter our SELECT query to use a
LEFT JOIN instead of a normal join. Run this command against the sakila database:
CODE TO TYPE:
SELECT f.film_id, f.title, f.description, f.release_year,
l.name as language, orig_lang.name as original_language,
f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id=l.language_id)
LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
As long as you typed everything correctly, you will see lots of results:
OBSERVE:
mysql> SELECT f.film_id, f.title, f.description, f.release_year,
-> l.name as language, orig_lang.name as original_language,
-> f.rental_duration, f.length, f.rating, f.special_features
-> FROM film f
-> JOIN language l on (f.language_id=l.language_id)
-> LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
| film_id | title             | description                                                                                          | release_year | language | original_language | rental_duration | length | rating | special_features                        |
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
|       1 | ACADEMY DINOSAUR  | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies     |         2006 | English  | NULL              |               6 |     86 | PG     | Deleted Scenes,Behind the Scenes        |
|       2 | ACE GOLDFINGER    | A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China |         2006 | English  | NULL              |               3 |     48 | G      | Trailers,Deleted Scenes                 |
|       3 | ADAPTATION HOLES  | A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory     |         2006 | English  | NULL              |               7 |     50 | NC-17  | Trailers,Deleted Scenes                 |
...lines omitted...
|     998 | ZHIVAGO CORE      | A Fateful Yarn of a Composer And a Man who must Face a Boy in The Canadian Rockies                   |         2006 | English  | NULL              |               6 |    105 | NC-17  | Deleted Scenes                          |
|     999 | ZOOLANDER FICTION | A Fateful Reflection of a Waitress And a Boat who must Discover a Sumo Wrestler in Ancient China     |         2006 | English  | NULL              |               5 |    101 | R      | Trailers,Deleted Scenes                 |
|    1000 | ZORRO ARK         | A Intrepid Panorama of a Mad Scientist And a Boy who must Redeem a Boy in A Monastery                |         2006 | English  | NULL              |               3 |     50 | NC-17  | Trailers,Commentaries,Behind the Scenes |
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
1000 rows in set (0.02 sec)
This looks great!
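To see why the first query returned an empty set while the second returned every film, here is a minimal Java sketch of the two join behaviors. It is only an illustration, not how MySQL implements joins: the Film class, the sample rows, and the method names are hypothetical stand-ins for the sakila tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical stand-in for one film row; originalLanguageId may be null.
    static final class Film {
        final String title;
        final Integer originalLanguageId;

        Film(String title, Integer originalLanguageId) {
            this.title = title;
            this.originalLanguageId = originalLanguageId;
        }
    }

    // Looks up a language name; a null key never matches, as in SQL.
    static String lookup(Map<Integer, String> languages, Integer id) {
        return id == null ? null : languages.get(id);
    }

    // Inner join: keep a film only when its key finds a matching language.
    static List<String> innerJoin(List<Film> films, Map<Integer, String> languages) {
        List<String> rows = new ArrayList<String>();
        for (Film f : films) {
            String name = lookup(languages, f.originalLanguageId);
            if (name != null) {
                rows.add(f.title + " | " + name);
            }
        }
        return rows;
    }

    // Left join: keep every film; an unmatched key is reported as NULL.
    static List<String> leftJoin(List<Film> films, Map<Integer, String> languages) {
        List<String> rows = new ArrayList<String>();
        for (Film f : films) {
            String name = lookup(languages, f.originalLanguageId);
            rows.add(f.title + " | " + (name == null ? "NULL" : name));
        }
        return rows;
    }

    // Two sample films whose original language is NULL, mirroring the query output.
    static List<Film> sampleFilms() {
        return Arrays.asList(
            new Film("ACADEMY DINOSAUR", null),
            new Film("ACE GOLDFINGER", null));
    }

    // A one-row stand-in for the language table.
    static Map<Integer, String> sampleLanguages() {
        Map<Integer, String> languages = new HashMap<Integer, String>();
        languages.put(1, "English");
        return languages;
    }

    public static void main(String[] args) {
        System.out.println(innerJoin(sampleFilms(), sampleLanguages()));
        System.out.println(leftJoin(sampleFilms(), sampleLanguages()));
    }
}
```

The inner join drops both sample films because neither has an original language to match, while the left join keeps them and fills the missing name with NULL.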
dimStore
The last table we'll work on populating is dimStore. Data for this table comes from many tables: store, staff, address, city, and country. Run this command against the sakila database:
CODE TO TYPE:
SELECT s.store_id, a.address, a.address2, a.district,
c.city, co.country, a.postal_code, s.region,
st.first_name as manager_first_name,
st.last_name as manager_last_name
FROM
store s
JOIN staff st on (s.manager_staff_id = st.staff_id)
JOIN address a on (s.address_id = a.address_id)
JOIN city c on (a.city_id = c.city_id)
JOIN country co on (c.country_id = co.country_id);
Run the query, and observe the results:
OBSERVE:
mysql> SELECT s.store_id, a.address, a.address2, a.district,
-> c.city, co.country, a.postal_code, s.region,
-> st.first_name as manager_first_name,
-> st.last_name as manager_last_name
-> FROM
-> store s
-> JOIN staff st on (s.manager_staff_id = st.staff_id)
-> JOIN address a on (s.address_id = a.address_id)
-> JOIN city c on (a.city_id = c.city_id)
-> JOIN country co on (c.country_id = co.country_id)
-> ;
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
| store_id | address            | address2 | district | city       | country   | postal_code | region | manager_first_name | manager_last_name |
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
|        1 | 47 MySakila Drive  | NULL     | Alberta  | Lethbridge | Canada    |             | West   | Mike               | Hillyer           |
|        2 | 28 MySQL Boulevard | NULL     | QLD      | Woodridge  | Australia |             | East   | Jon                | Stephens          |
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
2 rows in set (0.00 sec)
This looks great too!
Great job so far. In the next lesson we'll practice writing an ETL job. See you then!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
ETL tools keep track of schemas: definitions of the columns in a data flow, including their data types and sizes. The schema also tracks which columns are allowed to be null, and which columns are part of the key.
Schemas are an important abstraction in the ETL world. They allow us to specify the makeup of our data once, and use that specification in many different components. (We'll learn more about schemas soon.)
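As a rough illustration of what a schema carries, here is a minimal Java sketch. It is not how TOS stores schemas internally; the Column class, the customerSchema example, and the accepts check are all hypothetical, chosen only to show name, type, size, nullability, and key membership defined once and reused.

```java
import java.util.Arrays;
import java.util.List;

class SchemaSketch {
    // One column definition: name, data type, size, nullability, key membership.
    static final class Column {
        final String name;
        final String type;
        final int length;
        final boolean nullable;
        final boolean key;

        Column(String name, String type, int length, boolean nullable, boolean key) {
            this.name = name;
            this.type = type;
            this.length = length;
            this.nullable = nullable;
            this.key = key;
        }
    }

    // Hypothetical two-column schema, defined once and reusable by many components.
    static List<Column> customerSchema() {
        return Arrays.asList(
            new Column("customer_id", "Integer", 10, false, true),
            new Column("first_name", "String", 45, true, false));
    }

    // Checks one row of raw values against the schema's count, null, and size rules.
    static boolean accepts(List<Column> schema, String[] row) {
        if (row.length != schema.size()) {
            return false;
        }
        for (int i = 0; i < row.length; i++) {
            Column column = schema.get(i);
            if (row[i] == null) {
                if (!column.nullable) {
                    return false;
                }
            } else if (row[i].length() > column.length) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accepts(customerSchema(), new String[] {"1", "MARY"}));
        System.out.println(accepts(customerSchema(), new String[] {null, "MARY"}));
    }
}
```

The second row is rejected because customer_id is declared as not nullable, which is the kind of rule a schema lets every component enforce consistently.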
What does the future hold? Nobody can say for sure, but it is looking like we will see faster, easier to use, and more reliable ETL solutions that let us focus on interesting problems instead of mundane connections and transformations.
Let's create a sample job to see what we can do.
Note
Feel free to resize any widget on your screen. You can always get back to the default perspective by clicking on the red leaf.
On the left side of the screen you will see a tab called "Repository." The text may be truncated to "Rep," depending on the width of your screen.
The repository is where TOS stores all objects related to your project. They are:
Business Models - diagrams to document processes or flows.
Job Designs - implemented processes or flows.
Contexts - sets of variables or values that are shared across several jobs.
Code - bits of Java code shared across several jobs.
SQL Patterns - templates of SQL code that can be used as a basis for queries in jobs.
Metadata - data about your data - database connections, file layouts, and descriptions of database tables and query results.
Documentation - storage for word documents, spreadsheets and other items created outside of TOS.
Recycle bin - last stop for trash, just like the recycle bin in Windows.
To simplify things, for this course we won't use Business Models, Code, SQL Patterns, or Documentation.
If everything went okay, you will see a blank "canvas" for Job ETL Demo 0.1 and a new Palette on the lower left:
Right now your ETL job is blank, so it doesn't do anything. We need to add a data source. On the Palette, click File to expand that category, then click Input.
Note
Click tFileInputDelimited once to select, then move your mouse over the canvas. Click the canvas to drop the tFileInputDelimited widget.
So, what's with that red circle with the X through it? Drag your mouse over that circle, and you'll see this:
The warning and error occur because we haven't set any properties on the tFileInputDelimited widget. Let's do that now. Click once in the middle of the tFileInputDelimited widget, then switch to the Component tab at the bottom of the screen:
Now you'll see the basic aspects of tFileInputDelimited that you can modify. We'll need to change the file to point to a sample CSV input. Change the File Name so it looks like this:
CODE TO TYPE:
"C:/talend_files/in/csv/customer1.csv"
WARNING
Make sure you type forward slashes ( / ) instead of the usual backslashes ( \ ). Under the hood, TOS is using Java to run your transformation; in a Java string, a backslash starts an escape sequence for a special character.
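The following short Java sketch shows why the warning matters. The class and method names are made up for illustration; the string literals are ordinary Java. Note how a single backslash before the "t" of talend_files silently becomes a tab character.

```java
class PathSketch {
    // Forward slashes need no escaping in a Java string literal.
    static String withForwardSlashes() {
        return "C:/talend_files/in/csv/customer1.csv";
    }

    // To get a real backslash, each one must be doubled in the literal.
    static String withEscapedBackslashes() {
        return "C:\\talend_files\\in\\csv\\customer1.csv";
    }

    // "\t" is the tab escape, so this literal does NOT contain "\talend_files".
    static String withAccidentalEscape() {
        return "C:\talend_files";
    }

    public static void main(String[] args) {
        System.out.println(withForwardSlashes());
        System.out.println(withEscapedBackslashes());
        // The third character is a tab, not a backslash.
        System.out.println(withAccidentalEscape().charAt(2) == '\t');
    }
}
```

Other sequences, such as a backslash followed by "i", are not valid escapes at all and would stop the code from compiling, which is why forward slashes are the simpler choice.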
Next, we want TOS to skip over the header row in the file. To do this, change the 0 next to Header to a 1. A value of 1 tells TOS to skip one row at the beginning of the file.
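The effect of the Header setting can be pictured with a small Java sketch. This is not the code TOS generates; the read method, the semicolon delimiter, and the sample text are assumptions made for illustration, and the real contents of customer1.csv are not shown in this lesson.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class HeaderSkipSketch {
    // Splits delimited text into rows, skipping the first `header` lines.
    static List<String[]> read(String text, int header, String delimiter) {
        List<String[]> rows = new ArrayList<String[]>();
        String[] lines = text.split("\n");
        for (int i = header; i < lines.length; i++) {
            // Pattern.quote treats the delimiter literally; -1 keeps empty trailing fields.
            rows.add(lines[i].split(Pattern.quote(delimiter), -1));
        }
        return rows;
    }

    // Hypothetical sample input: one header line followed by two data lines.
    static String sampleText() {
        return "id;name\n1;MARY\n2;PATRICIA";
    }

    public static void main(String[] args) {
        // With header = 1 only the two data rows remain.
        for (String[] row : read(sampleText(), 1, ";")) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

With the header count left at 0, the line "id;name" would be treated as data, which is the problem the setting avoids.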
Now that we've specified the input file, we need to specify the schema (structure) of the input file. We do this by clicking the button named "..." next to Edit Schema. You may have to scroll the component panel to see the Schema.
Note
Read through the next set of instructions before trying them. TOS uses many modal windows (windows that are always on top of other windows), so you won't be able to scroll in this lesson unless you close the modal window.
Click OK to save your changes. The red circle with the X is gone now, replaced by a warning sign. The warning still exists because we don't have a destination for our data.
Note
Transformations are not always necessary. Sometimes there isn't anything to do other than read data from one place and place it somewhere else.
We don't really care where the data ends up, since we are just doing a little test. Instead of putting the data in a database somewhere and then querying the database or putting the data in a different text file, let's use the tLogRow widget to display rows on the console.
Click the File group to collapse it, then click the Logs & Errors group. Click once on tLogRow and drag it to the canvas:
Now both components have warnings. What's the problem? Well, we haven't made any connections between the source of data and the data destination.
To make a connection, right-click on the data source and choose Row -> Main:
Note
If you make a mistake, you can always select a component and hit the delete key to remove it. You can also right-click on a component and choose Delete.
Note
Your canvas doesn't have to look exactly the same as our image here. The layout is strictly informational. But the connections are important because they define how data flows through the job.
To run the job, click once on the canvas to make sure it is selected, then click the little green "Play" button at the top of the screen:
You'll see some messages and activity, then a whole bunch of data scrolling on the console:
Congratulations! You've completed your first ETL job! In the next lesson we'll write our first real ETL job - it will import an Excel spreadsheet for our date dimension. See you there!
Job Structure
Jobs in TOS can be as large or small as you want them to be. Your initial jobs can then execute subsequent jobs. We will use this basic structure for our jobs:
Since dimensions give context to facts, we must process dimensions before we process facts.
This structure makes it possible for us to process the various components of the entire data warehouse separately: we might only process dimensions, only process facts, or only process a single dimension or fact. Flexibility like this is great when you're developing a data warehouse. Why bother to process the entire warehouse when you are tracking down an issue with a single dimension, right?
Some data warehouse tasks (like loading our date dimension) occur only once. We'll create a job for each of those types of tasks, but they will not execute as part of our normal warehouse processing.
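One way to picture this nesting is as a small tree of jobs, sketched below in Java. This is only a model of the idea, not how TOS runs jobs; the Job class and the parent job names are hypothetical, while the leaf names follow the warehouse tables used in this course.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JobStructureSketch {
    // A job either does its own work (a leaf) or runs its child jobs in order.
    static final class Job {
        final String name;
        final List<Job> children;

        Job(String name, Job... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        // Records each leaf job's name in the order it would execute.
        void run(List<String> log) {
            if (children.isEmpty()) {
                log.add(name);
                return;
            }
            for (Job child : children) {
                child.run(log);
            }
        }
    }

    static Job processDimensions() {
        return new Job("ProcessDimensions",
            new Job("dimCustomer"), new Job("dimMovie"), new Job("dimStore"));
    }

    static Job processFacts() {
        return new Job("ProcessFacts",
            new Job("factCustomerCount"), new Job("factRentalCount"), new Job("factSales"));
    }

    // The top-level job lists dimensions first, so they always run before facts.
    static Job dailyUpdate() {
        return new Job("DailyWarehouseUpdate", processDimensions(), processFacts());
    }

    // Runs a job and returns the execution order of its leaf jobs.
    static List<String> execute(Job job) {
        List<String> log = new ArrayList<String>();
        job.run(log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(execute(dailyUpdate()));   // the whole warehouse
        System.out.println(execute(processFacts()));  // facts only
    }
}
```

Because every level is itself a job, any subtree can be run on its own, which is the flexibility described above. The one-time date dimension load is deliberately left out of the tree.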
Your component may be named tFileInputExcel_1, or tFileInputExcel_2. Its unique name is generated automatically by TOS, and isn't really useful to us. We'll rename it so it makes more sense. Click in the middle of the tFileInputExcel component you just dropped on the canvas, then switch to the Component tab:
Now you might see the Basic settings sub tab. To change the name of the component, change to the View sub tab:
The component's current name is set to __UNIQUE_NAME__. Click in the label format box, and change the text to a more meaningful value: Date Spreadsheet.
Now we can switch back to the Basic settings sub tab. The current file name points to an invalid spreadsheet. You can click inside the text box to type in the correct location, or click the button next to it, and change the file name to:
CODE TO TYPE:
C:/talend_files/DateDimension.xls
Note
If you type in the file name, be sure to use forward slashes instead of backslashes.
Excel spreadsheets are also known as workbooks, and workbooks contain sheets. Suppose your coworker tells you that the data for the date dimension is located on a sheet called Sheet1. Scroll down through the basic settings until you see the Sheet list. Find the button beside the list and click it. Change the default text to "Sheet1". (Don't forget to use those double quotation marks):
Your spreadsheet also contains a header row. We need to tell TOS about this header row, otherwise it would try to import it as data. Scroll down farther until you see the Header and Footer section. Change the Header text box to 1:
Our component is still in error because we haven't specified the schema of our Excel file. Typically, we would specify the schema in the Metadata section of the repository, but this spreadsheet is only going to be used in this single job, so in this instance, we'll keep the schema within our component. Our coworker has already provided us with the schema definition:
Column        Type
date          Date
is_weekend    Boolean
is_holiday    Boolean
year          Integer
quarter       Character
month         Integer
week_in_year  Integer
day_in_week   Integer
To edit the schema for our spreadsheet, scroll to the bottom of the component window, and click the Edit schema button.
This schema defines the layout of the data in the Excel spreadsheet. It is important to match the column data types and order precisely. Bad things can happen if we get either one wrong. For instance, a mismatched data type could confuse TOS so that it wouldn't know whether to interpret the value "2008" as a month or a year. That kind of confusion has the potential to wreak all sorts of havoc on our work.
When you're done, your schema should look like this:
Click "OK" to close the window. TOS puts an asterisk * next to the file name when your job has changes that have not been saved. When you make changes to your job, get in the habit of saving them. You can save your file in one of two ways: use the Save command on the File menu or click the floppy disk icon on the toolbar:
Let's test our component to make sure it's working properly. We can test it using the tLogRow component. In the Palette, click to expand the Logs & Errors tab, then click and drag tLogRow to your canvas.
Link the Date Spreadsheet component to the tLogRow component by right-clicking on your Date Spreadsheet component, and choosing Row -> Main. Drop the link on tLogRow:
Note
Run your job by clicking the Play button. If everything is set up correctly, you'll see output that looks like this:
OBSERVE:
Starting job dimDate at 16:36 14/10/2008.
01-01-2000|true|false|2000|1|1|1|7
02-01-2000|true|false|2000|1|1|2|1
03-01-2000|false|false|2000|1|1|2|2
04-01-2000|false|false|2000|1|1|2|3
... lines omitted ...
27-12-2050|false|false|2050|4|12|53|3
28-12-2050|false|false|2050|4|12|53|4
29-12-2050|false|false|2050|4|12|53|5
30-12-2050|true|false|2050|4|12|53|6
31-12-2050|true|false|2050|4|12|53|7
Job dimDate ended at 16:36 14/10/2008. [exit code=0]
We are off to a great start!
Note
If you accidentally delete the wrong component, just choose "Undo" from the Edit menu.
Next, expand the Palette, and click on Processing. Scroll down until you find the tMap component. Add it to your canvas, and feel free to rearrange the other components:
Rename tMap_1 to something more meaningful - change it to Add Columns. Next link Date Spreadsheet to Add Columns. Before we link Add Columns to our logging component, let's add our new columns. Click once on Add Columns, then change the tab to Component. The window will look like this:
Note
The tMap window is another modal window - so you won't be able to scroll course content while you're working with tMap. Make sure you click OK to close the tMap window so your progress is retained, and save your job often!
To edit our transformation, click on Basic settings and then click on the button that opens the map editor. You'll probably want to expand the new window to fill your entire screen.
The editor for tMap has three distinct areas: Inputs, Variables, and Outputs:
Each flow into tMap shows up in the Input section. You must have at least one input.
The Variables section lets you set or modify variables, which is useful when you want to create counters.
The Output section lets you define the way input rows are passed on to the next component. You may have multiple outputs.
For our current dimension, we won't use variables, only inputs and outputs. Click the add button in the Output section:
Now we need to link our input columns to output columns. Click and hold on the date input column:
Now drag the column over to the dimDate output that was just created:
Repeat these steps for all of the remaining columns except day_in_week -- we are not using that column. When you're done, you'll see this:
Next, let's add a new column for date_key. Primary keys in data warehouses are often implemented using auto-increment columns, but it's much more convenient to have date_key in a coded format, such as yyyyMMdd. Using this format, a value of 20080101 would represent January 1st, 2008.
Name this column date_key, change its type to int, and uncheck the Nullable checkbox. Eventually the order of columns in our output must match our actual dimDate table, so we might as well move the date_key column to the very top now. We can do this using the up and down arrows to the right of the add button:
Because our TOS project is based on Java, expressions are also written in Java. This gives us lots of power. You can use Java string functions to create some very powerful expressions. The bottom half of the expression window is a catalog of some common expressions that you can use. We'll use a function from the TalendDate category to convert our date to "yyyyMMdd" format, then convert that string into an integer, ready for the database. (Don't worry if you're not quite an expert using Java - we'll provide you with the expressions you need for this course. If you're interested in learning more about Java, check out the Java Certificate Series.)
In the expression builder, type in this code:
CODE TO TYPE:
Integer.parseInt(
TalendDate.formatDate("yyyyMMdd",row1.date)
)
Click "OK" to save your expression. Next, edit the expression for the day column. Type the code below into the expression builder:
CODE TO TYPE:
row1.date.getDate()
Next, set the expression for month_name. Type the code below into the expression builder:
CODE TO TYPE:
TalendDate.formatDate("MMMM",row1.date)
Finally, set the expression for day_name. Type the code below into the expression builder:
CODE TO TYPE:
TalendDate.formatDate("EEEE",row1.date)
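If you'd like to try these four expressions outside of TOS, the sketch below approximates them in plain Java. It assumes that TalendDate.formatDate(pattern, date) behaves like the JDK's SimpleDateFormat with the same pattern letters; the format helper here is a stand-in written for this example, not the Talend routine itself, and the English locale is fixed so the names are predictable.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class DateKeySketch {
    // Stand-in for TalendDate.formatDate(pattern, date), built on SimpleDateFormat.
    static String format(String pattern, Date date) {
        return new SimpleDateFormat(pattern, Locale.ENGLISH).format(date);
    }

    // date_key: the date coded as an integer in yyyyMMdd form.
    static int dateKey(Date date) {
        return Integer.parseInt(format("yyyyMMdd", date));
    }

    // day: the day of the month (the course expression uses Date.getDate()).
    static int dayOfMonth(Date date) {
        return Integer.parseInt(format("d", date));
    }

    // month_name: the full month name.
    static String monthName(Date date) {
        return format("MMMM", date);
    }

    // day_name: the full weekday name.
    static String dayName(Date date) {
        return format("EEEE", date);
    }

    // January 1st, 2008, the example date used in the lesson text.
    static Date newYear2008() {
        return new GregorianCalendar(2008, Calendar.JANUARY, 1).getTime();
    }

    public static void main(String[] args) {
        Date date = newYear2008();
        System.out.println(dateKey(date));     // 20080101
        System.out.println(dayOfMonth(date));  // 1
        System.out.println(monthName(date));   // January
        System.out.println(dayName(date));     // Tuesday
    }
}
```

All four output columns are derived from the single date input, which matches the way the map editor draws several arrows out of one input column.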
We've used the date input column to come up with several output columns. Graphically, TOS displays this with multiple arrows from date going to different rows in the output:
With our columns complete, we are free to close the map editor. Click "OK" and then save your job.
We are nearly there! Link Add Columns to tLogRow_1.
Hey, it looks like we have a problem. There's that little red circle with an X in it:
So far, we've read data from our Excel spreadsheet and added several new columns to the data flow. The next step is to deposit our data into the dimDate table.
We'll be connecting to MySQL often, so it would be good to keep MySQL connection information in one place. We can do this using the Metadata section of our repository.
To create a connection, click to expand the Metadata section of the repository:
Spaces are not allowed in connection names, so give your connection this name: DataWarehouse. If you want to, you can leave the purpose and description fields blank, keep version set at 0.1, and leave status unselected. Those fields are just additional metadata for your connection:
Click the Check button to try to connect to your database. If your connection is good, you'll see the message "DataWarehouse" connection successful. Click Finish to save your connection.
Note
If you edit your database connection, TOS will ask you if you want to propagate the modifications to all jobs. Choose yes - you want your changes to apply everywhere.
Since we are storing our database connection information in the repository, we should also store our table schema in the repository. Right-click on your database connection, and choose Retrieve Schema:
We don't need to filter our schema, so click Next > at the bottom of the window:
In the next window, scroll through your database until you come across the objects for this course:
dimCustomer
dimDate
dimMovie
dimStore
etlLog
etlRuns
factCustomerCount
factRentalCount
factSales
Click Next >. In this final screen you can review the schema and make any necessary changes. We don't need to change anything, so click Finish:
Now the repository will show the tables associated with our connection:
With our connection set up, we can swap out tLogRow_1 with a MySQL output. First, delete tLogRow_1 from your canvas. Next, expand the Databases section of the Palette, then expand the MySQL sub-section:
Drag tMysqlOutput to your canvas, and link it from the Add Columns component. Your canvas will now look something like this:
Next, select tMysqlOutput_1, and switch to the Component tab. Change the name of tMysqlOutput_1 to dimDate Table.
Now switch back to the Basic settings tab. By default, TOS assumes you are going to save the database connection information inside of the component. This is known as a Built-In property type.
Our connection information is stored in the repository, so change the Property Type from Built-In to Repository. We only have one database connection in the repository, and TOS selects it for us automatically:
We're almost done. Next, we'll tell our output connection where and how to place the data. Scroll down the Basic settings tab to see the remaining properties.
Set the Table to dimDate. We don't want duplicate rows in our table, and we want to reload our table completely each time we execute this job, so set Action on table to Clear Table. This essentially executes a DELETE FROM dimDate; before inserting data into dimDate. We want to insert data (without deleting or updating anything), so set Action on data to Insert. Finally, if there is any problem, we want to stop the job immediately, so check the box next to Die on error:
Save your job. Now you're ready to run it! It might take a minute or two to read from the Excel spreadsheet and transfer everything to your database. When the job is complete, you'll see this output:
OBSERVE:
Starting job dimDate at 13:29 16/10/2008.
Job dimDate ended at 13:31 16/10/2008. [exit code=0]
We can double-check by running a quick query. Switch back to the terminal, and run this command:
CODE TO TYPE:
SELECT * from dimDate
LIMIT 0, 10;
If you see a red X, one of your components contains an error. Hover over the component and TOS should let you know what the problem is. If your schemas differ between your components, click on the Sync Columns button of your last component:
TOS is telling us that it was unable to interpret the value for 01-Jan-2000. Chances are, the input schema is incorrect - either the columns are out of order or a data type is incorrect. Check your schema again.
If you don't see any results from the query, other than 0 rows in set (0.00 sec) - double-check your output schema in tMap. Your columns may be out of order or you might have an incorrect data type.
If you are unable to find the source of a problem, you can always contact your mentor at learn@oreillyschool.com.
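One way an "unable to interpret" error like the one above can arise is a mismatch between a value and the format the schema expects. The Java sketch below reproduces that kind of failure with the JDK's SimpleDateFormat; it is an assumption that TOS fails in a comparable way, and the parses helper and the patterns tried are chosen only for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class DateParseSketch {
    // Returns true when `text` can be parsed as a date using `pattern`.
    static boolean parses(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setLenient(false); // reject out-of-range values instead of rolling them over
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A numeric month pattern cannot read the month name "Jan".
        System.out.println(parses("dd-MM-yyyy", "01-Jan-2000"));
        // A pattern with an abbreviated month name can.
        System.out.println(parses("dd-MMM-yyyy", "01-Jan-2000"));
    }
}
```

When a job stops on a value like this, comparing the failing value against each column's declared type and pattern, in column order, is usually the quickest check.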
You've accomplished a lot in this lesson - you extracted an Excel spreadsheet, transformed the columns in the spreadsheet, and loaded the data into the warehouse: ETL! In the next lesson we'll press on with our ETL and the remaining dimensions. See you then!
Loading dimMovie
Job Structure
Before we dive into our movie dimension, let's review our job structure:
Our job for dimMovie is a sub-job, executed by Process Dimensions, which is a sub-job executed by Daily Warehouse Update. This structure allows us a great deal of flexibility; we can reload the entire data warehouse by running a single job, or we can reload the facts alone by running a single job, or reload only a single fact.
Let's set up this structure. Create a new job:
Name this job ProcessDataWarehouse. (Remember spaces are not allowed in job names.)
The repository also lets us store generic schemas, which are schemas not associated with specific connections. Let's create a generic schema that defines the columns returned by the etl_StartRun stored procedure.
First, expand the Metadata section of the repository, then right-click on Generic schemas. Choose Create generic schema:
Finally, name the schema etlRun instead of metadata, and add two columns to the schema. The columns will be named key and value; they are strings of length 255 and they allow nulls. When you're done, click Finish:
Now we're ready to start working on the actual job. Drag a tPrejob component from the palette to your canvas. It's located under the Orchestration section:
Next, drag a tMysqlInput component to the canvas. It's located under Databases --> MySQL. Position it to the right of tPrejob. Right-click on tPrejob and link Trigger --> On Component OK from tPrejob to tMysqlInput.
Change the name tMysqlInput to etl_StartRun using the View sub tab in the Component tab.
Click on etl_StartRun, then choose the Component tab below your canvas. Set these properties:
Property Type: Repository - DB (MYSQL):DataWarehouse (select your database connection from the repository)
Schema - Repository - GENERIC:Auditing - etlRun (select the metadata you just created from the repository)
Query - "CALL etl_StartRun()" - (don't forget the quotation marks.)
Choose the generic schema you just created by clicking on the button next to the schema field, then expand the schema that was just added and select etlRun:
So far, so good. Next, drop a tContextLoad component on the canvas. This component is located under the Misc tab in the palette. This component accepts data and loads it to the current context.
Link the main row output of etl_StartRun to tContextLoad.
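Conceptually, the key and value rows coming out of etl_StartRun become named context variables. The Java sketch below models that idea with a plain map; it is not the tContextLoad implementation, and the sample row ("run_id", "42") is a hypothetical value, since the actual output of the stored procedure is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContextLoadSketch {
    // Loads key/value rows into a context map; each row is {key, value}.
    static Map<String, String> load(List<String[]> rows) {
        Map<String, String> context = new HashMap<String, String>();
        for (String[] row : rows) {
            context.put(row[0], row[1]);
        }
        return context;
    }

    // Reads run_id from the context as an Integer, or null if it was never loaded.
    static Integer runId(Map<String, String> context) {
        String value = context.get("run_id");
        return value == null ? null : Integer.valueOf(value);
    }

    // Hypothetical rows matching the two-column key/value schema.
    static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] {"run_id", "42"});
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> context = load(sampleRows());
        System.out.println(runId(context));
    }
}
```

The values arrive as strings, so the variable's declared type decides how a value such as run_id is converted before later components use it.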
To extract the run_id from the context, we'll need to give TOS a little more information. At the bottom of the window, click on the Contexts tab. Make sure you are on the Variables sub-tab, then click the add button to add a new entry.
The Name is run_id, its source is built-in, its type is Integer, and the script code is context.run_id.
When you are done, the tab should look like this:
Finally, left-click in the blue area surrounding etl_StartRun to select the sub-job.
Make sure you're on the Component tab, then click to select the box Show subjob title. Type in an appropriate title:
Logging
TOS has two "catcher" components: tLogCatcher and tStatCatcher. These components listen for log messages and statistics and then create a flow using that information. This allows us the flexibility to write to a log file, to a database, or anywhere else!
In lesson 5 we created a table called etlLog. We'll use the catcher components and some transformation logic to save logs and stats in that table. Let's get started. Drag one tLogCatcher component and one tStatCatcher component to your canvas; they are located under the Logs & Errors tab:
Take a look at the schema for the components you just dropped on the canvas. Select one of them and switch to the Component tab. Then click the button next to Edit Schema. (This is a little misleading since you can't actually edit the schema for these components.)
You can see that these components are similar, but not exactly the same. We could have sent each component to a different database table, but then we would have to do extra work to see all our log information. With a little work we can transform, then unite the two components to log into our single etlLog table.
We'll use two tMap components to transform the log and stat output. Drag two tMap components from the Processing menu of the palette to your canvas, then link the Main outputs to the tMap components. When done, that part of your canvas should look like this:
Next, double-click the tMap component linked to tLogCatcher. Your input to tMap might not be called row2 - that's fine. A modal window will open; you can add output information there. Some columns should have blank expressions. Those columns will be used by the stat catcher only. Some columns, like run_id, do not exist in the input. You'll have to add them manually. Also, be sure your columns are in the correct order. Add one output, then add or link the columns below:
Expression       Column        Type
context.run_id   run_id        integer
row2.moment      moment
row2.pid         pid
row2.father_pid  father_pid
row2.root_pid    root_pid
blank            system_pid    Long, length 8
row2.project     project
row2.job         job
blank            job_version
row2.context     context
row2.priority    priority
row2.origin      origin
row2.type        message_type
row2.message     message
row2.code        code
blank            duration      Long, length 8
Note
Make sure you rename the type column to message_type.
Now is a good time to save your work.
Next, double-click the tMap component linked to tStatCatcher. Your input to tMap may not be called row4 - that's okay. Some columns should have blank expressions. Those columns will be used by the log catcher only. Add one output, then add or link the following columns:
Expression         Column        Type
context.run_id     run_id        integer
row4.moment        moment
row4.pid           pid
row4.father_pid    father_pid
row4.root_pid      root_pid
row4.system_pid    system_pid
row4.project       project
row4.job           job
                   job_version
row4.context       context
blank              priority      integer, length 3
row4.origin        origin
row4.message_type  message_type
row4.message       message
blank              code          integer, length 3
row4.duration      duration
That looks great! Now we need to take our two flows of log data and unite them into a single flow. This is done using the tUnite component, which is under the Orchestration section in the palette. Drop one onto your canvas.
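The two mappings and the unite step can be sketched in Java as follows. This is a simplified model, not the code TOS generates: the LogRow class keeps only a handful of the etlLog columns, and the method names and sample values are hypothetical. The point is that both sources are reshaped into one common row type, with nulls standing in for the blank expressions, before they are combined.

```java
import java.util.ArrayList;
import java.util.List;

class UniteSketch {
    // Common row shape holding a few of the etlLog columns.
    static final class LogRow {
        final int runId;
        final String job;
        final String messageType;
        final String message;
        final Integer priority; // supplied by the log catcher only
        final Long duration;    // supplied by the stat catcher only

        LogRow(int runId, String job, String messageType, String message,
               Integer priority, Long duration) {
            this.runId = runId;
            this.job = job;
            this.messageType = messageType;
            this.message = message;
            this.priority = priority;
            this.duration = duration;
        }
    }

    // Mapping for log rows: "type" is renamed to messageType, duration stays blank.
    static LogRow fromLog(int runId, String job, String type, String message, int priority) {
        return new LogRow(runId, job, type, message, Integer.valueOf(priority), null);
    }

    // Mapping for stat rows: priority stays blank, duration is carried over.
    static LogRow fromStat(int runId, String job, String messageType, String message, long duration) {
        return new LogRow(runId, job, messageType, message, null, Long.valueOf(duration));
    }

    // Unite: append both flows into one; this only works because the shapes match.
    static List<LogRow> unite(List<LogRow> logs, List<LogRow> stats) {
        List<LogRow> all = new ArrayList<LogRow>(logs);
        all.addAll(stats);
        return all;
    }

    public static void main(String[] args) {
        List<LogRow> logs = new ArrayList<LogRow>();
        logs.add(fromLog(42, "dimMovie", "tWarn", "sample warning", 4));
        List<LogRow> stats = new ArrayList<LogRow>();
        stats.add(fromStat(42, "dimMovie", "end", "success", 1200L));
        System.out.println(unite(logs, stats).size());
    }
}
```

If the two mappings produced different column lists or orders, there would be no single row type to combine them into, which is the situation a schema warning on the unite component points to.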
Next, link the o utput fro m each o f yo ur map co mpo nents to t Unit e . When yo u're finished, yo ur canvas
sho uld lo o k similar to this (do n't wo rry abo ut names, it's o kay if they're different):
Yo u might see a warning o n yo ur t Unit e co mpo nent. This happens when the schema fo r o ne o f the input
flo ws is different fro m the o thers. Yo u can check o ut the schemas by selecting t Unit e , clicking o n the
Co m po ne nt tab, then clicking o n the
Yo u'll have to do a visual inspectio n to see which co lumn differs, then go back to the co rrect t Map
co mpo nent to fix the issue. If TOS asks yo u whether yo u want to Propagate Changes, cho o se Ye s.
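The visual inspection can also be approximated outside TOS. As a minimal sketch, assuming each schema has been copied by hand into a list of (column, type) pairs, the following Python compares two schemas position by position. The function name schema_differences and the shortened sample schemas (including their types) are hypothetical; this is not something TOS provides.

```python
def schema_differences(left, right):
    """Return (position, left_column, right_column) for every mismatch."""
    diffs = []
    for i in range(max(len(left), len(right))):
        a = left[i] if i < len(left) else None
        b = right[i] if i < len(right) else None
        if a != b:
            diffs.append((i, a, b))
    return diffs


# Shortened, hypothetical schemas: each column is a (name, type) pair.
log_flow = [("run_id", "Integer"), ("moment", "Date"), ("type", "Integer")]
stat_flow = [("run_id", "Integer"), ("moment", "Date"), ("message_type", "String")]

mismatches = schema_differences(log_flow, stat_flow)
for position, left_col, right_col in mismatches:
    print(position, left_col, right_col)
```

In this made-up case the third column differs in both name and type, which is the kind of mismatch that would have to be fixed in the corresponding tMap.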
Finally, drop a tMysqlOutput component from the Databases menu of the palette onto your canvas. Link
the Main output of tUnite to tMysqlOutput.
Set the database connection. Click on the tMysqlOutput Component tab, and go to Basic settings. Set
the Property Type to Repository. TOS will ask if you want to take the schema from the input component.
You do, so choose Yes. Make sure you specify that the Table is "etlLog", and that you want to Insert data.
The Action on table should remain None.
Your canvas should now look similar to this:
Save your work. With this code done, we are ready to start working on our dimension!
dimMovie
Our basic job structure is in place, so now we're free to concentrate on the actual dimension processing.
Let's start with dimMovie. You might recall from lesson 3 that dimMovie is a Type 1 slowly changing
dimension. This means that we do not track any changes on the dimension; instead, we update rows in our
data warehouse as they change in the source system. (We'll learn more about this a little later.)
This is the first time we need to connect to the sakila database, so we need to add a new shared database
connection.
Name the connection sakila, then click Next >. The database type is once again MySQL. Leave the Login
and Password options blank, but set the server to sql.oreillyschool.com. The port is 3306, and the
Database is sakila. Click Check to make sure you can connect to sakila, then click Finish to save your new
connection.
Drop a tMysqlInput component on your canvas. Go to the Component tab, and set the connection to sakila
from the Repository using the
At this point you could choose to save your query in the repository next to the database connection. But since
this query will only be used in this specific job, we'll leave the query in this tMysqlInput component. Click on
the
Back in lesson 5, we wrote a query to get data out of the movie tables. We'll use that same query now. Just
this once, copy and paste this query into the SQL builder window:
CODE TO USE:
SELECT f.film_id, f.title, f.description, f.release_year,
l.name AS language, orig_lang.name AS original_language,
f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l ON (f.language_id = l.language_id)
LEFT JOIN language orig_lang ON (f.original_language_id = orig_lang.language_id)
To test your query, click on the runner, or press ctrl-enter. You'll see, at most, 100 rows of results:

Column            | Type    | Length | Notes
film_id           | int     |        | key, not null
title             | String  | 27     |
description       | String  | 200    |
release_year      | int     |        |
language          | String  | 20     |
original_language | String  | 20     |
rental_duration   | Integer |        |
length            | Integer |        |
rating            | String  |        |
special_features  | String  | 60     |
dimMovie doesn't really need any transformations, since the source data of the dimension is very good. But
we do need to add our audit column to the flow:
1. Drag a tMap component to your canvas.
2. Map the Main output of tMysqlInput to tMap.
Well then, you ask, how does TOS know if the row exists? Another good question. TOS looks at the Key that
we specified in the schema. In this case, we set the column film_id to be a key column, so this column is
used to check whether the row exists. All columns specified as keys are checked when performing an
insert or update.
Another question you might be asking yourself is, why don't we just delete data from dimMovie and reload the
whole thing?
Yet another good question. The answer is: foreign keys. Our primary key for dimMovie is an auto-increment
field called movie_key. This column will be used to link facts to this dimension. If we wipe out dimMovie each
time we run our load, we'll always have to reload all facts as well. This might be okay in the short term, but at
some point we may not want to reload everything in our data warehouse. Using "insert or update" means that
existing data does not get deleted, so a movie_key is preserved across runs. This is important even if the
underlying database does not enforce foreign key constraints.
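To make the key-preservation argument concrete, here is a minimal sketch in Python. It uses an in-memory SQLite table as a stand-in for the MySQL dimMovie table and keeps only a few columns. The helper insert_or_update is hypothetical and only imitates the "insert or update" behaviour described above; it is not the code TOS generates, and the "(REMASTERED)" title is made up.

```python
import sqlite3


def insert_or_update(conn, film_id, title, run_id):
    """Update the row whose key column (film_id) matches; insert it if absent."""
    cur = conn.execute(
        "UPDATE dimMovie SET title = ?, run_id = ? WHERE film_id = ?",
        (title, run_id, film_id),
    )
    if cur.rowcount == 0:  # no existing row had this key
        conn.execute(
            "INSERT INTO dimMovie (film_id, title, run_id) VALUES (?, ?, ?)",
            (film_id, title, run_id),
        )


def movie_key(conn, film_id):
    """Look up the surrogate key currently assigned to a source film_id."""
    return conn.execute(
        "SELECT movie_key FROM dimMovie WHERE film_id = ?", (film_id,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimMovie ("
    " movie_key INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key
    " film_id INTEGER NOT NULL UNIQUE,"              # key column from the source
    " title TEXT, run_id INTEGER)"
)

# First load.
insert_or_update(conn, 1, "ACADEMY DINOSAUR", 1)
insert_or_update(conn, 2, "ACE GOLDFINGER", 1)
key_before = movie_key(conn, 2)

# Second load: the title changed in the source, so the row is updated in place.
insert_or_update(conn, 2, "ACE GOLDFINGER (REMASTERED)", 2)
key_after = movie_key(conn, 2)

# Delete-and-reload instead: the same film receives a different surrogate key.
conn.execute("DELETE FROM dimMovie")
insert_or_update(conn, 2, "ACE GOLDFINGER", 3)
key_reloaded = movie_key(conn, 2)

print(key_before, key_after, key_reloaded)
```

Any fact row that stored key_before would still point at the right movie after the second load, but not after the delete-and-reload.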
Now that we have our job done, it's time to run it! Click on the
you'll only see two lines of output:
OBSERVE:
Starting job ProcessDataWarehouse at 13:29 09/06/2009.
Job ProcessDataWarehouse ended at 13:32 09/06/2009. [exit code=0]
You can also check the database to see what has been logged for your run. Switch to a terminal, and log into your
personal database. Run the following command against your personal database:
CODE TO TYPE:
mysql> select * from etlRuns;
Your results won't be exactly the same, but they should look similar to the following:
OBSERVE:
mysql> select * from etlRuns;
+--------+---------------------+---------------------+
| run_id | start_time          | end_time            |
+--------+---------------------+---------------------+
|      1 | 2009-06-09 11:49:07 | 2009-06-09 11:49:12 |
+--------+---------------------+---------------------+
1 row in set (0.00 sec)
Next, check the etlLog table. Run the following command against your personal database:
CODE TO TYPE:
mysql> select * from etlLog;
You'll see some results similar to the following:
OBSERVE:
mysql> select * from etlLog;
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
| run_id | moment              | pid    | father_pid | root_pid | system_pid | project | job                  | job_repository_id       | job_version | context | priority | origin | message_type | message | code | duration | count | reference | thresholds |
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
|      1 | 2009-06-09 11:49:07 | GOrr6m | GOrr6m     | GOrr6m   |       8356 | DBA3    | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1         | Default |     NULL | NULL   | begin        | NULL    | NULL |     NULL |  NULL | NULL      | NULL       |
|      1 | 2009-06-09 11:49:12 | GOrr6m | GOrr6m     | GOrr6m   |       8356 | DBA3    | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1         | Default |     NULL | NULL   | end          | success | NULL |     5594 |  NULL | NULL      | NULL       |
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
2 rows in set (0.00 sec)
These results show that our run was successful, and took 5594 milliseconds (about 5.6 seconds) to run. The etlRuns
table confirms this with the start and end times of the job, which are five seconds apart. We're looking pretty good!
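The two tables can be cross-checked with a little arithmetic. The sketch below is only an illustration using the sample values shown above; the one-second tolerance is an assumption that reflects the whole-second resolution of the etlRuns timestamps.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the sample etlRuns and etlLog output above.
start_time = datetime.strptime("2009-06-09 11:49:07", FMT)
end_time = datetime.strptime("2009-06-09 11:49:12", FMT)
logged_duration_ms = 5594

elapsed_seconds = (end_time - start_time).total_seconds()
logged_seconds = logged_duration_ms / 1000

# The timestamps only have one-second resolution, so allow a one-second gap.
consistent = abs(elapsed_seconds - logged_seconds) < 1.0
print(elapsed_seconds, logged_seconds, consistent)
```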
Performance
When developing a software system, it is often best to start with a simple solution and move to a more complex
solution as development goes on:
Our movie dimension is very small; it only has 1000 rows. Since it is so small, it is acceptable to recreate this
dimension (along with other related facts) from scratch each day. This solution is simple and works well for small data
warehouses.
But what if our movie dimension had 500,000 rows in it? What if our source system was an ancient computer,
requiring 10 hours to extract movie data? Running a data load for 10 hours every day would not be a great option; even
if the rest of the warehouse processing only took a minute or two, there would only be 14 hours left for warehouse use.
Should the movie dimension grow to be 1,000,000 rows, the warehouse load might take 20 hours to complete!
When dimensions are large, it is necessary to add complexity to the warehouse in order to reduce load times.
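The numbers in this example follow from simple proportional arithmetic. The sketch below assumes, as the example implicitly does, that extraction time grows linearly with the number of rows; the function and variable names are illustrative only.

```python
HOURS_PER_DAY = 24


def load_hours(rows, rows_per_hour):
    """Estimated extraction time, assuming time scales linearly with row count."""
    return rows / rows_per_hour


# Implied by the example: 500,000 rows take 10 hours to extract.
rows_per_hour = 500_000 / 10

hours_for_500k = load_hours(500_000, rows_per_hour)
hours_for_1m = load_hours(1_000_000, rows_per_hour)
hours_left_for_use = HOURS_PER_DAY - hours_for_500k

print(hours_for_500k, hours_for_1m, hours_left_for_use)
```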
The first step toward optimizing performance seems straightforward: only query the source system for new and
changed records. Our audit tables capture the date and time that the dimension was updated. That time stamp can be
used to select records in the source system. But this process is often more difficult than it initially seems.
Many source systems simply don't track enough data to make this query work. The film table in the sakila database
has a column called last_update which should get set when the row is created, and updated when the row changes.
But what if it had a column called date_created instead? How would we know when a row had changed?
In many situations you will have to make changes to source systems to make data warehouse loads easier to
manage. This might involve adding time stamp columns, modifying existing columns, or even creating completely new
tables. Keep this in mind as we move forward.
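As a rough illustration of "only query the source system for new and changed records", the sketch below uses an in-memory SQLite database with simplified stand-ins for the film and etlRuns tables. It assumes the start time of the most recent completed run is a safe cut-off and that last_update is maintained reliably; the function name, the third film row, and the sample time stamps are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, last_update TEXT);
    INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR', '2009-06-01 08:00:00');
    INSERT INTO film VALUES (2, 'ACE GOLDFINGER',   '2009-06-09 10:30:00');
    INSERT INTO film VALUES (3, 'ADAPTATION HOLES', '2009-06-10 09:15:00');

    CREATE TABLE etlRuns (run_id INTEGER PRIMARY KEY, start_time TEXT, end_time TEXT);
    INSERT INTO etlRuns VALUES (1, '2009-06-09 11:49:07', '2009-06-09 11:49:12');
    """
)


def films_changed_since_last_run(conn):
    """Return films created or updated since the last completed run started."""
    last_start = conn.execute(
        "SELECT MAX(start_time) FROM etlRuns WHERE end_time IS NOT NULL"
    ).fetchone()[0]
    if last_start is None:
        # No completed run yet: everything counts as new.
        return conn.execute(
            "SELECT film_id, title FROM film ORDER BY film_id"
        ).fetchall()
    return conn.execute(
        "SELECT film_id, title FROM film WHERE last_update >= ? ORDER BY film_id",
        (last_start,),
    ).fetchall()


changed = films_changed_since_last_run(conn)
print(changed)
```

If the source table only had a date_created column, the WHERE clause above would find new rows but miss every update, which is the problem described in the text.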
We've covered a lot in this lesson! Stay tuned - in the next lesson we'll continue working with our dimensions and
learn how to process slowly changing dimensions. See you then!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
SCD Processing
DBA 3: Data Warehousing Lesson 9
Welcome back! In the last lesson we implemented our first dimension, dimMovie, as a Type 1 slowly changing dimension. In
this lesson we'll take a look at the remaining dimensions, and implement them as Type 2 slowly changing dimensions.
TOS has two SCD components for MySQL: tMysqlSCD and tMysqlSCDELT. In TOS, components that are named ELT are
processed on the database server itself. This can make the process much faster, because data doesn't have to travel
outside of the database server to be processed. The drawback to this process is that all data must be located on a
single database server, something that isn't often possible.
So how does the SCD component work for Type 2 dimensions? Like this:
OBSERVE:
For each row of data:
1. Check whether the row exists within the dimension. If it does not, insert it.
2. Now that the row exists within the dimension, determine whether the specified Type
   2 columns have changed:
   a. If they have not changed, go on to the next row.
   b. If they have changed:
      - Insert a new row into the dimension, with the start date of today and the end
        date of 31-DEC-2099.
      - Update the prior row, setting the end date to today.
Note
Remember, some warehouses use NULL instead of 31-DEC-2099 as the end date.
This algorithm isn't extremely difficult to implement; however, it would really stink if you had to implement it for each
dimension you wanted to process! Fortunately, it has been implemented, tested, and optimized for us already.
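To show the shape of the Type 2 steps, here is a minimal sketch in plain Python that treats the dimension as a list of dictionaries. It is not how tMysqlSCD is implemented: surrogate keys are omitted, the function name apply_type2 is hypothetical, and the sample customer rows are shortened to a single tracked column.

```python
from datetime import date

# The "31-DEC-2099" end date from the steps above; some warehouses use NULL.
END_OF_TIME = date(2099, 12, 31)


def apply_type2(dimension, source_rows, key, type2_cols, today):
    """Apply the Type 2 steps to `dimension` (a list of dict rows), in place."""
    for src in source_rows:
        # Find the current (open-ended) version for this source key, if any.
        current = next(
            (row for row in dimension
             if row[key] == src[key] and row["end_date"] == END_OF_TIME),
            None,
        )
        if current is None:
            # The row does not exist in the dimension yet: insert it.
            dimension.append({**src, "start_date": today, "end_date": END_OF_TIME})
            continue
        if all(current[col] == src[col] for col in type2_cols):
            continue  # nothing changed; go on to the next row
        # A Type 2 column changed: close the prior row, then add a new version.
        current["end_date"] = today
        dimension.append({**src, "start_date": today, "end_date": END_OF_TIME})
    return dimension


dim = []
apply_type2(dim, [{"customer_id": 218, "city": "Kabul"}],
            "customer_id", ["city"], date(2009, 6, 29))
apply_type2(dim, [{"customer_id": 218, "city": "KABUL"}],
            "customer_id", ["city"], date(2009, 6, 30))
# A third run with unchanged data adds nothing.
apply_type2(dim, [{"customer_id": 218, "city": "KABUL"}],
            "customer_id", ["city"], date(2009, 7, 1))

for row in dim:
    print(row)
```

After the three runs the list holds two versions of customer 218: the original row, closed on the day of the change, and the new row, open until the far-future end date.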
Type 3 and Type 4 SCDs are handled in almost the same way, except that the history is located elsewhere. In Type 3 it
is put into the same row; in Type 4 it is put into a different table.
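The difference between the two can be sketched in a few lines of Python. This is only an illustration: the previous_city column, the history list, the function names, and the new city value are all hypothetical.

```python
def apply_type3(row, column, new_value):
    """Type 3: keep the prior value in a companion column of the same row."""
    if row[column] != new_value:
        row["previous_" + column] = row[column]
        row[column] = new_value
    return row


def apply_type4(row, history, column, new_value):
    """Type 4: copy the old row to a separate history table, then overwrite."""
    if row[column] != new_value:
        history.append(dict(row))
        row[column] = new_value
    return row


type3_row = {"customer_id": 218, "city": "Kabul", "previous_city": None}
apply_type3(type3_row, "city", "Herat")

type4_row = {"customer_id": 218, "city": "Kabul"}
type4_history = []
apply_type4(type4_row, type4_history, "city", "Herat")

print(type3_row)
print(type4_row, type4_history)
```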
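Column      | Type   | Nullable | Length
first_name  | String |          | 45
last_name   | String |          | 45
email       | String | Yes      | 50
address     | String |          | 50
address2    | String | Yes      | 50
district    | String |          | 20
city        | String |          | 50
country     | String |          | 50
postal_code | String |          | 10
phone       | String |          | 20
active      | int    |          |
create_date | Date   |          |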
Note
If your dimCustomer table is slightly different, then you'll need to change your schema. For
example, if your table allows NULLs for email, you should change the schema to allow
nulls in that column.
You might wonder why we aren't using the Guess Schema button to have TOS figure out the schema for us.
It seems like that would be fast and relatively easy. But in practice, the Guess Schema option only works well
with simple schemas and data types. That's because TOS examines the data that the database returns from
the query, in order to make decisions on data types and lengths, and ignores the underlying data types set in
the database tables.
That might be okay for some queries, but it doesn't work for our dimCustomer query. In this query, the
address2 column currently has only NULL values. TOS can't determine a data type when there's no data. For other
columns, like postal_code, TOS sees only values like 90210, so TOS guesses that the data type is
Integer. We know that other parts of the world have different postal code formats ("SW1A 0AA" is a valid
postal code in the United Kingdom), so our dimCustomer table uses varchar as its data type, not integer.
It's fine to use Guess Schema as a starting point, but you still need to verify manually that the columns TOS
picks are of the correct data type, nullability, and length.
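The pitfall is easy to reproduce with any inference that looks only at sample data. The function below is a deliberately naive, hypothetical stand-in; it is not the logic TOS uses, but it fails in the same two ways described above.

```python
def guess_type(values):
    """Guess a column type from sample values only, ignoring declared table types."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # no data at all, so no type can be inferred
    if all(str(v).isdigit() for v in non_null):
        return "Integer"
    return "String"


address2_guess = guess_type([None, None, None])          # a column holding only NULLs
postal_guess = guess_type(["90210", "40301", "52625"])   # digits only in the sample
postal_with_uk = guess_type(["90210", "SW1A 0AA"])       # one non-numeric postal code

print(address2_guess, postal_guess, postal_with_uk)
```

The guess for postal codes flips from Integer to String as soon as a single non-numeric value appears in the sample, which is why the declared column type should win over a data-driven guess.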
With the input out of the way, we are free to move on to the mapping. Drag a tMap component to the canvas,
and link the main row of the previous tMysqlInput component to tMap. Just like before, add a new output,
and add a column called run_id (type: integer, expression: context.run_id) to the output. Link every input
column to the output.
The last step for dimCustomer is to add a tMysqlSCD component to the canvas. Link the output of tMap to
the input of tMysqlSCD, and allow TOS to take the schema from the input component. If TOS does not ask
whether you want to use the input schema, click on Sync Columns.
Set the connection on tMysqlSCD to the data warehouse connection in the repository, and specify
dimCustomer for the table. Once that's done, click on the
In this new window you'll tell tMysqlSCD how to handle every column in the data flow. To set up the
component, drag a column from the Unused section to a different section. We'll start with Source Keys. Our
source key is a single column: customer_id. Drag that column to the Source Keys section. When you're
done, your screen will look like this:
Next, we'll set up our surrogate keys. Our surrogate key does not exist in the data flow. Instead, it will be
created by auto increment in MySQL. Name the surrogate key customer_key, and set its creation to Auto
increment. When you're finished, that section will look like this:
Now we'll specify our Type 0 columns. Since run_id is only included for auditing and logging, it
should never be considered part of the dimension. This value changes at each run, so we never want to track
changes on it. Drag run_id to the Type 0 fields section. That section will now look like this:
We are not particularly interested in tracking changes to our customers' names. And if a customer's
create_date changes, it probably means the source system had an error, and the current record is being fixed,
so we don't want to track changes on that either. Drag first_name, last_name and create_date to the Type
1 fields section. That section will now look like this:
Let's check out Type 2 changes now. Type 2 changes require additional configuration because there are
different ways to track history within the same table (if you'd like to review Type 2 changes, refer back to lesson
3).
Drag the remaining columns from Unused to the Type 2 fields section. Then we need to tell TOS how we are
keeping history. In this dimension, we are using a start and end date, but we are not using a version number
column or an active flag. Rename the start column start_date, and set its creation to Job start time.
Rename the end column end_date, change its creation to Fixed year value, and set its complement to
2099. After you have made these changes, the Type 2 fields section will look like this:
The last section is for Type 3 changes. Here you specify the Type 3 columns and note the corresponding
history columns.
This dimension doesn't have any Type 3 fields, so it's left blank:
Click "OK" to close the SCD Component editor, then save your changes. When you're done, your
dimCustomer subjob should look like this:
OBSERVE:
Starting job ProcessDataWarehouse at 21:08 30/06/2009.
Job ProcessDataWarehouse ended at 21:08 30/06/2009. [exit code=0]
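The column assignments made in the SCD editor for dimCustomer can be summarized as a small data structure. The sketch below is illustrative only: SCD_CONFIG and classify_change are hypothetical names, and the function simply reports which kind of handling a change between two versions of a source row would call for under those assignments.

```python
SCD_CONFIG = {
    "source_keys": ["customer_id"],
    "type0": ["run_id"],
    "type1": ["first_name", "last_name", "create_date"],
    "type2": ["email", "address", "address2", "district", "city",
              "country", "postal_code", "phone", "active"],
}


def classify_change(old, new, config=SCD_CONFIG):
    """Return 'type2', 'type1', or 'none' for the strongest change between rows."""
    if any(old[col] != new[col] for col in config["type2"]):
        return "type2"  # keep history: close the old row and insert a new one
    if any(old[col] != new[col] for col in config["type1"]):
        return "type1"  # overwrite in place, no history kept
    return "none"       # only Type 0 columns (or nothing) changed


base = {
    "customer_id": 218, "run_id": 81,
    "first_name": "VERA", "last_name": "MCCOY",
    "create_date": "2004-03-19 00:00:00",
    "email": "VERA.MCCOY@sakilacustomer.org",
    "address": "1168 Najafabad Parkway", "address2": None,
    "district": "KABOL", "city": "Kabul", "country": "Afghanistan",
    "postal_code": "40301", "phone": "886649065861", "active": 1,
}

only_run_id = classify_change(base, {**base, "run_id": 82})
name_change = classify_change(base, {**base, "last_name": "SMITH"})  # made-up value
city_change = classify_change(base, {**base, "city": "KABUL"})

print(only_run_id, name_change, city_change)
```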
The job ran, but did it populate the dimCustomer table? Switch to a terminal, and log into your personal
database. Run the following command against your personal database:
CODE TO TYPE:
mysql> SELECT count(*) FROM dimCustomer;
If your job ran successfully, you will see this:
OBSERVE:
mysql> SELECT count(*) FROM dimCustomer;
+----------+
| count(*) |
+----------+
|      589 |
+----------+
1 row in set (0.00 sec)
To be sure everything worked, take a look at some rows. Run the following command against your personal
database:
CODE TO TYPE:
mysql> SELECT * FROM dimCustomer
LIMIT 0, 10;
OBSERVE:
mysql> SELECT * FROM dimCustomer
    -> LIMIT 0, 10;
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
| customer_key | customer_id | first_name | last_name | email                               | address                       | address2 | district     | city            | country        | postal_code | phone        | active | create_date         | start_date | end_date   | run_id |
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
|            1 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org       | 1168 Najafabad Parkway        |          | KABOL        | Kabul           | Afghanistan    | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            2 |         441 | MARIO      | CHEATHAM  | MARIO.CHEATHAM@sakilacustomer.org   | 1924 Shimonoseki Drive        |          | BATNA        | Batna           | Algeria        | 52625       | 406784385440 |      1 | 2004-10-07 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            3 |          69 | JUDY       | GRAY      | JUDY.GRAY@sakilacustomer.org        | 1031 Daugavpils Parkway       |          | BCHAR        | Bchar           | Algeria        | 59025       | 107137400143 |      1 | 2004-02-25 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            4 |         176 | JUNE       | CARROLL   | JUNE.CARROLL@sakilacustomer.org     | 757 Rustenburg Avenue         |          | SKIKDA       | Skikda          | Algeria        | 89668       | 506134035434 |      1 | 2004-08-11 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            5 |         320 | ANTHONY    | SCHWAB    | ANTHONY.SCHWAB@sakilacustomer.org   | 1892 Nabereznyje Telny Lane   |          | TUTUILA      | Tafuna          | American Samoa | 28396       | 478229987054 |      1 | 2004-07-20 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            6 |         528 | CLAUDE     | HERZOG    | CLAUDE.HERZOG@sakilacustomer.org    | 486 Ondo Parkway              |          | BENGUELA     | Benguela        | Angola         | 35202       | 105882218332 |      1 | 2004-01-24 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            7 |         383 | MARTIN     | BALES     | MARTIN.BALES@sakilacustomer.org     | 368 Hunuco Boulevard          |          | NAMIBE       | Namibe          | Angola         | 17165       | 106439158941 |      1 | 2004-05-31 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            8 |         381 | BOBBY      | BOUDREAU  | BOBBY.BOUDREAU@sakilacustomer.org   | 1368 Maracabo Boulevard       |          |              | South Hill      | Anguilla       | 32716       | 934352415130 |      1 | 2004-08-29 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            9 |         359 | WILLIE     | MARKHAM   | WILLIE.MARKHAM@sakilacustomer.org   | 1623 Kingstown Drive          |          | BUENOS AIRES | Almirante Brown | Argentina      | 91299       | 296394569728 |      1 | 2004-08-13 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|           10 |         560 | JORDAN     | ARCHULETA | JORDAN.ARCHULETA@sakilacustomer.org | 1229 Varanasi (Benares) Manor |          | BUENOS AIRES | Avellaneda      | Argentina      | 40195       | 817740355461 |      1 | 2004-01-15 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
10 rows in set (0.01 sec)
If you scroll to the right, you might notice something a bit strange. Check out the create_date and
start_date (copied below):
OBSERVE:
+--------+---------------------+------------+------------+--------+
| active | create_date         | start_date | end_date   | run_id |
+--------+---------------------+------------+------------+--------+
|      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-10-07 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-02-25 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-11 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-07-20 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-01-24 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-05-31 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-29 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-13 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-01-15 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
+--------+---------------------+------------+------------+--------+
The customer's record was created back on March 19th, 2004, but the row in the dimension has a
start_date of today (2009-06-30).
The start_date and end_date columns on this Type 2 SCD indicate that "this row is valid and correct
between the dates of June 30th, 2009 and January 1st, 2099."
The customer with customer_id=218 existed on January 1st, 2007..., but that's not what the row tells us
now.
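The consequence shows up as soon as the dimension is queried "as of" a date. The sketch below assumes a row is valid from start_date up to, but not including, end_date; that half-open convention and the function name valid_on are assumptions made for the example.

```python
from datetime import date


def valid_on(rows, customer_id, as_of):
    """Return the version of a customer whose validity interval contains as_of."""
    for row in rows:
        if (row["customer_id"] == customer_id
                and row["start_date"] <= as_of < row["end_date"]):
            return row
    return None


# The row as loaded: start_date was set to the job start time.
as_loaded = [{"customer_id": 218,
              "start_date": date(2009, 6, 30), "end_date": date(2099, 1, 1)}]
# The same row with start_date taken from create_date instead.
corrected = [{"customer_id": 218,
              "start_date": date(2004, 3, 19), "end_date": date(2099, 1, 1)}]

missing = valid_on(as_loaded, 218, date(2007, 1, 1))
found = valid_on(corrected, 218, date(2007, 1, 1))
print(missing, found)
```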
You may recall that earlier in the lesson, we set the properties for the SCD component. Specifically, we
renamed the start column to start_date, and set its creation to Job start time.
The problem here has to do with historical data. The very first time we set up our dimension, start_date must
hold the first date on which the row was valid, not today's date. Graphically, the time line after the initial load looks like this:
We need to fix this date, so each row has a valid start date. Graphically, the time line looks like this:
CODE TO TYPE:
mysql> TRUNCATE TABLE dimCustomer;
When the command executes, you'll see Query OK, 0 rows affected (0.00 sec), even though the table
is now empty.
Rerun the job. When it completes, switch back to the terminal, and run this command against your personal
database:
CODE TO TYPE:
mysql> SELECT count(*) FROM dimCustomer;
There should still be 589 rows in the table:
OBSERVE:
mysql> SELECT count(*) FROM dimCustomer;
+----------+
| count(*) |
+----------+
|      589 |
+----------+
1 row in set (0.00 sec)
Check the create and start date next. Run this command against your personal database:
CODE TO TYPE:
mysql> SELECT customer_key, first_name, last_name, address, district, city, country,
create_date, start_date, end_date FROM dimCustomer
LIMIT 0, 10;
The results look much better:
OBSERVE:
mysql> SELECT customer_key, first_name, last_name, address, district, city, country,
    -> create_date, start_date, end_date FROM dimCustomer
    -> LIMIT 0, 10;
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
| customer_key | first_name | last_name | address                       | district     | city            | country        | create_date         | start_date | end_date   |
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
|            1 | VERA       | MCCOY     | 1168 Najafabad Parkway        | KABOL        | Kabul           | Afghanistan    | 2004-03-19 00:00:00 | 2004-03-19 | 2099-01-01 |
|            2 | MARIO      | CHEATHAM  | 1924 Shimonoseki Drive        | BATNA        | Batna           | Algeria        | 2004-10-07 00:00:00 | 2004-10-07 | 2099-01-01 |
|            3 | JUDY       | GRAY      | 1031 Daugavpils Parkway       | BCHAR        | Bchar           | Algeria        | 2004-02-25 00:00:00 | 2004-02-25 | 2099-01-01 |
|            4 | JUNE       | CARROLL   | 757 Rustenburg Avenue         | SKIKDA       | Skikda          | Algeria        | 2004-08-11 00:00:00 | 2004-08-11 | 2099-01-01 |
|            5 | ANTHONY    | SCHWAB    | 1892 Nabereznyje Telny Lane   | TUTUILA      | Tafuna          | American Samoa | 2004-07-20 00:00:00 | 2004-07-20 | 2099-01-01 |
|            6 | CLAUDE     | HERZOG    | 486 Ondo Parkway              | BENGUELA     | Benguela        | Angola         | 2004-01-24 00:00:00 | 2004-01-24 | 2099-01-01 |
|            7 | MARTIN     | BALES     | 368 Hunuco Boulevard          | NAMIBE       | Namibe          | Angola         | 2004-05-31 00:00:00 | 2004-05-31 | 2099-01-01 |
|            8 | BOBBY      | BOUDREAU  | 1368 Maracabo Boulevard       |              | South Hill      | Anguilla       | 2004-08-29 00:00:00 | 2004-08-29 | 2099-01-01 |
|            9 | WILLIE     | MARKHAM   | 1623 Kingstown Drive          | BUENOS AIRES | Almirante Brown | Argentina      | 2004-08-13 00:00:00 | 2004-08-13 | 2099-01-01 |
|           10 | JORDAN     | ARCHULETA | 1229 Varanasi (Benares) Manor | BUENOS AIRES | Avellaneda      | Argentina      | 2004-01-15 00:00:00 | 2004-01-15 | 2099-01-01 |
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
10 rows in set (0.00 sec)
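The corrected start dates above are consistent with a one-time update that copies the date part of each customer's create_date into start_date after the initial load. The exact statement used by the job is not shown in this part of the lesson, so the following is only a sketch under that assumption, run against an in-memory SQLite table with a reduced set of columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimCustomer ("
    " customer_key INTEGER PRIMARY KEY, customer_id INTEGER,"
    " create_date TEXT, start_date TEXT, end_date TEXT)"
)
# Two sample rows as they look right after the initial load.
conn.executemany(
    "INSERT INTO dimCustomer VALUES (?, ?, ?, ?, ?)",
    [
        (1, 218, "2004-03-19 00:00:00", "2009-06-30", "2099-01-01"),
        (2, 441, "2004-10-07 00:00:00", "2009-06-30", "2099-01-01"),
    ],
)

# Assumed one-time fix: start each row's validity at the customer's creation date.
conn.execute("UPDATE dimCustomer SET start_date = DATE(create_date)")

fixed = conn.execute(
    "SELECT customer_id, start_date FROM dimCustomer ORDER BY customer_key"
).fetchall()
print(fixed)
```

Such an update is only appropriate for the very first load. Running it again later would overwrite the start dates of newer row versions, which is why the subjob is deactivated afterwards.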
Now that the start date is looking good, we can disable the subjob that updates start_date. Right-click on
tMysqlRow and choose Deactivate current subjob:
| 1 | VERA | MCCOY | ... | Kabul | Afghanistan | ... | 2099-01-01 |
The FIRST NAME and LAST NAME are all in UPPERCASE letters, but the address, district, city, and country are
in Mixed Case letters. Let's alter our tMap component so that the address, district, city, and country are in
UPPERCASE letters as well. The next time we run our job, each row should change. That's how we'll be able
to tell if our SCD component is working correctly or not.
Double-click on the tMap component for the dimCustomer subjob. When the map screen opens, select the
address column on the output, and click on the
TOS has a built-in function for making a string uppercase. It's located in the StringHandling category, and is
called UPCASE. Set the expression for address so it looks like this:
CODE TO TYPE:
StringHandling.UPCASE(row6.address)
Your input may not be called row6; make sure to use its existing name.
Click OK to close the expression builder. Make similar changes for the other columns: address2, city,
country, and district. When you are done, tMap will look something like this:
OBSERVE:
Starting job ProcessDataWarehouse at 21:58 30/06/2009.
Job ProcessDataWarehouse ended at 22:02 30/06/2009. [exit code=0]
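The effect of the tMap changes made for this run can be mimicked in a few lines of Python. This is an illustration rather than Talend code: map_row and upcase are hypothetical helpers, and passing None through unchanged is an assumption of this sketch, included because address2 currently holds only NULL values.

```python
ADDRESS_COLUMNS = ("address", "address2", "district", "city", "country")


def upcase(value):
    """Uppercase a string, passing None through unchanged."""
    return None if value is None else value.upper()


def map_row(row, run_id):
    """Copy every input column, uppercase the address columns, and add run_id."""
    out = dict(row)
    for col in ADDRESS_COLUMNS:
        out[col] = upcase(row[col])
    out["run_id"] = run_id
    return out


source_row = {
    "customer_id": 218, "first_name": "VERA", "last_name": "MCCOY",
    "address": "1168 Najafabad Parkway", "address2": None,
    "district": "KABOL", "city": "Kabul", "country": "Afghanistan",
}
mapped = map_row(source_row, 83)
print(mapped)
```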
Switch to SQL mode. We'll inspect a single record (customer_id = 218) from our last SQL query to see
whether it changes. Run this command against your personal database:
CODE TO TYPE:
mysql> SELECT * FROM dimCustomer
WHERE customer_id=218
ORDER BY customer_key;
If you typed everything correctly, you'll see this:
OBSERVE:
mysql> SELECT * FROM dimCustomer
    -> WHERE customer_id=218
    -> ORDER BY customer_key;
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
| customer_key | customer_id | first_name | last_name | email                         | address                | address2 | district | city  | country     | postal_code | phone        | active | create_date         | start_date | end_date   | run_id |
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
|            1 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org | 1168 Najafabad Parkway |          | KABOL    | Kabul | Afghanistan | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2004-03-19 | 2009-06-30 |     82 |
|          590 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org | 1168 NAJAFABAD PARKWAY |          | KABOL    | KABUL | AFGHANISTAN | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     83 |
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
2 rows in set (0.01 sec)
Sure enough, the changes were recorded!
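Inspecting one customer by eye works for a spot check. For the whole table, a pair of queries can look for the two typical Type 2 defects: a customer without exactly one open-ended row, and a closed row whose end_date is not picked up by a following version. The sketch below runs them against an in-memory SQLite copy of the two rows shown above, with a reduced set of columns; the queries are suggestions, not part of the course job.

```python
import sqlite3

OPEN_END = "2099-01-01"  # the fixed end date produced by the SCD settings above

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimCustomer ("
    " customer_key INTEGER PRIMARY KEY, customer_id INTEGER,"
    " city TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO dimCustomer VALUES (?, ?, ?, ?, ?)",
    [
        (1, 218, "Kabul", "2004-03-19", "2009-06-30"),
        (590, 218, "KABUL", "2009-06-30", "2099-01-01"),
    ],
)

# Customers that do not have exactly one current (open-ended) version.
bad_current = conn.execute(
    "SELECT customer_id FROM dimCustomer GROUP BY customer_id "
    "HAVING SUM(end_date = ?) <> 1",
    (OPEN_END,),
).fetchall()

# Closed versions with no following version that starts on their end date.
gaps = conn.execute(
    "SELECT a.customer_key FROM dimCustomer a "
    "WHERE a.end_date <> ? AND NOT EXISTS ("
    "  SELECT 1 FROM dimCustomer b "
    "  WHERE b.customer_id = a.customer_id AND b.start_date = a.end_date)",
    (OPEN_END,),
).fetchall()

print(bad_current, gaps)
```

Both result lists are empty for the sample rows, meaning customer 218 has one current version and no gap between its two versions.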
Since our test is over, we'll reload our dimension to get it back to a good starting point. Switch to SQL mode
and run the following query:
CODE TO TYPE:
TRUNCATE TABLE dimCustomer;
When the command executes, you'll see Query OK, 0 rows affected (0.00 sec). In TOS, enable the
Initial Load Only subjob, then run the job. Once it completes, disable the Initial Load Only subjob, and
right-click on tMysqlInput to disable the dimCustomer subjob.
We're in great shape!
dimStore
The subjob for dimStore is nearly identical to dimCustomer. Just like before, drag a tMysqlInput
component to your canvas. Configure it to use the sakila database. For the query, we'll use this code (from
lesson 5):
OBSERVE:
SELECT s.store_id, a.address, a.address2, a.district,
c.city, co.country, a.postal_code, s.region,
st.first_name as manager_first_name,
st.last_name as manager_last_name
FROM
store s
JOIN staff st on (s.manager_staff_id = st.staff_id)
JOIN address a on (s.address_id = a.address_id)
JOIN city c on (a.city_id = c.city_id)
JOIN country co on (c.country_id = co.country_id)
Once again, be sure to run your query to make sure you typed it in correctly. Once that's done, click on the
button next to Edit Schema, then enter these columns:

Column             | Type   | Nullable | Length
store_id           | int    |          |
address            | String |          | 50
address2           | String | Yes      | 50
district           | String |          | 20
city               | String |          | 50
country            | String |          | 50
postal_code        | String |          | 10
region             | String |          | 20
manager_first_name | String |          | 45
manager_last_name  | String |          | 45
Let's move on to the mapping. Drag a tMap component to the canvas, and link the main row of the previous
tMysqlInput component to tMap. Add a new output and add a column called run_id (type: integer,
expression: context.run_id) to the output. Link every input column to the output.
Add a tMysqlSCD component to the canvas. Link the output of tMap to the input of tMysqlSCD, and allow
TOS to take the schema from the input component.
Set the connection on tMysqlSCD to the data warehouse connection in the repository, and specify dimStore
for the table. Once that's done, click on the
Section        | Columns
Source keys    | store_id
Surrogate keys | store_key - creation Auto Increment
Type 0 fields  | run_id
Type 1 fields  |
Type 2 fields  | Use all remaining fields. Rename the start column start_date, and set its creation to Job start time. Rename the end column end_date, change its creation to Fixed year value, and set its complement to 2099.
Type 3 fields  |
Just like we did in dimCustomer, we'll have to run a special update statement the first time we load our
dimension here. Let's make life a little easier and reuse our work; copy the subjob we created for
dimCustomer. Right-click on the title Initial Load Only, and choose Copy:
Once copied, right-click on your canvas and choose Paste. The whole subjob should be pasted on your
canvas, but it is probably not next to dimStore. Move it so it is next to dimStore:
Enable the subjob you just pasted, then link the On Component OK trigger from tMysqlSCD to it:
Our source data does not give us a valid start date, so we'll have to come up with one. Let's suppose our
business users informed us that we could use January 1st, 2000 as a valid start date. Click on tMysqlRow,
and change the query so it looks like this:
CODE TO TYPE:
UPDATE dimStore SET start_date = '2000-01-01';
Save and run your job. If everything went all right, you'll see this familiar output:
OBSERVE:
Starting job ProcessDataWarehouse at 13:42 16/12/2008.
Job ProcessDataWarehouse ended at 13:42 16/12/2008. [exit code=0]
Let's take a look at the dimStore table. Switch to your terminal and run the following command against your
personal database:
CODE TO TYPE:
mysql> SELECT * from dimStore;
If your load ran properly, you'll see the following data:
OBSERVE:
mysql> SELECT * from dimStore;
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
| store_key | store_id | address            | address2 | district | city       | country   | postal_code | region | manager_first_name | manager_last_name | start_date | end_date   | run_id |
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
|         1 |        1 | 47 MySakila Drive  | NULL     | Alberta  | Lethbridge | Canada    |             | West   | Mike               | Hillyer           | 2000-01-01 | 2099-01-01 |     85 |
|         2 |        2 | 28 MySQL Boulevard | NULL     | QLD      | Woodridge  | Australia |             | East   | Jon                | Stephens          | 2000-01-01 | 2099-01-01 |     85 |
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
2 rows in set (0.00 sec)
This looks great! Since our initial load is complete, right-click on tMysqlRow in Initial Load Only and
deactivate the current subjob.
Since our dimensions are working well now, we can reactivate all of the subjobs we disabled earlier. Right-click on the
tMysqlInput of the dimMovie job, then choose Activate current subjob:
Do the same for dimCustomer, but be sure to leave the Initial Load Only job disabled.
Wow. We covered a lot in this lesson! In the next lesson we'll combine our dimensions with facts to form our complete data
warehouse. See you there!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
Orchestration
Most machines have at least two processor cores to optimize performance. TOS is built to take advantage of that, so
unless we specify otherwise, TOS may execute our subjobs in parallel. That means our dimensions may be loaded at
the same time. For now, we want to keep our warehouse processes simple, and we don't want all of our subjobs
to execute at the same time. And we definitely want our dimensions loaded before we even consider loading our facts.
Note
Your components' names will likely differ from the examples here. That's fine.
The first component in each subjob is the "master" component - it can be used to trigger execution of another subjob
after the running subjob is complete. We will use it to link our dimension subjobs together, and eventually to link to
our fact subjobs.
To start, right-click on the tMysqlInput component for dimMovie. Select Trigger and then select On Subjob OK:
You'll notice the cursor change. Drag the connection to the tMysqlInput component for dimCustomer, and drop the
connection:
Repeat this process to link dimCustomer to dimStore, and dimStore to dimStaff. Your job will look like this:
Good work so far.
factCustomerCount
The algorithm for populating a fact is fairly short. In English, it might read like this: "For each row, look up the keys
for each dimension, respecting start and end dates for the row and the dimension. When the keys have
been found, insert the row into the fact table."
Note
For this lesson we will assume that we can successfully look up each dimension key. In a future lesson
we will study what might happen if a lookup fails.
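The algorithm can be sketched directly in code. The example below is only an illustration, written in Python with the built-in sqlite3 module rather than MySQL and TOS. The lookup_key helper, the trimmed-down table layouts, and the sample rows are assumptions made up for this sketch, and the dimDate lookup is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER,
                          start_date TEXT, end_date TEXT);
CREATE TABLE dimStore (store_key INTEGER PRIMARY KEY, store_id INTEGER,
                       start_date TEXT, end_date TEXT);
CREATE TABLE factCustomerCount (customer_key INTEGER, store_key INTEGER,
                                customer_count INTEGER);
-- Customer 7 has two Type 2 versions; store 1 has one.
INSERT INTO dimCustomer VALUES (1, 7, '2000-01-01', '2006-03-01'),
                               (2, 7, '2006-03-01', '2099-01-01');
INSERT INTO dimStore VALUES (1, 1, '2000-01-01', '2099-01-01');
""")


def lookup_key(conn, table, key_col, id_col, natural_id, as_of):
    """Hypothetical helper: find the surrogate key valid on the date as_of."""
    row = conn.execute(
        f"SELECT {key_col} FROM {table}"
        f" WHERE {id_col} = ? AND start_date <= ? AND end_date > ?",
        (natural_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


# (customer_id, store_id, create_date) rows as they might arrive from the source.
source_rows = [(7, 1, "2006-02-14"), (7, 1, "2006-03-15")]

for customer_id, store_id, create_date in source_rows:
    customer_key = lookup_key(conn, "dimCustomer", "customer_key",
                              "customer_id", customer_id, create_date)
    store_key = lookup_key(conn, "dimStore", "store_key",
                           "store_id", store_id, create_date)
    conn.execute("INSERT INTO factCustomerCount VALUES (?, ?, 1)",
                 (customer_key, store_key))

loaded = conn.execute(
    "SELECT customer_key, store_key FROM factCustomerCount ORDER BY rowid"
).fetchall()
```

The same customer_id resolves to two different surrogate keys because each source row falls into a different version's date range.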
This algorithm is fairly straightforward, but the location and method used to do the lookup can have some
pretty drastic performance implications. The lookup is just like a database join - you specify how data is related in
order to produce a combined result.
We have two options for location:
1. the computer you are using to develop and run TOS jobs
2. the MySQL server
If you pick option 1, you essentially move data from your source and your data warehouse to your computer, then
send it back to your data warehouse. This process is resource intensive and, as such, isn't usually the preferred option.
If you pick option 2, you need to move data from your source to your data warehouse and place it in some temporary
location (called a staging table), then let the database server process the data. This takes some time, but is still usually
much faster than option 1. To make things just a bit more difficult, most ETL tools are not capable of processing
data this way, leaving you (the developer) to write a whole mess of SQL to make this happen. Fortunately, TOS has
special ELT components that can be used to make the process a bit easier.
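Option 2 boils down to two steps: copy the source rows into a staging table, then let the database do every lookup in one set-based statement. Here is a minimal sketch of that idea, using Python's built-in sqlite3 module in place of MySQL. The table layouts are trimmed down and the sample rows are invented for illustration; this is not the SQL that TOS generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stageFactCustomerCount (customer_id INTEGER, create_date TEXT,
                                     run_id INTEGER);
CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER,
                          start_date TEXT, end_date TEXT);
CREATE TABLE factCustomerCount (customer_key INTEGER, customer_count INTEGER,
                                run_id INTEGER);
INSERT INTO dimCustomer VALUES (1, 7, '2000-01-01', '2099-01-01'),
                               (2, 8, '2000-01-01', '2099-01-01');
""")

# Step 1: copy the source rows into the staging table, untouched.
source_rows = [(7, "2006-02-14 22:04:36", 85), (8, "2006-02-15 09:30:00", 85)]
conn.executemany("INSERT INTO stageFactCustomerCount VALUES (?, ?, ?)",
                 source_rows)

# Step 2: a single statement; the join (the "lookup") runs inside the database,
# so no rows travel back to the machine that runs the job.
conn.execute("""
INSERT INTO factCustomerCount
SELECT dc.customer_key, 1, ss.run_id
FROM stageFactCustomerCount ss
INNER JOIN dimCustomer dc
   ON dc.customer_id = ss.customer_id
  AND dc.start_date <= DATE(ss.create_date)
  AND dc.end_date > DATE(ss.create_date)
""")

loaded = conn.execute(
    "SELECT customer_key, customer_count, run_id"
    " FROM factCustomerCount ORDER BY customer_key"
).fetchall()
```

With option 1, the equivalent work would be a loop on the client that issues one lookup per row; here the server sees the whole batch at once and can use its indexes.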
With that specified, link tMysqlInput to tMap, then open tMap to edit its connections.
We will use this tMap just like we used the map components for dimension processing. We'll add our auditing column,
called run_id. Add an output, then add the run_id column to it. Finally, drag all of the input columns to the output.
When you are done, your tMap should look like this:
Next, connect tMap to tMysqlOutput. Set the connection properties on tMysqlOutput from the repository to the
data warehouse connection, and set the table to "stageFactCustomerCount" (in quotation marks).
We could manually create the stageFactCustomerCount table, but then we would have to delete its data and drop
its indexes before we populate it with data. Instead, we can have the tMysqlOutput component drop and recreate
the table on every run, which accomplishes the same thing. To do this, set the Action on table to Drop table if
exists and create.
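To see why dropping and recreating the staging table "accomplishes the same thing," consider this small sketch. It uses Python's built-in sqlite3 module instead of MySQL, and the recreate_stage and stage_state helpers, the column list, and the index name are assumptions made up for the example.

```python
import sqlite3


def recreate_stage(conn):
    """Hypothetical stand-in for the 'Drop table if exists and create' action."""
    conn.execute("DROP TABLE IF EXISTS stageFactCustomerCount")
    conn.execute(
        "CREATE TABLE stageFactCustomerCount ("
        " customer_id INTEGER, store_id INTEGER,"
        " create_date TEXT, run_id INTEGER)"
    )


def stage_state(conn):
    """Return (number of rows, number of indexes) for the staging table."""
    rows = conn.execute(
        "SELECT COUNT(*) FROM stageFactCustomerCount"
    ).fetchall()[0][0]
    indexes = conn.execute(
        "PRAGMA index_list('stageFactCustomerCount')"
    ).fetchall()
    return rows, len(indexes)


conn = sqlite3.connect(":memory:")

# First run: the table does not exist yet, so the DROP is a no-op.
recreate_stage(conn)
conn.execute(
    "INSERT INTO stageFactCustomerCount VALUES (7, 1, '2006-02-14 22:04:36', 85)"
)
conn.execute(
    "CREATE INDEX idx_stage_customer_id ON stageFactCustomerCount (customer_id)"
)
after_first_run = stage_state(conn)

# Second run: dropping the table removes last run's rows and indexes together.
recreate_stage(conn)
after_second_run = stage_state(conn)
```

Each run starts with an empty, index-free table, so the load is not slowed by index maintenance and never sees stale rows.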
Now let's specify when we want to execute our subjob. Right-click on the tMysqlInput component of dimStaff, and
connect the On Subjob OK trigger to the tMysqlInput of the stageFactCustomerCount subjob.
When you are done, your job will look something like this:
Now we can begin adding indexes to stageFactCustomerCount and clearing data from factCustomerCount.
The tMysqlRow component in TOS can only execute one statement, but we need to execute three ALTER TABLE
statements and one TRUNCATE TABLE statement. We could include several tMysqlRow components, or write a single
stored procedure that executes everything we need, using just one tMysqlRow component. We'll use a stored
procedure to simplify things.
Switch to the terminal, and run the following command against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_preFactCustomerCount ()
BEGIN
  -- index the staging columns used in the fact joins
  ALTER TABLE stageFactCustomerCount add index(create_date);
  ALTER TABLE stageFactCustomerCount add index(customer_id);
  ALTER TABLE stageFactCustomerCount add index(store_id);
  -- empty the fact table before it is reloaded
  TRUNCATE TABLE factCustomerCount;
END
//
DELIMITER ;
Add a tMysqlRow component to your canvas. Set its database connection to the data warehouse, and set its query to
the following:
CODE TO TYPE:
call etl_preFactCustomerCount();
This procedure cannot execute before we populate stageFactCustomerCount with data. To enforce that order, right-click on the
tMysqlInput component of the stageFactCustomerCount subjob and link its On Subjob OK trigger to
tMysqlRow.
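The idea behind the procedure is simply to wrap several statements in one callable unit, so a component that sends a single statement can still trigger all of them. As a rough illustration only, the sketch below does the same thing with Python's built-in sqlite3 module. SQLite has neither stored procedures nor TRUNCATE TABLE, so a Python function and DELETE FROM stand in for them; the function name, index names, and trimmed-down tables are assumptions made up for this example.

```python
import sqlite3

# One script holding every statement the pre-load step needs.
PRE_FACT_CUSTOMER_COUNT = """
CREATE INDEX idx_stage_create_date ON stageFactCustomerCount (create_date);
CREATE INDEX idx_stage_customer_id ON stageFactCustomerCount (customer_id);
CREATE INDEX idx_stage_store_id ON stageFactCustomerCount (store_id);
DELETE FROM factCustomerCount;  -- SQLite has no TRUNCATE TABLE
"""


def etl_pre_fact_customer_count(conn):
    """Hypothetical stand-in for the stored procedure: one call, four statements."""
    conn.executescript(PRE_FACT_CUSTOMER_COUNT)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stageFactCustomerCount (customer_id INTEGER, store_id INTEGER,
                                     create_date TEXT, run_id INTEGER);
CREATE TABLE factCustomerCount (customer_key INTEGER, customer_count INTEGER);
INSERT INTO factCustomerCount VALUES (1, 1);  -- a row left over from a previous run
""")

etl_pre_fact_customer_count(conn)

index_names = sorted(
    row[1]
    for row in conn.execute(
        "PRAGMA index_list('stageFactCustomerCount')"
    ).fetchall()
)
fact_rows = conn.execute(
    "SELECT COUNT(*) FROM factCustomerCount"
).fetchall()[0][0]
```

After the call, the staging table carries three indexes and the fact table is empty, which is the state we want before the fact load starts.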
We are nearly ready to populate factCustomerCount, but we have one more small step to complete first. We are
planning on using the ELT components to load our fact table, but those components need up-to-date schemas for all
tables.
Back in lesson seven, we created our data warehouse database connection, and let TOS read the schema for several
of our tables. We have made several changes since then, so we need to update our schema.
It would be nice to include stageFactCustomerCount, but that table hasn't been created yet. The easiest way to create
the table is to execute our job, so run it now. After the job runs successfully, right-click on the Data Warehouse
connection in the metadata section of TOS, and choose Retrieve Schema.
We don't need to filter our schema, so click Next > at the bottom of the window:
In the next window, scroll through your database until you come across the objects for this course:
dimCustomer
dimDate
dimMovie
dimStaff
dimStore
etlLog
etlRuns
factCustomerCount
factRentalCount
factRentalDuration
factSales
stageFactCustomerCount
Make sure all of those objects are checked, then click Next >.
In the final screen, select each table on the left and then click on the Retrieve Schema button. You might see a dialog
that looks like this:
Click "OK." Repeat this process for each table to make sure you have the latest schema for all of them. When you are
done, click Finish.
The repository will still show the tables associated with our connection:
With our schemas updated, we are finally ready to populate factCustomerCount.
Start by putting a tELTMysqlMap component on your canvas. The tELTMysqlMap component acts as the
"controller," so you'll want to position it somewhere in the middle. To ensure this component executes after the
staging table has been loaded and indexed, link the On Subjob OK trigger of the previous tMysqlRow to tELTMysqlMap:
Set the database connection for tELTMysqlMap to the data warehouse, from the repository.
Our next task is to add a tELTMysqlInput component for each source table. Drop a tELTMysqlInput component
onto the canvas. Edit its properties - set its schema to the stageFactCustomerCount table from the repository,
under "Db Connections" and "DataWarehouse:"
With the metadata set, we can link our tELTMysqlInput component to tELTMysqlMap. Right-click on
tELTMysqlInput and choose Link, then stageFactCustomerCount:
Repeat this process for the three remaining tables - dimDate, dimStore and dimCustomer. When you're done,
your job should look something like this:
We have inputs to our tELTMysqlMap "controller" component, but what about outputs? We only need one output for
this application, so drag a single tELTMysqlOutput component to the canvas. Right-click on tELTMysqlMap,
choose Link, and then choose *new output*:
Don't worry about warnings and errors just yet - we are not done configuring our components. You do want to set the
database connection for tELTMysqlOutput now, however; set it to your personal database.
tELTMysqlMap is very similar to tMap, with a couple of exceptions:
1. tMap has a variables section, but tELTMysqlMap does not. This is because tELTMysqlMap is
executed on the MySQL server directly, which does not have any understanding of variables from TOS.
2. tELTMysqlMap uses SQL for its expressions, whereas tMap uses Java for its expressions.
Our links are now complete, but our component is in error. This is because we haven't specified how we want TOS to
combine our inputs into a single output. Double-click on tELTMysqlMap.
Select the stageFactCustomerCount table, and specify ss as the alias:
This ss alias will be carried into the generated SQL statement. Check it out - at the bottom of the window, click on Generated
SQL Select query for 'table' output:
OBSERVE:
SELECT
FROM
stageFactCustomerCount ss
Now for the next table, dimDate, add an alias. Name the alias dd.
At this point, dimDate is not joined to stageFactCustomerCount. To specify how these tables are joined we will
do two things. First, click on the triangle drop-down to change the join type from (IMPLICIT JOIN) to INNER JOIN:
Next, check the Explicit Join box for the date row under dimDate, specify = as the Operator, and specify the
Foreign Column as DATE(ss.create_date). When you are done you'll see the join:
Note
There isn't any way to save your job when you are inside of the tELTMysqlMap editor. Be sure to click
OK and save your work often - you'll lose it if you accidentally hit Cancel!
To make more room on your screen you can collapse the dimDate table by clicking on the Minimize/Maximize
button:
Next, add an alias called dc for the dimCustomer table. Collapse the other tables to make room. Then do the
following:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the customer_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.customer_id.
But wait! What about the Type-2 columns: start_date and end_date?
Great question!
It isn't enough to specify the customer_id for this join, because there could be multiple rows in dimCustomer with the
same customer_id. We also need to join on start_date and end_date.
We'll join those columns against dimDate because it has a nice date column with the exact type that exists in
dimCustomer. As for the join operator, we are looking for a row in dimCustomer that has a start_date less than or
equal to the row's create_date (matched through dd.date) and an end_date greater than it. To set this up, check the
Explicit Join box for the start_date and end_date rows, set the operator to <= for start_date and to > for end_date,
and set the Foreign column / expression for both to dd.date.
On a timeline, we need to pick the row in dimCustomer that covers a specific time period:
The query at the bottom of the window should look like this:
OBSERVE:
SELECT
FROM
stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
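The choice of <= for start_date and > for end_date matters on the day a dimension row changes. The sketch below illustrates this with Python's built-in sqlite3 module standing in for MySQL. It assumes, as an illustration, that the old version's end_date equals the new version's start_date; the sample data and the query constants are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER,
                          start_date TEXT, end_date TEXT);
-- Two versions of customer 7; the change happened on 2006-03-01.
INSERT INTO dimCustomer VALUES (1, 7, '2000-01-01', '2006-03-01'),
                               (2, 7, '2006-03-01', '2099-01-01');
""")

# The comparison used in the lesson: start inclusive, end exclusive.
HALF_OPEN = (
    "SELECT customer_key FROM dimCustomer"
    " WHERE customer_id = ? AND start_date <= ? AND end_date > ?"
    " ORDER BY customer_key"
)
# A tempting alternative that treats both ends as inclusive.
INCLUSIVE = (
    "SELECT customer_key FROM dimCustomer"
    " WHERE customer_id = ? AND ? BETWEEN start_date AND end_date"
    " ORDER BY customer_key"
)

boundary = "2006-03-01"  # a fact dated exactly on the change date
half_open_keys = [r[0] for r in conn.execute(HALF_OPEN, (7, boundary, boundary))]
inclusive_keys = [r[0] for r in conn.execute(INCLUSIVE, (7, boundary))]
```

With the half-open comparison the fact matches exactly one version (key 2). With the inclusive comparison it matches both versions, which would insert the same fact row twice.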
Finally, add an alias called ds for the dimStore table. Once again, this dimension is a Type-2. Execute the following
steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the store_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.store_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Note
Are you missing a table? Make sure you set the join type to INNER JOIN.
The only thing left to do is to specify our outputs. But before we do, let's review the structure of factCustomerCount.
Run the following command against your personal database:
CODE TO TYPE:
explain factCustomerCount;
You'll see these results:
OBSERVE:
mysql> explain factCustomerCount;
+-------------------+---------+------+-----+---------+----------------+
| Field             | Type    | Null | Key | Default | Extra          |
+-------------------+---------+------+-----+---------+----------------+
| customerCount_key | int(11) | NO   | PRI | NULL    | auto_increment |
| date_key          | int(11) | NO   |     | NULL    |                |
| customer_key      | int(11) | NO   |     | NULL    |                |
| store_key         | int(11) | NO   |     | NULL    |                |
| customer_count    | int(11) | NO   |     | 1       |                |
| run_id            | int(11) | NO   |     | 1       |                |
+-------------------+---------+------+-----+---------+----------------+
6 rows in set (0.05 sec)
We need to specify all of these columns in tELTMysqlMap. Two columns are not present in our inputs - our surrogate key,
customerCount_key, and our customer_count column.
ORDER IS IMPORTANT for these columns, so we'll start by adding customerCount_key. To add this column, click
the add column button:
Name the column customerCount_key, and make sure you keep the Nullable box checked. Under the Expression
column above, type in NULL:
Note
To reiterate, order matters for this section, so use care when adding these columns. You can always
reorder the columns using the up and down arrow buttons at the bottom of the window.
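Order matters because an INSERT ... SELECT without a column list fills the target columns strictly by position, whatever the output columns are named. Here is a small sketch of that behavior, using Python's built-in sqlite3 module in place of MySQL. The literal values (date_key 2236, customer_key 5, and so on) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE factCustomerCount ("
    " customerCount_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " date_key INTEGER, customer_key INTEGER, store_key INTEGER,"
    " customer_count INTEGER, run_id INTEGER)"
)

# Correct order: key, date_key, customer_key, store_key, customer_count, run_id.
# The NULL in the first position lets the database assign the surrogate key.
conn.execute("INSERT INTO factCustomerCount SELECT NULL, 2236, 5, 1, 1, 85")

# Last two expressions swapped: the database accepts it without complaint,
# because both columns are integers.
conn.execute("INSERT INTO factCustomerCount SELECT NULL, 2236, 5, 1, 85, 1")

rows = conn.execute(
    "SELECT customerCount_key, customer_count, run_id"
    " FROM factCustomerCount ORDER BY customerCount_key"
).fetchall()
```

The second row ends up with a customer_count of 85 and a run_id of 1. Nothing fails, so a mistake like this only shows up later as wrong numbers in reports.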
Expand dimDate, then drag date_key over to the output table, right underneath customerCount_key:
In the same way, drag customer_key from dimCustomer and store_key from dimStore to the output, in that order.
The next column is customerCount. Add this by clicking the add column button. Name the column customerCount, and
make sure you keep the Nullable box checked. Under the Expression column above, type in 1.
Finally, drag the run_id column to the output. Your completed map will look like this:
Note
You can ignore the warning on tELTMysqlMap - TOS is just confused by the table aliases.
To clear the remaining error, double-click tELTMysqlOutput, then click on Sync Columns:
Save and run your job. As long as your components are not in error, you'll see output that looks like this:
OBSERVE:
Starting job ProcessDataWarehouse at 21:07 22/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_key ,
1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date =
DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds
ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
Job ProcessDataWarehouse ended at 21:07 22/12/2008. [exit code=0]
You did it! factCustomerCount now has data in it!
We covered a whole lot in this lesson! We'll finish implementing the rest of the facts in the next lesson. See you in a bit!
factSales
Back in lesson four, we implemented the table for factSales. Let's review its structure by looking at the
CREATE TABLE statement we used:
OBSERVE:
CREATE TABLE factSales
(
sales_key     INT NOT NULL AUTO_INCREMENT,
date_key      INT NOT NULL REFERENCES dimDate,
customer_key  INT NOT NULL REFERENCES dimCustomer,
movie_key     INT NOT NULL REFERENCES dimMovie,
store_key     INT NOT NULL REFERENCES dimStore,
sales_amount  decimal(5,2) NOT NULL,
PRIMARY KEY (sales_key)
);
There is only one numeric measure in this fact: sales_amount. The rest are foreign keys that point to dimensions.
To process our sales fact, we will perform the following steps:
1. Move data from the source system into the staging table called stageFactSales.
2. Create indexes on the staging table.
3. Use the ELT family of components in TOS to populate factSales.
Let's get started by populating our staging table, stageFactSales. To do this, drag these three components to your
canvas: tMysqlInput, tMap and tMysqlOutput.
Edit the properties for tMysqlInput, setting its connection to the sakila database. Next, edit its query. Use this query to
retrieve sales data:
CODE TO TYPE:
select p.payment_id, p.amount, p.payment_date,
p.customer_id, i.film_id, i.store_id, p.staff_id
from
payment p
join rental r on ( p.rental_id = r.rental_id )
join inventory i on ( r.inventory_id = i.inventory_id )
WHERE p.customer_id > 10
Note
You might be wondering about the differences between float, double, decimal and integer numbers.
Float and double are approximate numeric types, meaning the computer may not store your value
exactly as you enter it "under the hood." So, even if two float numbers look the same to you, the
computer may not store them the same way, and they may not be equal to one another, at least
according to the computer. Decimal and integer are exact numeric types, so what you see is what the
computer stores.
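You can see the difference between approximate and exact types outside the database, too. The short sketch below uses Python's float (a binary floating-point type, like MySQL's DOUBLE) and its decimal.Decimal class (an exact decimal type, similar in spirit to MySQL's DECIMAL). The amounts are made up for illustration.

```python
from decimal import Decimal

# Approximate: 0.1 and 0.2 cannot be stored exactly in binary floating point,
# so their sum is not exactly 0.3.
float_total = 0.1 + 0.2
float_matches = (float_total == 0.3)

# Exact: decimal values are stored digit for digit.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_matches = (decimal_total == Decimal("0.30"))
```

Here float_matches is False while decimal_matches is True, which is why an amount column such as sales_amount is declared as a decimal rather than a float.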
With that specified, link tMysqlInput to tMap, then open tMap to edit its connections.
We will use this tMap just like we used the map components for dimension processing. Add our auditing column,
called run_id: add an output, then add the run_id column to it. Finally, drag all of the input columns to the output. When
you're done, your tMap should look somewhat like this:
Next, connect tMap to tMysqlOutput. Set the connection properties on tMysqlOutput to the data warehouse
connection from the repository, and set the table to "stageFactSales" (within quotation marks).
We could manually create the stageFactSales table, but then we would have to delete its data and drop its indexes
before we populate it with data. Instead, we can have the tMysqlOutput component drop and recreate the table on
every run, which accomplishes the same thing. To do this, set the Action on table to Drop table if exists
and create.
Right-click on the tMysqlInput component of factRentalCount, and connect the On Subjob OK trigger to the
tMysqlInput of the stageFactSales subjob.
When you're done, your job will look something like this:
Next, we need to create a stored procedure to create our indexes and clear the factSales table. Switch to the
terminal, and run the following query:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_preFactSales ()
BEGIN
  -- index the staging columns used in the fact joins
  ALTER TABLE stageFactSales add index(payment_id);
  ALTER TABLE stageFactSales add index(payment_date);
  ALTER TABLE stageFactSales add index(customer_id);
  ALTER TABLE stageFactSales add index(film_id);
  ALTER TABLE stageFactSales add index(store_id);
  -- empty the fact table before it is reloaded
  TRUNCATE TABLE factSales;
END
//
DELIMITER ;
Now, add a tMysqlRow component to your canvas. Set its database connection to the data warehouse, and set its
query to the following:
CODE TO TYPE:
call etl_preFactSales();
This procedure cannot execute before we populate stageFactSales with data. Right-click on the tMysqlInput
component of the stageFactSales subjob and link its On Subjob OK trigger to tMysqlRow.
In the last lesson, we updated our table schema to include stageFactCustomerCount. We need to do the same for
stageFactSales. First, run your job to create the table.
As long as you don't have any errors in your job, you'll see something like the following output:
OBSERVE:
Starting job ProcessDataWarehouse at 13:00 21/12/2008.
Job ProcessDataWarehouse ended at 13:00 21/12/2008. [exit code=0]
After your job runs successfully, you can update the database schema by right-clicking on the Data Warehouse
connection in the metadata section of TOS, and choosing Retrieve Schema:
We don't need to filter our schema, so click Next > at the bottom of the window:
In the next window, scroll through your database until you come across the objects for this course, then check
stageFactSales.
In the final screen, select stageFactSales on the left and then click on the Retrieve Schema button.
Now we're ready to create our map. Drag a tELTMysqlMap component to your canvas. To ensure this component
executes after the staging table has been loaded and indexed, link the On Subjob OK trigger of the previous tMysqlRow to
tELTMysqlMap:
Note
Be sure to set the database connection for tELTMysqlMap to the data warehouse, from the repository.
Next, we need to add tELTMysqlInput components for each source table. Start by adding a component for
stageFactSales:
With the metadata set, we can link our tELTMysqlInput component to tELTMysqlMap. Right-click on
tELTMysqlInput and choose Link, then stageFactSales:
We have inputs to our tELTMysqlMap "controller" component, but what about outputs? We only need one output for
this application, so drag a single tELTMysqlOutput component to the canvas. Right-click on tELTMysqlMap,
choose Link, then choose *new output*:
Select the stageFactSales table, and specify ss as the alias:
This ss alias will be carried into the generated SQL statement. At the bottom of the window, click on Generated SQL Select
query for 'table' output:
OBSERVE:
SELECT
FROM
stageFactSales ss
Now add an alias to the next table, dimDate. Name the alias dd.
dimDate is not yet joined to stageFactSales. To specify how these tables are joined we will do two things: first,
click on the triangle drop-down to change the join type from (IMPLICIT JOIN) to INNER JOIN:
Next, check the Explicit Join box for the date row under dimDate, specify = as the Operator, and specify the
Foreign Column as DATE(ss.payment_date). When you are done you'll see the join:
Now is a good time to save. (Actually, it's almost always a good time to save.) Click "OK" to close the map window,
then save your work.
Let's move on to our next input. Add an alias for the table dimMovie, called dm.
With dimDate out of the way, we can specify the join to dimMovie. Now take these steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the film_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.film_id.
When you are done, your join will show:
Next, add an alias called dc for the dimCustomer table and join it just as we did for factCustomerCount:
customer_id = ss.customer_id, start_date <= dd.date, and end_date > dd.date.
We're almost there! Add an alias called dst for the dimStaff table. Collapse the other tables to make room. This
dimension is also a Type-2, so we'll have to join on start_date and end_date again. Now execute these steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the staff_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.staff_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Finally, add an alias called ds for the dimStore table. Once again, this dimension is a Type-2. Execute the following
steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the store_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.store_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Once you are done with all that, your query will look like this:
OBSERVE:
SELECT
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date
AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )
Note
Are you missing a table? Make sure you set the join type to INNER JOIN.
The only thing left to do now is to specify our outputs. Before we do that though, let's review the structure of
factSales. Run the following command against your personal database:
CODE TO TYPE:
explain factSales;
You'll see the following results:
OBSERVE:
mysql> explain factSales;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| sales_key    | int(11)      | NO   | PRI | NULL    | auto_increment |
| date_key     | int(11)      | NO   |     | NULL    |                |
| customer_key | int(11)      | NO   |     | NULL    |                |
| movie_key    | int(11)      | NO   |     | NULL    |                |
| store_key    | int(11)      | NO   |     | NULL    |                |
| staff_key    | int(11)      | NO   |     | NULL    |                |
| sales_amount | decimal(6,2) | YES  |     | NULL    |                |
| run_id       | int(11)      | NO   |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
8 rows in set (0.10 sec)
We need to specify all eight of these columns in tELTMysqlMap. The only column that isn't present is our surrogate
key, sales_key. To add this column, click the add column button:
Name the column sales_key, and make sure you keep the Nullable box checked. Under the Expression column
above, type in NULL.
Note
One more time: order matters for this section, so use care when adding these columns. You can
always reorder the columns using the up and down arrow buttons at the bottom of the window.
First, expand dimDate, then drag date_key over to the output table, right under sales_key. Then drag the
remaining columns to the output in this order:
1. customer_key
2. movie_key
3. store_key
4. staff_key
5. amount
6. run_id
Take a look at the generated query - it should look something like this:
OBSERVE:
SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date
AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )
We're done with tELTMysqlMap, so click "OK" to close the window, and save your work.
There is one thing we should check before we go much further - we should run EXPLAIN on the generated query to see
how it will run. Run the following command against your personal database:
CODE TO TYPE:
EXPLAIN SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date
AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date );
OBSERVE:
mysql> explain SELECT
-> null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
   ss.amount , ss.run_id
-> FROM
-> stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
-> INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
-> INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date
   AND dc.end_date > dd.date )
-> INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
   AND dst.end_date > dd.date )
-> INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
   AND ds.end_date > dd.date );
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
| id | select_type | table | type | possible_keys                | key     | key_len | ref                 | rows  | Extra       |
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
|  1 | SIMPLE      | dst   | ALL  | NULL                         | NULL    | NULL    | NULL                |     2 |             |
|  1 | SIMPLE      | ds    | ALL  | NULL                         | NULL    | NULL    | NULL                |     2 |             |
|  1 | SIMPLE      | dm    | ALL  | NULL                         | NULL    | NULL    | NULL                |  1000 |             |
|  1 | SIMPLE      | ss    | ref  | customer_id,film_id,store_id | film_id | 4       | certjosh.dm.film_id |    16 | Using where |
|  1 | SIMPLE      | dc    | ALL  | NULL                         | NULL    | NULL    | NULL                |  1178 | Using where |
|  1 | SIMPLE      | dd    | ALL  | NULL                         | NULL    | NULL    | NULL                | 18628 | Using where |
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
6 rows in set (0.08 sec)
mysql>
This looks pretty bad - our query scans every dimension table in full (type ALL) instead of using an index. That's
because we never indexed any columns (other than the primary key) when we created our dimensions.
Fortunately this problem has a quick fix - we'll add indexes to the columns we join on. We'll omit start_date and
end_date from the indexes for now, just to keep things shorter. Run the following commands against your personal
database:
CODE TO TYPE:
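alter table dimCustomer add index(customer_id);
alter table dimDate add index(date);
alter table dimMovie add index(film_id);
alter table dimStaff add index(staff_id);
alter table dimStore add index(store_id);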
OBSERVE:
mysql> alter table dimCustomer add index(customer_id);
Query OK, 1178 rows affected (0.08 sec)
Records: 1178 Duplicates: 0 Warnings: 0
mysql> alter table dimDate add index(date);
Query OK, 18628 rows affected (0.15 sec)
Records: 18628 Duplicates: 0 Warnings: 0
mysql> alter table dimMovie add index(film_id);
Query OK, 1000 rows affected (0.09 sec)
Records: 1000 Duplicates: 0 Warnings: 0
mysql> alter table dimStaff add index(staff_id);
Query OK, 2 rows affected (0.09 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> alter table dimStore add index(store_id);
Query OK, 2 rows affected (0.05 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql>
Let's try the EXPLAIN again. Run the following command against your personal database:
CODE TO TYPE:
EXPLAIN SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key , ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date );
It looks like our query is now using most of the indexes we created. The full scans on dimStaff and dimStore should not
be a problem, since there are only two rows in each of those tables.
OBSERVE:
mysql> EXPLAIN SELECT
-> null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key , ss.amount , ss.run_id
-> FROM
-> stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
-> INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
-> INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date )
-> INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date AND dst.end_date > dd.date )
-> INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date );
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
| id | select_type | table | type | possible_keys                | key         | key_len | ref                     | rows  | Extra       |
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
|  1 | SIMPLE      | ss    | ALL  | customer_id,film_id,store_id | NULL        | NULL    | NULL                    | 15766 |             |
|  1 | SIMPLE      | dm    | ref  | film_id                      | film_id     | 2       | certjosh.ss.film_id     |     1 | Using where |
|  1 | SIMPLE      | dd    | ref  | date                         | date        | 3       | func                    |     1 | Using where |
|  1 | SIMPLE      | dc    | ref  | customer_id                  | customer_id | 4       | certjosh.ss.customer_id |     2 | Using where |
|  1 | SIMPLE      | dst   | ALL  | staff_id                     | NULL        | NULL    | NULL                    |     2 | Using where |
|  1 | SIMPLE      | ds    | ALL  | store_id                     | NULL        | NULL    | NULL                    |     2 | Using where |
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
6 rows in set (0.99 sec)
There is one last thing we need to do before we run our job. There is currently an error on tELTMysqlOutput:
Note
You can ignore the warning on tELTMysqlMap -- TOS is just confused by the table aliases.
To fix this error, double-click tELTMysqlOutput, then click on Sync Columns:
Now run the job. As long as your components are not in error, you'll see output that looks like this:
OBSERVE:
Starting job ProcessDataWarehouse at 18:56 21/12/2008.
Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , dm.movie_key , ds.s
tore_key , dst.staff_key , ss.amount , ss.run_id FROM stageFactSales ss INNER JOIN d
imDate dd ON( dd.date = DATE(ss.payment_date) ) INNER JOIN dimMovie dm ON( dm.film_
id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND
dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStaff dst ON( d
st.staff_id = ss.staff_id AND dst.start_date <= dd.date AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date A
ND ds.end_date > dd.date ))
--> 15766 rows inserted.
Job ProcessDataWarehouse ended at 18:56 21/12/2008. [exit code=0]
Great job! We're really rolling now. Keep it up and see you shortly!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
Special Facts
DBA 3: Data Warehousing Lesson 12
Hello! In the last lesson we implemented our facts. Our data warehouse is fairly straightforward. We read data from one
database, do a few basic transformations, and load the data into the warehouse.
The DVD Rental store only tracks sales. It doesn't have to worry about invoicing customers, shipping products, internet orders,
or tracking costs. Other businesses are much more complex. In this lesson, we'll investigate options for dealing with other
types of data and situations that occur in data warehouses.
Missing Keys
In the last lesson we loaded our fact tables, assuming that we could resolve all of the foreign keys required for the
dimensions.
But what if a key cannot be found? Let's use our stageFactCustomerCount and factCustomerCount tables to help
us work on this problem. First, run your job to make sure your tables contain data. Take a look at the output (some
lines have been omitted):
OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_ke
y , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date =
DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_i
d AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds O
N( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date
))
--> 589 rows inserted.
We know that 589 rows were put into factCustomerCount. How many were in stageFactCustomerCount? Run this
command against your personal database:
CODE TO TYPE:
select count(*) from stageFactCustomerCount;
Oh my! It looks like stageFactCustomerCount has 599 rows!
OBSERVE:
mysql> select count(*) from stageFactCustomerCount;
+----------+
| count(*) |
+----------+
|      599 |
+----------+
1 row in set (0.05 sec)
mysql>
Our join has excluded 10 rows. How do we find the missing rows?
A query that uses LEFT JOIN instead of INNER JOIN, combined with IS NULL checks, will return rows from
stageFactCustomerCount that do not have corresponding rows in either dimDate, dimCustomer or dimStore.
Non-matching rows will have a NULL date_key, a NULL customer_key, or a NULL store_key.
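The LEFT JOIN / IS NULL idea can be sketched in a few lines. This is only an illustration: an in-memory SQLite database (through Python's sqlite3 module) stands in for your MySQL database so the example runs on its own, only the customer dimension is checked, and the table contents are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the course's MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE stageFactCustomerCount (customer_id INTEGER, create_date TEXT);
    INSERT INTO dimCustomer VALUES (1, 100), (2, 101);
    -- customer 999 has no row in the dimension (made-up data)
    INSERT INTO stageFactCustomerCount VALUES
        (100, '2008-01-01'), (101, '2008-01-02'), (999, '2008-01-03');
""")

# LEFT JOIN keeps every staging row; IS NULL then isolates the rows for
# which no dimension row matched.
missing = conn.execute("""
    SELECT ss.customer_id, ss.create_date
    FROM stageFactCustomerCount ss
    LEFT JOIN dimCustomer dc ON (dc.customer_id = ss.customer_id)
    WHERE dc.customer_key IS NULL
""").fetchall()

print(missing)
```

An INNER JOIN on the same data would silently drop customer 999, which is how rows go missing between the staging table and the fact table.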
We can use MySQL's IFNULL function to be sure our join was successful. If the join failed, dm.movie_key will
be null, so the IFNULL function will return -1. If the join worked, dm.movie_key will have a value, which will be
returned by IFNULL.
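Here is a small sketch of how IFNULL behaves together with a LEFT OUTER JOIN. SQLite (which also has an IFNULL function) stands in for MySQL, the tables are cut down to the columns that matter, and the data is invented for the illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimMovie (movie_key INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE stageFactSales (film_id INTEGER, amount REAL);
    INSERT INTO dimMovie VALUES (1, 10), (2, 20);
    -- film 77 does not exist in dimMovie
    INSERT INTO stageFactSales VALUES (10, 2.99), (20, 4.99), (77, 999.99);
""")

# The LEFT OUTER JOIN keeps the unmatched sale; IFNULL turns its NULL
# movie_key into the placeholder key -1.
rows = conn.execute("""
    SELECT IFNULL(dm.movie_key, -1) AS movie_key, ss.amount
    FROM stageFactSales ss
    LEFT OUTER JOIN dimMovie dm ON (dm.film_id = ss.film_id)
    ORDER BY ss.amount
""").fetchall()

print(rows)
```

With an INNER JOIN the third sale would never reach the fact table; with the outer join it arrives carrying the key -1, so it can be found and corrected later.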
Next, change the expression on the factSales output from dm.movie_key to IFNULL(dm.movie_key, -1):
OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.s
tore_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd
ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.custom
er_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date
) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <=
dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.mo
vie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stage
FactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT
OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer
dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.e
nd_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AN
D dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore
ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_dat
e > dd.date ))
--> 15767 rows inserted.
Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]
Switch back to MySQL mode to see if your change worked. Run this command against your personal
database:
CODE TO TYPE:
select * from factSales where movie_key=-1;
If your job ran successfully, you'll see this:
OBSERVE:
mysql> select * from factSales where movie_key=-1;
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sales_amount | run_id |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
|      6858 | 20050101 |          884 |        -1 |         1 |         1 |       999.99 |     -1 |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
1 row in set (0.08 sec)
mysql>
It looks like your left join saved the day!
Aggregating
Data warehouses are built with the presumption that data needs to be aggregated in different ways. If the increment we
are using in our data warehouse is one day, and we query the warehouse to see sales in May, we need to SUM(Sales)
for each day in May.
This works for most types of facts, but what if the fact under consideration is an account balance? Account balances
are usually stored as point in time values. Take a look:
Date   Description        Amount    Balance
01/01  ...                           592.20
01/02  ...                 25.90     566.30
01/03  ...                 19.50     546.80
01/04  Consulting Work   1500.00    2046.80
The account balance on 01/03 is 546.80, and the account balance on 01/04 is 2046.80. If today is January 4th, the
account balance for the month of January is 2046.80, not 592.20 + 566.30 + 546.80 + 2046.80 = 3752.10. Likewise,
the account balance for 2008 is also 2046.80, not 3752.10.
The proper aggregate for an account balance is not SUM, it is LAST.
If you take a look at MySQL's GROUP BY functions, you'll notice there is MAX, MIN, and of course SUM, but there is no
LAST or FIRST. MySQL does not currently provide these as aggregate functions.
Getting around this problem is tricky with MySQL. Our only option is to use ORDER BY to sort the results by date in
descending order, so the newest record will appear first, and to LIMIT our result to one row. A sample query to get the
last value from factSales would look like this:
CODE TO TYPE:
SELECT *
FROM factSales
ORDER BY date_key DESC
LIMIT 0, 1;
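To see the difference between SUM and "LAST" on the balances from the table above, here is a minimal sketch. It assumes a hypothetical factBalance table with year-2008 date keys, and it uses an in-memory SQLite database in place of MySQL so that it can run on its own.

```python
import sqlite3

# Assumptions: factBalance is a hypothetical table, and in-memory SQLite
# stands in for MySQL. The balances are the four values from the lesson.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE factBalance (date_key INTEGER, balance REAL);
    INSERT INTO factBalance VALUES
        (20080101, 592.20), (20080102, 566.30),
        (20080103, 546.80), (20080104, 2046.80);
""")

# SUM adds up four snapshots of the same account, which is meaningless.
(summed,) = conn.execute("SELECT SUM(balance) FROM factBalance").fetchone()

# "LAST": restrict to the period, sort newest first, keep a single row.
(last,) = conn.execute("""
    SELECT balance
    FROM factBalance
    WHERE date_key BETWEEN 20080101 AND 20080131
    ORDER BY date_key DESC
    LIMIT 1
""").fetchone()

print(round(summed, 2), last)
```

The WHERE clause is what turns "the last value in the table" into "the last value for January"; widening the range to the whole year gives the same answer, just as the text describes.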
Deaggregating Data
We already saw that aggregating certain types of data can pose some problems. In some situations, data may already
be aggregated, causing a different type of problem.
Suppose the DVD store started shipping packages. Shippers usually want to know the weight of packages, since it is
used to calculate shipping cost. If someone orders four DVDs at the same time, those four DVDs are combined into a
single package and sent to the customer. Their order may look something like this:
Title                  Price
DADDY PITTSBURGH        9.99
TITANIC BOONDOCK        5.99
NEWTON LABYRINTH        5.99
APOLLO TEEN             9.99
Shipping & Handling     5.90
== Total ==            37.86
Our business user wants to know, for example, how much the shipping costs were for the movie APOLLO TEEN.
Looking at our data, we only know that it cost $5.90 to ship APOLLO TEEN along with DADDY PITTSBURGH, TITANIC
BOONDOCK and NEWTON LABYRINTH.
Our business user uses this formula to calculate shipping costs:
Shipping on an item = Shipping & Handling / # of Items
With this formula in mind, we can calculate the shipping & handling on each individual item in the order:
Title               Price   Shipping
DADDY PITTSBURGH     9.99   5.90/4 = 1.475
TITANIC BOONDOCK     5.99   5.90/4 = 1.475
NEWTON LABYRINTH     5.99   5.90/4 = 1.475
APOLLO TEEN          9.99   5.90/4 = 1.475
== Total ==         37.86
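The formula above can be applied in a single query. The sketch below is one possible way to do it, using hypothetical orderItem and orderShipping tables (they are not part of the course database) and an in-memory SQLite database in place of MySQL.

```python
import sqlite3

# Assumptions: orderItem and orderShipping are hypothetical tables, and
# in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderItem (order_id INTEGER, title TEXT, price REAL);
    CREATE TABLE orderShipping (order_id INTEGER, shipping REAL);
    INSERT INTO orderItem VALUES
        (1, 'DADDY PITTSBURGH', 9.99), (1, 'TITANIC BOONDOCK', 5.99),
        (1, 'NEWTON LABYRINTH', 5.99), (1, 'APOLLO TEEN', 9.99);
    INSERT INTO orderShipping VALUES (1, 5.90);
""")

# Shipping on an item = Shipping & Handling / # of Items.
# The inner query counts the items in each order.
rows = conn.execute("""
    SELECT oi.title, oi.price, os.shipping / n.items AS item_shipping
    FROM orderItem oi
    INNER JOIN orderShipping os ON (os.order_id = oi.order_id)
    INNER JOIN (SELECT order_id, COUNT(*) AS items
                FROM orderItem
                GROUP BY order_id) n ON (n.order_id = oi.order_id)
    ORDER BY oi.title
""").fetchall()

for title, price, item_shipping in rows:
    print(title, price, round(item_shipping, 3))
```

Splitting the cost evenly is only the rule this business user chose. A different rule, such as splitting by weight or by price, would change the expression in the SELECT list but not the overall shape of the query.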
Early Arriving Facts
The customer's purchases are entered into the data warehouse on the morning of January 2nd; however, no details
exist in the customer dimension (under card #1059259) until January 5th.
This type of situation is slightly different than Missing Keys, because the keys are not exactly missing, they're just late.
How do you deal with this type of situation? First, you may have to alter the rules on dimCustomer to allow NULLs in
most columns. The only data we know about "late" customers is the card number (because it is the primary key).
Next, change your fact process like so:
1. Go through every fact row, check to see if the customer exists in dimCustomer (perhaps using the LEFT
JOIN/IS NULL techniques from earlier in this lesson).
2. For every missing dimension row, insert a new row into the dimension, possibly using default values
such as "PENDING ACCOUNT" for first and last name.
3. Once we are certain the corresponding dimension exists, add the fact row to the table.
Note
We assume that dimCustomer has already been processed and is up-to-date. If your dimension isn't
current, then a lot of fact data is going to appear to be late!
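The three steps above could be sketched as follows. Everything here is an assumption made for illustration: the column names (card_number, sale_date), the cut-down table layouts, the first customer, and the use of an in-memory SQLite database instead of MySQL. Only the "PENDING ACCOUNT" defaults, the card number 1059259, and the overall flow come from the lesson.

```python
import sqlite3

# Assumptions: simplified hypothetical tables; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimCustomer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        card_number  INTEGER NOT NULL,
        first_name TEXT, last_name TEXT, address TEXT, phone TEXT,
        start_date TEXT, end_date TEXT);
    CREATE TABLE stageFactSales (card_number INTEGER, sale_date TEXT, amount REAL);
    CREATE TABLE factSales (customer_key INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO dimCustomer (card_number, first_name, last_name, start_date, end_date)
        VALUES (1000001, 'Known', 'Customer', '2008-01-01', '2099-12-31');
    INSERT INTO stageFactSales VALUES
        (1000001, '2008-01-02', 9.99), (1059259, '2008-01-02', 5.99);
""")

# Steps 1 and 2: LEFT JOIN / IS NULL finds cards with no dimension row,
# and each one gets a placeholder row with default names.
conn.execute("""
    INSERT INTO dimCustomer (card_number, first_name, last_name, start_date, end_date)
    SELECT DISTINCT ss.card_number, 'PENDING', 'ACCOUNT', '2008-01-01', '2099-12-31'
    FROM stageFactSales ss
    LEFT JOIN dimCustomer dc ON (dc.card_number = ss.card_number)
    WHERE dc.customer_key IS NULL
""")

# Step 3: every staged fact row can now resolve a customer key.
conn.execute("""
    INSERT INTO factSales
    SELECT dc.customer_key, ss.sale_date, ss.amount
    FROM stageFactSales ss
    INNER JOIN dimCustomer dc
        ON (dc.card_number = ss.card_number
            AND dc.start_date <= ss.sale_date AND dc.end_date > ss.sale_date)
""")

pending = conn.execute(
    "SELECT first_name, last_name FROM dimCustomer WHERE card_number = 1059259"
).fetchall()
loaded = conn.execute("SELECT COUNT(*) FROM factSales").fetchone()[0]
print(pending, loaded)
```

Because the placeholder is inserted before the facts are loaded, the INNER JOIN in step 3 no longer drops the late customer's sale.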
Card #   First Name  Last Name  Address  Phone  Start Date  End Date
1059259  PENDING     ACCOUNT    NULL     NULL   01-Jan      31-DEC-2099
So, what happens on January 5th, when the dimension data finally catches up to the fact data?
Nothing.
Our existing dimension process will see the customer's details, and update the dimension accordingly. After January
5th, there will be two rows in the database for card #1059259:
Card #   First Name  Last Name  Address  Phone  Start Date  End Date
1059259  PENDING     ACCOUNT    NULL     NULL   01-Jan      05-Jan
1059259  Sue         Shopper    ...      ...    05-Jan      31-DEC-2099
This reflects history, exactly as it happened. Between January 1st and January 5th we didn't know the details for the
customer with card #1059259, and after January 5th we did.
We covered a lot of information in this lesson! Good job. In the next lesson we'll examine some common queries that people
run against data warehouses. See you there!
Note
In the next course, DBA 4, you'll learn all about using MDX (Multidimensional Expressions) to query data
warehouses, so stay tuned!
Viewing Data
In the first lesson, we discussed the goals of a data warehouse. We want to create:
a separate system that won't interrupt business critical operational systems.
a single point of access for all analytical queries.
a unified, consistent view of underlying data (even data from external systems).
a straightforward way to analyze trends (to see the way sales compare from month to month).
Our current structure of dimensions and facts is consistent and straightforward, but we can do better. Columns like
run_id are not important to end users, and may even be confusing to them. Tables like etlRuns don't matter to
anyone except database administrators, so they should be hidden from end users. Views let us do both.
Ease of use aside, views also provide other benefits:
Some information in the data warehouse may be very sensitive, so views can be used to provide row level
security.
Views can be used to provide consistency to the data warehouse, especially as underlying tables grow in
complexity and undergo changes.
Columns can be renamed to make things easier to understand.
For our views, we will:
name them with a common prefix - Fact_ for fact tables, and Dimension_ for dimension tables.
omit surrogate keys from fact tables (such as sales_key).
omit start_date and end_date from Type-2 slowly changing dimensions.
omit run_id from all tables.
keep foreign key columns (the _key columns) unchanged.
keep keys from source systems (like customer_id) unchanged.
rename fact and dimension columns to more readable equivalents (for example, customer_count would
become Customer Count).
Let's get started and create a view for our factCustomerCount table. In this view we will omit the customerCount_key
and run_id. Switch to MySQL mode, and run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Fact_CustomerCount
AS
SELECT date_key, customer_key, store_key, customer_count as `Customer Count`
FROM factCustomerCount;
OBSERVE:
mysql> CREATE VIEW Fact_CustomerCount
-> AS
-> SELECT date_key, customer_key, store_key, customer_count as `Customer Count`
-> FROM factCustomerCount;
Query OK, 0 rows affected (0.06 sec)
mysql>
It all looks good so far. To test out this view, let's check out the first ten rows. Run this command against your
personal database:
CODE TO TYPE:
SELECT * from Fact_CustomerCount
LIMIT 0, 10;
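If you typed everything correctly, you'll see the first ten rows of the view. Next, create a Fact_Sales view for the
factSales table in the same way, omitting sales_key and run_id and renaming sales_amount to `Sales Amount`. MySQL
should answer with this familiar result: Query OK, 0 rows affected (0.13 sec).
With a couple of facts out of the way, let's turn our attention to dimensions. Start with dimCustomer. Run this
command against your personal database: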
CODE TO TYPE:
CREATE VIEW Dimension_Customer
AS
SELECT customer_key, customer_id, first_name as `First Name`, last_name as `Last Name`,
Email, Address, address2 as `Address 2`,
District, City, Country, postal_code as `Postal Code`,
Phone, Active, create_date as `Create Date`
FROM dimCustomer;
Great! Let's move on to dimDate. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Date
AS
SELECT date_key, Date, Year, Quarter, Month, month_name as `Month Name`, Day,
day_name as `Day of Week`, week as `Week In Year`,
is_weekend as `Is Weekend`, is_holiday as `Is Holiday`
FROM dimDate;
The next view we'll create is for dimMovie. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Movie
AS
SELECT movie_key, film_id, Title, Description, release_year as `Release Year`,
Language, original_language as `Original Language`,
rental_duration as `Rental Duration`, Length,
Rating, special_features as `Special Features`
FROM dimMovie;
The last view we'll create for now is for the dimStore table. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Store
AS
SELECT store_key, store_id, Address, address2 as `Address 2`, District,
City, Country, postal_code as `Postal Code`, Region,
manager_first_name as `Manager First Name`,
manager_last_name as `Manager Last Name`
FROM dimStore;
Great! We're ready to get started with our queries!
Answering Questions
We'll use a standard template for querying the data warehouse. In the text below, blue signifies facts, and red
signifies dimensions. It isn't necessary to use all parts of our template, especially if we aren't interested in limiting our
query using a WHERE clause.
OBSERVE:
SELECT columns from dimension tables,
SUM( fact columns )
FROM fact view
INNER JOIN dimension view 1 on (fact column = dimension column )
INNER JOIN dimension view 2 on (fact column = dimension column )
.
.
WHERE Limits to dimensions
limits to facts
GROUP BY dimension columns
ORDER BY dimension columns, fact columns
LIMIT 0, 5 (optional "top 5" results)
In lesson two we discussed questions that would be posed by management. We rewrote these questions so that they
were in the format of facts and dimensions. Now let's try to answer some of them!
First up: How many new customers did we add by quarter?
To answer this question, we'll need data from the Fact_CustomerCount and Dimension_Date views. Let's write a
query using the template we already created. Although our question does not specify a particular sorting order, we'll
sort by Quarter. Run this command against your personal database:
CODE TO TYPE:
SELECT dd.Quarter,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
GROUP BY dd.Quarter
ORDER BY dd.Quarter;
MySQL will reply with your answer:
OBSERVE:
mysql> SELECT dd.Quarter,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
-> GROUP BY dd.Quarter
-> ORDER BY dd.Quarter;
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |            158 |
| Q2      |            140 |
| Q3      |            143 |
| Q4      |            148 |
+---------+----------------+
4 rows in set (0.06 sec)
mysql>
That's pretty slick! We didn't even have to figure out the specific quarter each customer registered.
Note
If you see an error that looks like this: ERROR 1305 (42000): FUNCTION certjosh.SUM does not
exist, make sure you do not have any spaces between SUM and (. SUM(column) works in MySQL,
but SUM (column) will return an error.
Now suppose we want to extend this query, so it answers this question: How many new customers did we add
by quarter and by month? Let's go back to our template, and add a new column. Run this command against your
personal database:
OBSERVE:
mysql> SELECT dd.Quarter, dd.`Month Name`,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
-> GROUP BY dd.Quarter, dd.`Month Name`
-> ORDER BY dd.Quarter, dd.`Month`;
+---------+------------+----------------+
| Quarter | Month Name | Customer Count |
+---------+------------+----------------+
| Q1      | January    |             54 |
| Q1      | February   |             47 |
| Q1      | March      |             57 |
| Q2      | April      |             44 |
| Q2      | May        |             45 |
| Q2      | June       |             51 |
| Q3      | July       |             46 |
| Q3      | August     |             51 |
| Q3      | September  |             46 |
| Q4      | October    |             48 |
| Q4      | November   |             51 |
| Q4      | December   |             49 |
+---------+------------+----------------+
Okay, let's move on to a new question: What was the amount of sales revenue we earned, by store and by
month? To answer this question, we'll need to use the Fact_Sales, Dimension_Store, and Dimension_Date
views. This time we will try an alternate GROUP BY syntax; we'll specify the columns by position instead of by name.
For this query, 1 is the first column, ds.Address, and 2 is the second column, dd.`Month Name`. Run this command
against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY 1, 2
ORDER BY ds.Address, dd.`Month`;
Once again, our warehouse answers our questions right away:
OBSERVE:
mysql> SELECT ds.Address, dd.`Month Name`,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
-> INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
-> GROUP BY 1, 2
-> ORDER BY ds.Address, dd.`Month`;
+--------------------+------------+--------------+
| Address            | Month Name | Sales Amount |
+--------------------+------------+--------------+
| 28 MySQL Boulevard | February   |       270.09 |
| 28 MySQL Boulevard | May        |      2328.30 |
| 28 MySQL Boulevard | June       |      4829.30 |
| 28 MySQL Boulevard | July       |     13873.70 |
| 28 MySQL Boulevard | August     |     11910.70 |
| 47 MySakila Drive  | January    |       999.99 |
| 47 MySakila Drive  | February   |       238.11 |
| 47 MySakila Drive  | May        |      2418.35 |
| 47 MySakila Drive  | June       |      4640.01 |
| 47 MySakila Drive  | July       |     14020.33 |
| 47 MySakila Drive  | August     |     11740.45 |
+--------------------+------------+--------------+
11 rows in set (0.70 sec)
Note
The sakila database is a random set of data. That's the reason there were no sales for "47 MySakila
Drive" in March.
Now suppose we want to find out the top five sales, by store and by month. Let's give it a try! Run this command
against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY ds.Address, dd.`Month Name`
ORDER BY 1, 2
LIMIT 0, 5;
Once again, MySQL answered our question, but it isn't exactly the information we want:
OBSERVE:
mysql> SELECT ds.Address, dd.`Month Name`,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
-> INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
-> GROUP BY ds.Address, dd.`Month Name`
-> ORDER BY 1, 2
-> LIMIT 0, 5;
+--------------------+------------+--------------+
| Address            | Month Name | Sales Amount |
+--------------------+------------+--------------+
| 28 MySQL Boulevard | August     |     11910.70 |
| 28 MySQL Boulevard | February   |       270.09 |
| 28 MySQL Boulevard | July       |     13873.70 |
| 28 MySQL Boulevard | June       |      4829.30 |
| 28 MySQL Boulevard | May        |      2328.30 |
+--------------------+------------+--------------+
5 rows in set (0.39 sec)
We retrieved five results, but they are not the top five: we sorted by store and month name instead of sorting by
Sales Amount in descending order. Let's sort by the third column (Sales Amount) first, in descending order. Run
this command against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY ds.Address, dd.`Month Name`
ORDER BY 3 DESC, 1, 2
LIMIT 0, 5;
Bad Joins
There is another more serious problem that may take place in our data warehouse - bad joins.
Suppose you come into the office one day, and are asked to answer a question we've seen many times
before: How many new customers did we add by quarter? Let's approach this question again. Run this
command against your personal database:
CODE TO TYPE:
SELECT dd.Quarter,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = fc.date_key)
GROUP BY dd.Quarter
ORDER BY dd.Quarter;
You probably noticed right away that something was strange. The query takes a very long time to return
results, and when it finally does, they look really strange:
OBSERVE:
mysql> SELECT dd.Quarter,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = fc.date_key)
-> GROUP BY dd.Quarter
-> ORDER BY dd.Quarter;
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |        2711167 |
| Q2      |        2733549 |
| Q3      |        2763588 |
| Q4      |        2763588 |
+---------+----------------+
4 rows in set (14.04 sec)
Compare these results to the results we calculated previously:
OBSERVE:
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |            158 |
| Q2      |            140 |
| Q3      |            143 |
| Q4      |            148 |
+---------+----------------+
So what happened here? It was a bad join. Instead of writing (fc.date_key = fc.date_key) for our join
criteria, we should have written (fc.date_key = dd.date_key).
The condition we typed compares a column to itself, so it is true for every pair of rows. In effect we forgot to
specify how Fact_CustomerCount joins to Dimension_Date. This caused MySQL to return the cartesian product
of those two tables instead of the properly joined results.
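The effect is easy to reproduce on a tiny scale. In this sketch, two small hypothetical tables stand in for the Fact_CustomerCount and Dimension_Date views, and an in-memory SQLite database stands in for MySQL; the rows are made up so the numbers can be checked by hand.

```python
import sqlite3

# Assumptions: small made-up tables stand in for the course's views, and
# in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dimension_Date (date_key INTEGER, Quarter TEXT);
    CREATE TABLE Fact_CustomerCount (date_key INTEGER, customer_count INTEGER);
    INSERT INTO Dimension_Date VALUES (1, 'Q1'), (2, 'Q1'), (3, 'Q2');
    INSERT INTO Fact_CustomerCount VALUES (1, 1), (1, 1), (3, 1);
""")

template = """
    SELECT dd.Quarter, SUM(fc.customer_count)
    FROM Fact_CustomerCount fc
    INNER JOIN Dimension_Date dd ON ({condition})
    GROUP BY dd.Quarter
    ORDER BY dd.Quarter
"""

# Correct join: each fact row matches exactly one date row.
good = conn.execute(template.format(condition="fc.date_key = dd.date_key")).fetchall()

# Bad join: the condition is always true, so all 3 fact rows are paired
# with every date row (2 date rows in Q1, 1 date row in Q2).
bad = conn.execute(template.format(condition="fc.date_key = fc.date_key")).fetchall()

print(good)
print(bad)
```

Each bad total is the number of fact rows multiplied by the number of date rows in that quarter, which is why the totals in the real warehouse run into the millions.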
The real danger behind bad joins is that they can often go undetected. This example is extreme - our
business users would likely know there is a problem with the query, since the results for Q1 are over 17,000
times the actual value. That is pretty far off! But what if our company typically added 3,000,000 new customers
in a quarter? Then 2,711,167 wouldn't seem so far off at all.
The best way to prevent bad joins is to have many different people review each query written against the data
warehouse. No query tool can tell you if your join is bad, or if your query is otherwise written incorrectly.
Incorrect Filtering
Now suppose your boss wants to know which movies had sales greater than $10.00. You sit down at your
desk, and write a quick query to answer the question. Run this command against your personal
database:
CODE TO TYPE:
SELECT dm.Title,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
WHERE fs.`Sales Amount` > 10
GROUP BY 1;
It looks like fifty movies have had sales greater than $10.00:
OBSERVE:
mysql> SELECT dm.Title,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
-> WHERE fs.`Sales Amount` > 10
-> GROUP BY 1;
+---------------------------+--------------+
| Title                     | Sales Amount |
+---------------------------+--------------+
| !!! MISSING MOVIE !!!     |       999.99 |
| AMERICAN CIRCUS           |        43.96 |
| AUTUMN CROW               |        10.99 |
| BACKLASH UNDEFEATED       |        10.99 |
| BEAST HUNCHBACK           |        21.98 |
| BEHAVIOR RUNAWAY          |        21.98 |
| BILKO ANONYMOUS           |        43.96 |
| BRIGHT ENCOUNTERS         |        10.99 |
| CARIBBEAN LIBERTY         |        43.96 |
| CASUALTIES ENCINO         |        10.99 |
| DAUGHTER MADIGAN          |        10.99 |
| DOORS PRESIDENT           |        10.99 |
| FLASH WARS                |        10.99 |
| FLINTSTONES HAPPINESS     |        44.96 |
| FOOL MOCKINGBIRD          |        32.97 |
| GARDEN ISLAND             |        10.99 |
| HUSTLER PARTY             |        32.97 |
| INNOCENT USUAL            |        21.98 |
| ISHTAR ROCKETEER          |        10.99 |
| KING EVOLUTION            |        21.98 |
| KISSING DOLLS             |        43.96 |
| MAIDEN HOME               |        21.98 |
| MIDSUMMER GROUNDHOG       |        22.98 |
| MINDS TRUMAN              |        32.97 |
| MINE TITANS               |        55.95 |
| NIGHTMARE CHILL           |        10.99 |
| PANIC CLUB                |        10.99 |
| PATHS CONTROL             |        10.99 |
| PINOCCHIO SIMON           |        10.99 |
| RANGE MOONWALKER          |        21.98 |
| SATISFACTION CONFIDENTIAL |        10.99 |
| SATURDAY LAMBS            |        54.95 |
| SCORPION APOLLO           |        23.98 |
| SECRETS PARADISE          |        10.99 |
| SHOW LORD                 |        22.98 |
| STING PERSONAL            |        33.97 |
| STRANGER STRANGERS        |        10.99 |
| SUIT WALLS                |        21.98 |
| SUNRISE LEAGUE            |        21.98 |
| TEEN APOLLO               |        21.98 |
| TELEGRAPH VOYAGE          |        65.94 |
| TIES HUNGER               |        11.99 |
| TITANIC BOONDOCK          |        21.98 |
| TORQUE BOUND              |        43.96 |
| TRAP GUYS                 |        33.97 |
| TYCOON GATHERING          |        43.96 |
| VIRTUAL SPOILERS          |        44.96 |
| WIFE TURN                 |        43.96 |
| WONDERLAND CHRISTMAS      |        10.99 |
| ZORRO ARK                 |        10.99 |
+---------------------------+--------------+
50 rows in set (0.10 sec)
At first glance this answer appears to be correct, but is this result exactly what we wanted? Not quite. Let's take
a look at the query again, with an English translation for each line:
OBSERVE:
SELECT dm.Title,                                                -- Show the movie title
SUM( `Sales Amount` ) as `Sales Amount`                         -- And total sales per movie
FROM Fact_Sales fs                                              -- from Fact_Sales
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)  -- and from Dimension_Movie
WHERE fs.`Sales Amount` > 10                                    -- Where Sales Amount in Fact_Sales is greater than 10
GROUP BY 1;
Instead of returning movies with total sales greater than $10.00, we have returned movies that had at least one
individual sale greater than $10.00, and the totals shown include only those individual sales. We filtered our
data incorrectly.
So how do we fix this error? One way would be to use a sub-query. We'll calculate the total sales for
each movie, then limit those results to Sales Amount > 10. Run this command against your personal
database:
CODE TO TYPE:
SELECT Title, `Sales Amount`
FROM (SELECT dm.Title,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
GROUP BY 1
) as subQuery
WHERE `Sales Amount` > 10;
This query returns many more movies in the result - it looks like nearly every movie has generated sales
greater than $10.00.
OBSERVE:
+-----------------------------+--------------+
| Title                       | Sales Amount |
+-----------------------------+--------------+
| !!! MISSING MOVIE !!!       |       999.99 |
| ACADEMY DINOSAUR            |        35.78 |
| ACE GOLDFINGER              |        52.93 |
| ADAPTATION HOLES            |        32.89 |
| AFFAIR PREJUDICE            |        91.77 |
...lines omitted...
| WRATH MILE                  |        23.86 |
| WRONG BEHAVIOR              |        62.80 |
| WYOMING STORM               |        72.87 |
| YENTL IDAHO                 |       130.78 |
| YOUTH KICK                  |        12.95 |
| ZHIVAGO CORE                |        14.91 |
| ZOOLANDER FICTION           |        67.84 |
| ZORRO ARK                   |       214.69 |
+-----------------------------+--------------+
947 rows in set (0.74 sec)
This type of problem is difficult to spot, especially when our first query seems to work. It's important to have
peers review queries to make sure everything is written correctly.
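The sub-query is one way to filter on a total. Standard SQL also offers the HAVING clause, which filters groups after GROUP BY has been applied. The lesson itself does not use HAVING here, so treat the following as an alternative sketch rather than the course's method. It uses the same Fact_Sales and Dimension_Movie tables and should return the same movies as the sub-query version:

```sql
-- Total sales per movie, keeping only movies whose TOTAL exceeds 10.
-- HAVING is evaluated after grouping, unlike WHERE, which is evaluated
-- against individual Fact_Sales rows before grouping.
SELECT dm.Title,
       SUM(fs.`Sales Amount`) AS `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Movie dm ON (fs.movie_key = dm.movie_key)
GROUP BY dm.Title
HAVING SUM(fs.`Sales Amount`) > 10;
```

The aggregate is repeated in the HAVING clause instead of referring to the `Sales Amount` alias, because the alias has the same name as the underlying column and could be ambiguous.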
As usual, we've covered a lot in this lesson! You're doing really great, and you're nearly done with this course. The next lesson
will be a description of your final project. See you then!
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
Final Project
DBA 3: Data Warehousing Lesson 14
Northwind Traders
Northwind Traders is a sample database that Microsoft distributed with its Access product. It's an old database (the
newest dates in it are from 1996), but it is a great example of database design.
This project uses an SQLite version of the Northwind Traders database. Unlike MySQL, SQLite packages an entire
database (tables, views, indexes, etc.) into a single file.
Here's a diagram of the tables from the database that get used in this project:
For your final project, you'll use Northwind Traders as a data source to design, implement, and populate a data
warehouse.
You are required to implement these dimensions:
Date
Employees
Customers
Suppliers
Products
Orders
and these facts:
Order Unit Price
Note
If you type in the file name, be sure to use forward slashes instead of backslashes:
C:/talend_files/Nwind.db
Instead of using MySQL components as the source for your data, you'll need to use the SQLite components. You can
use the SQL Builder tool in TOS to examine the tables in Northwind and see the various data types of columns. To use
the SQL Builder, click on the
Your data warehouse will be located in your existing MySQL database. To distinguish tables for your final project from
other tables, use the prefix fp_ for your table names. For example, you might name your date dimension: fp_dimDate.
fp_dimDate
Your date dimension should be called fp_dimDate.
The dates in Northwind Traders range from 1994 to 1996. Make sure your date dimension has all of the
required dates in it. You can use the file c:/talend_files/NwindDates.xls to load your date dimension if
you like. This date dimension is not the same as the one used in the class; its columns are: date,
is_weekend, year, quarter, month, and day.
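One way to confirm that the loaded dimension has no gaps is to compare the number of distinct dates with the number of days between the earliest and latest date. The query below is a sketch that assumes the column is named `date`, as in the column list above, and is stored as a MySQL DATE; adjust the names to match your own table.

```sql
-- Sanity check for fp_dimDate (assumes a DATE column named `date`).
-- With no gaps and no duplicates, distinct_dates equals days_spanned.
SELECT MIN(`date`)                            AS first_date,
       MAX(`date`)                            AS last_date,
       COUNT(DISTINCT `date`)                 AS distinct_dates,
       DATEDIFF(MAX(`date`), MIN(`date`)) + 1 AS days_spanned
FROM fp_dimDate;
```

If your dimension covers every day from 1994-01-01 through 1996-12-31, both counts would be 1096, since 1996 is a leap year.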
fp_dimEmployees
Use the following query to populate fp_dimEmployees:
CODE TO TYPE:
SELECT EmployeeID, LastName, FirstName, Title, TitleOfCourtesy,
BirthDate, HireDate, Address, City, Region, PostalCode, Country,
HomePhone, Extension
FROM Employees;
Note
The dates in the SQLite database are strings, and we need to convert them to dates in our tMap
component before writing them out to MySQL. Assuming that the HireDate column is identified
in the expression column of the tMap outputs section as row2.HireDate, we would use this
expression:
TalendDate.parseDate("dd-MMM-yyyy", row2.HireDate)
In the Schema editor in the lower right side, set HireDate's Type to Date and the Date Pattern to
"yyyy-MM-dd". Apply this same technique to other dates encountered in this project.
fp_dimCustomers
Use the following query to populate fp_dimCustomers, a Type-2 SCD:
CODE TO TYPE:
SELECT CustomerID, CompanyName, ContactName, ContactTitle, Address, City,
Region, PostalCode, Country, Phone, Fax
FROM Customers;
fp_dimSuppliers
Use the following query to populate fp_dimSuppliers, a Type-2 SCD:
CODE TO TYPE:
SELECT SupplierID, CompanyName, ContactName, ContactTitle,
Address, City, Region, PostalCode, Country, Phone, Fax, HomePage
FROM Suppliers;
fp_dimProducts
Use the following query to populate fp_dimProducts, a Type-2 SCD:
CODE TO TYPE:
SELECT Products.ProductID, Products.ProductName, Products.Discontinued,
Categories.CategoryName, Categories.Description as CategoryDescription
FROM Products
INNER JOIN Categories on Products.CategoryID = Categories.CategoryID;
fp_dimOrders
Use the following query to populate fp_dimOrders:
CODE TO TYPE:
SELECT Orders.OrderID, Customers.CompanyName as CustomerName, Customers.ContactName,
Orders.OrderDate, Orders.RequiredDate,
Orders.ShipName, Orders.ShipAddress, Orders.ShipCity,
Orders.ShipRegion, Orders.ShipPostalCode, Orders.ShipCountry
FROM Orders
INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID;
Note
There are no dates for this fact. We want to join on the "latest" values for the Product and Supplier dimensions.
To do so, you need to add the expression end_date='2099-01-01' to your join.
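As a rough illustration of that join condition, the sketch below looks up the current row in each Type-2 dimension. Only the dimension table names and the end_date='2099-01-01' expression come from this project description. The staging table stage_OrderDetails and the product_key and supplier_key columns are hypothetical names chosen for the example, and your own job may perform the same lookup inside a Talend component instead of in hand-written SQL.

```sql
-- Sketch only. Assumptions (not from the project description):
--   * stage_OrderDetails is a hypothetical MySQL staging table holding
--     OrderID, ProductID, SupplierID, and UnitPrice copied from Northwind;
--   * product_key and supplier_key are hypothetical surrogate key columns.
SELECT st.OrderID,
       dp.product_key,
       ds.supplier_key,
       st.UnitPrice
FROM stage_OrderDetails st
INNER JOIN fp_dimProducts dp
        ON dp.ProductID = st.ProductID
       AND dp.end_date = '2099-01-01'   -- current version of the product
INNER JOIN fp_dimSuppliers ds
        ON ds.SupplierID = st.SupplierID
       AND ds.end_date = '2099-01-01';  -- current version of the supplier
```

Putting the end_date test in each join condition ensures that a product or supplier with several historical versions contributes exactly one row to the fact.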
As always, feel free to contact your mentor if you have any questions.
Thanks for playing, have fun, and good luck with this last project. It's been great working with you!