DW - Chap 6
Having designed the logical models of the dimensional data store (DDS) and the normalized data store (NDS) in the previous chapter, we now look at how to implement those data stores physically as SQL Server databases.
1. For each database we will create six filegroups, located on six different physical disks on RAID 5, with one transaction log file located on the RAID 1 log disk. A filegroup is a collection of database files.
2. Set the default database locations in the server properties to match this layout. This is necessary so that when we create a new database (for example, a new DDS), SQL Server puts the database data and log files in the correct locations. To do this, right-click the SQL Server name in Management Studio and choose Properties, click Database Settings, and modify the default database locations.
3. Remember to put the stage log file on a different disk from the NDS and DDS logs.
Using Views
A view is a database object with columns and rows like a table, but it is not persisted on disk. A view is created using a SELECT statement that filters columns and rows from a table or from a combination of several tables joined with a JOIN clause. In data warehousing, views are used for three purposes:
a. To create conformed dimensions in dimensional data stores. Two dimensions are conformed when they are either the same dimension table or one is a subset of the other. As I discussed in Chapter 1, dimension A is said to be a subset of dimension B when all the columns of dimension A exist in dimension B and all the rows of dimension A exist in dimension B.
b. To shield users from the physical tables, which makes things simpler for the users and lets us restrict access. A view creates a virtual layer on top of the physical tables. Users no longer access the physical tables; instead, they access the views. This additional layer enables us to create a virtual table that gets data from several tables. It also enables us to prevent users from accessing certain columns or certain rows.
c. To increase availability, particularly to keep the data warehouse up and running while we are populating it. This is used in user-facing data stores such as the DDS. For every dimension table and every fact table, we create two physical tables and one view that selects from one of the two tables. When table 1 is being populated, the view selects from table 2. Conversely, when table 2 is being populated, the view selects from table 1.
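Purposes b and c can be tried out on a small scale. The following is a minimal sketch, not the implementation used for the warehouse: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server, and every table, view, and column name (dim_customer, v_customer_uk, dim_product_1, dim_product_2, dim_product) and the helper switch_view are hypothetical. SQLite has no ALTER VIEW, so the sketch drops and recreates the view inside an explicit transaction; on SQL Server the view definition could be changed with ALTER VIEW instead.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server. isolation_level=None puts the
# connection in autocommit mode so BEGIN/COMMIT can be issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)

conn.executescript("""
    -- Purpose b: a view that hides a column (email) and filters rows.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name TEXT, email TEXT, country TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'Ann', 'ann@example.com', 'UK'),
        (2, 'Bob', 'bob@example.com', 'US');
    CREATE VIEW v_customer_uk AS
        SELECT customer_key, name, country
        FROM dim_customer
        WHERE country = 'UK';

    -- Purpose c: two physical copies of a dimension and one view over them.
    CREATE TABLE dim_product_1 (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product_2 (product_key INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO dim_product_1 VALUES (1, 'Music');
    INSERT INTO dim_product_2 VALUES (1, 'Music');
    CREATE VIEW dim_product AS
        SELECT product_key, name FROM dim_product_1;
""")


def switch_view(conn, target_table):
    """Repoint the dim_product view at target_table (hypothetical helper)."""
    if target_table not in ("dim_product_1", "dim_product_2"):
        raise ValueError(f"unexpected table: {target_table}")
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW dim_product")
        conn.execute(
            "CREATE VIEW dim_product AS "
            f"SELECT product_key, name FROM {target_table}"
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# Users of the restricted view see only the permitted columns and rows.
cursor = conn.execute("SELECT * FROM v_customer_uk")
uk_columns = [col[0] for col in cursor.description]
uk_rows = cursor.fetchall()

# The load writes to the offline copy while users keep reading via the view.
conn.execute("INSERT INTO dim_product_2 VALUES (2, 'Film')")
rows_during_load = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]

# Once the load has finished, the view is switched to the fresh copy.
switch_view(conn, "dim_product_2")
rows_after_switch = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
```

In this sketch the query through dim_product keeps returning the old copy until switch_view runs, which is the behavior purpose c relies on. The next load would then write to dim_product_1 and switch back.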
Partitioning
Of the many things that can improve the performance of a data warehouse, I would say partitioning is the most important. SQL Server 2005 and 2008 provide a feature for partitioning a physical table. Previously, in SQL Server 2000, we could only partition a view; we could not partition a table.
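Table partitioning in SQL Server is driven by a partition function (CREATE PARTITION FUNCTION), which uses a sorted list of boundary values to map each row's partitioning column to a partition number. The following is a rough sketch of that mapping in Python, not SQL Server's implementation. It assumes RANGE RIGHT semantics, where a boundary value belongs to the partition on its right, and it assumes monthly boundaries on the first day of each month of 2008 plus 1 January 2009. The names BOUNDARIES and partition_number and the sample dates are made up for illustration.

```python
from bisect import bisect_right
from datetime import date

# Assumed boundaries: the first day of each month of 2008, plus 1 January 2009
# so that December 2008 also gets a partition of its own.
# 13 boundary values give 14 partitions.
BOUNDARIES = [date(2008, month, 1) for month in range(1, 13)] + [date(2009, 1, 1)]


def partition_number(d, boundaries=BOUNDARIES):
    """Return the 1-based partition for date d under RANGE RIGHT semantics.

    Partition 1 holds everything before the first boundary; a value equal
    to a boundary falls into the partition to the right of that boundary.
    """
    return bisect_right(boundaries, d) + 1


# Hypothetical subscription dates, including the edges of the year.
sample_dates = [
    date(2007, 12, 31),  # before the first boundary
    date(2008, 1, 1),    # exactly on a boundary
    date(2008, 1, 31),
    date(2008, 12, 31),
    date(2009, 1, 1),    # on the last boundary
]
sample_partitions = [partition_number(d) for d in sample_dates]
```

With these assumed boundaries, partition 1 collects anything before 2008, partitions 2 to 13 hold January to December 2008, and partition 14 collects everything from 2009 onward. Under RANGE LEFT semantics a boundary value would instead belong to the partition on its left, which corresponds to using bisect_left in this sketch.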
Vertical partitioning is splitting a table vertically into several smaller tables, with each table containing some columns of the original table. Horizontal partitioning is splitting a table horizontally into several smaller tables, with each table containing some rows of the original table. Let's partition our Subscription Sales table. In this example, we have a fact table storing customer subscriptions. We will partition the table into monthly partitions; in other words, each month will go into its own partition. In this example, we will use January to December 2008.
This is quite a long chapter. Database design is the cornerstone of data warehousing. We will build the ETL and applications on this foundation, so we must get it right. In this chapter, we discussed the details of the hardware platform and system architecture, the disk space calculation, database creation, and table and view creation. We also covered the top three factors that can improve data warehouse performance: summary tables, partitioning, and indexing. We need to make sure they are set up correctly from the beginning, in other words, when we create the databases, not later when we have performance problems. Now that we have built the databases, in the next two chapters you will learn how to extract data from the source systems and populate our NDS and DDS databases, a process widely known as ETL, which stands for extract, transform, load.