Module 1 Intro To Data Warehousing
A data warehouse is a centralized storage system that allows for the storing, analyzing, and interpreting
of data in order to facilitate better decision-making. Transactional systems, relational databases, and
other sources provide data into data warehouses on a regular basis.
A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-interface applications; as well as external
partner systems. This data is then made available for decision-makers to access and analyze.
So what is a data warehouse? For a start, it is a comprehensive repository of current and historical data.
Key Characteristics of a Data Warehouse
Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the overall
processes of a business. Such subjects may be sales, promotion, inventory, etc. For example, if you want
to analyze your company’s sales data, you need to build a data warehouse that concentrates on sales.
Such a warehouse would provide valuable information like ‘who was your best customer last year?’ or
‘who is likely to be your best customer in the coming year?’
Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format. The data
must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming,
format, and coding. This facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous data is
not erased when current data is entered. This helps you to analyze what has happened and when.
Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly.
An example of time variance in Data Warehouse is exhibited in the Primary Key, which must have an
element of time like the day, week, or month.
Although a data warehouse and a traditional database share some similarities, they are not the same
concept. The main difference is that in a database, data is collected for multiple transactional purposes.
However, in a data warehouse, data is collected on an extensive scale to perform analytics. Databases
provide real-time data, while warehouses store data to be accessed for big analytical queries.
A data warehouse is an example of an OLAP (online analytical processing) system, that is, an online
database query-answering system. OLTP (online transaction processing) is an online database-modifying
system; an ATM is a common example.
Structured data is information that has been formatted and transformed into a well-defined
data model. The raw data is mapped into predesigned fields that can then be extracted and
read through SQL easily. SQL relational databases, consisting of tables with rows and columns,
are the perfect example of structured data.
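The rows-and-columns model described above can be illustrated with a small relational table. The following is a minimal sketch using Python's built-in sqlite3 module; the customers table and its rows are hypothetical examples, not part of the original text:

```python
import sqlite3

# In-memory database: a structured-data store with a predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Alice", "Pune"), (2, "Bob", "Delhi")],
)

# Because the data is mapped into predefined fields, it can be read through SQL.
rows = conn.execute("SELECT name FROM customers WHERE city = 'Pune'").fetchall()
print(rows)  # [('Alice',)]
```

Because every value sits in a predesigned field, queries like the one above can filter and extract data easily, which is exactly what makes structured data straightforward to analyze.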
The relational model of this data format uses storage efficiently since it minimizes data redundancy.
However, this also means that structured data is more interdependent and less flexible. Now
let’s look at more examples of structured data.
This type of data is generated by both humans and machines. There are numerous examples
of structured data generated by machines, such as POS data like quantity, barcodes, and
weblog statistics. Similarly, anyone who works with data has used spreadsheets at some point,
which are a classic case of structured data generated by humans. Due to the
organization of structured data, it is easier to analyze than both semi-structured and
unstructured data.
Your data sets may not always be structured or unstructured; semi-structured data or partially
structured data is another category between structured and unstructured data. Semi-structured
data is a type of data that has some consistent and definite characteristics. It does not conform to
a rigid structure such as the one required by relational databases. Organizational properties like
metadata or semantics tags are used with semi-structured data to make it more manageable;
however, it still contains some variability and inconsistency.
An example of a semi-structured data format is a delimited file, which contains elements that can
break the data down into separate hierarchies. Similarly, in digital photographs, the image does not
have a pre-defined structure itself but has certain structural attributes making them semi-
structured. For instance, if an image is taken from a smartphone, it would have some structured
attributes like geotag, device ID, and DateTime stamp. After being stored, images can also be
assigned tags such as ‘pet’ or ‘dog’ to provide a structure.
On some occasions, unstructured data is classified as semi-structured data because it has one or
more classifying attributes.
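The photo example above can be sketched in JSON, a common semi-structured format: records share some structural attributes but vary in others. The photo records and field names here are hypothetical:

```python
import json

# Two photo records: both carry structural attributes (device ID, timestamp),
# but the optional "tags" field varies from record to record.
records = json.loads("""
[
  {"device_id": "phone-01", "taken_at": "2020-09-23T05:21:01", "tags": ["pet", "dog"]},
  {"device_id": "phone-02", "taken_at": "2020-09-24T11:02:47"}
]
""")

# The schema is not rigid: .get() handles the missing "tags" field gracefully.
tagged = [r for r in records if r.get("tags")]
print(len(tagged))  # 1
```

Note how the second record simply omits a field instead of storing a NULL in a fixed column, which is the variability and inconsistency the text describes.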
Unstructured data is defined as data present in absolute raw form. This data is difficult to
process due to its complex arrangement and formatting. Unstructured data management takes data in
many forms, including social media posts, chats, satellite imagery, IoT sensor data, emails, and
presentations, and organizes it in a logical, predefined manner in a data store.
In contrast, the meaning of structured data is data that follows predefined data models and is
easy to analyze. Structured data examples would include alphabetically arranged names of
customers and properly organized credit card numbers. After understanding the definition of
unstructured data, let’s look at some examples.
Unstructured data can be anything that is not in a specific format. This can be a paragraph from
a book with relevant information, a web page, log files that are not easy to parse, or social media
comments and posts that need to be analyzed.
For instance, a raw log entry with its timestamp can look like this:
38,P-R-38636-6-45,P-R-39105-1-11,P-R-38036-1-5,P-R-35697-1-13,P-R-35087-1-
27,P-R-34341-1-9,P-R-33341-1-15,P-R-33110-1-29,P-R-31345-1-693,P-R-29076-1-
6,P-R-28767-1-8,P-R-28540-2-8,P-R-28312-1-10,P-R-28069-1-27,P-R-28032-1-9,P-R-
26562-1-12,P-R-26527-5-20,P-R-26164-1-11,P-R-25785-1-30,P-R-25095-9-70,P-R-
23504-1-15,P-R-19719-5-41203
Wed Sep 23 2020 05:21:01 GMT+0500
An operational system is a term used in data warehousing for a system that processes the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Metadata
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which makes finding and working with
particular instances of data easier. For example, author, date created, date modified, and file
size are examples of very basic document metadata.
The summary area of the data warehouse saves all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of summarized information is to speed up query performance. The summarized record is
updated continuously as new information is loaded into the warehouse.
To serve this purpose, the DW should be loaded at regular intervals. The data in the system is gathered
from one or more operational systems, flat files, etc. The process that brings the data into the DW is
known as the ETL process. Extraction, Transformation, and Loading are the tasks of ETL.
#1) Extraction: All the preferred data from various source systems such as databases, applications, and flat
files is identified and extracted. Data extraction can be completed by running jobs during non-business hours.
#2) Transformation: Most of the extracted data can’t be directly loaded into the target system. Based on the
business rules, some transformations can be done before loading the data.
For Example, a target column of data may expect two source columns of concatenated data as input.
Likewise, there may be complex logic for data transformation that needs expertise. Some data that does not
need any transformations can be directly moved to the target system.
The transformation process also corrects the data, removes any incorrect data and fixes any errors in the data
before loading it.
#3) Loading: All the gathered information is loaded into the target Data Warehouse tables.
With the above steps, ETL achieves the goal of converting data of different formats from different
sources into a single DW format, which benefits the whole ETL process. Such logically organized data is
more useful for better analysis.
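The extract-transform-load flow described above can be sketched end to end. This is a minimal illustration; the source rows, the name-casing rule, and the target list are hypothetical stand-ins for real source systems and DW tables:

```python
# Hypothetical source rows, as they might arrive from a flat file.
source_rows = [
    {"name": "alice", "amount": "120.50"},
    {"name": "bob", "amount": "80.00"},
]

def extract(rows):
    # Extraction: pull the preferred data from the source system.
    return list(rows)

def transform(rows):
    # Transformation: apply business rules, e.g. standardize names
    # and convert amounts from text into numbers.
    return [{"name": r["name"].title(), "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    # Loading: write the transformed rows into the target DW table.
    target.extend(rows)

dw_table = []
load(transform(extract(source_rows)), dw_table)
print(dw_table[0])  # {'name': 'Alice', 'amount': 120.5}
```

Real ETL jobs run against databases and files rather than in-memory lists, but the three-stage shape is the same.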
#1) Logical Extraction Methods
Full Extraction: As the name itself suggests, the source system data is completely extracted to the
target table. Each time this kind of extraction loads the entire current source system data without
considering the last extracted time stamps. Preferably you can use full extraction for the initial loads or
tables with fewer data.
Incremental Extraction: The data which is added/modified since a specific date will be considered for
incremental extraction. This date is business-specific, such as the last extracted date (or) the last
order date. We can refer to a timestamp column in the source table itself, (or) a separate table can be
created to track only the extraction date details. Referring to the timestamp is a significant method
during incremental extraction. Logic without a timestamp may fail if the DW table has a large volume of data.
#2) Physical Extraction Methods
Depending on the source systems’ capabilities and the limitations of data, the source systems can provide the
data physically for extraction as online extraction and offline extraction. This supports any of the logical
extraction types.
Online Extraction: We can directly connect to any source system database with the connection
strings to extract data directly from the source system tables.
Offline Extraction: We do not directly connect to the source system database here; instead, the
source system provides data explicitly in a pre-defined structure. Source systems can provide data in
the form of flat files, dump files, archive logs, and tablespaces.
ETL tools are best suited to perform any complex data extractions, any number of times for DW though they
are expensive.
Mostly you can consider the “Audit columns” strategy for the incremental load to capture the data
changes. In general, the source system tables may contain audit columns that store the timestamp for
each insertion (or) modification.
The timestamp may be populated by database triggers (or) by the application itself. You must ensure the
accuracy of the audit columns’ data regardless of how they are populated, so as not to miss the changed
data for incremental loads.
During the incremental load, you can consider the maximum date and time of when the last load has happened
and extract all the data from the source system with the time stamp greater than the last load time stamp.
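The max-timestamp logic above can be sketched as follows. The audit column name, the rows, and the last-load timestamp are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows with an audit column (last_modified).
source = [
    {"id": 1, "last_modified": datetime(2007, 6, 3, 10, 0)},
    {"id": 2, "last_modified": datetime(2007, 6, 4, 9, 30)},
    {"id": 3, "last_modified": datetime(2007, 6, 5, 14, 15)},
]

# Timestamp of when the previous load happened, as the warehouse recorded it.
last_load = datetime(2007, 6, 3, 23, 59)

# Incremental extraction: only rows changed after the last load timestamp.
changed = [r for r in source if r["last_modified"] > last_load]
print([r["id"] for r in changed])  # [2, 3]
```

In a real system the filter would be a SQL WHERE clause against the audit column, but the comparison is the same.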
The transformation process with a set of standards brings all dissimilar data from various source systems into
usable data in the DW system. Data transformation aims at the quality of the data. You can refer to the data
mapping document for all the logical transformation rules.
Based on the transformation rules, if any source data does not meet the instructions, then such source
data is rejected before loading into the target DW system and is placed into a reject file or reject table.
Transformation rules are not specified for straight-load columns (data that does not need any change)
from source to target. Hence, data transformations can be classified as simple and complex. Data
transformations may involve column conversions, data structure reformatting, etc.
Given below are some of the tasks to be performed during Data Transformation:
#1) Selection: You can select either the entire table data or a specific set of columns data from the source
systems. The selection of data is usually completed at the Extraction itself.
There may be cases where the source system does not allow selecting a specific set of columns during the
extraction phase; in that case, extract the whole data and do the selection in the transformation phase.
#2) Splitting/joining: You can manipulate the selected data by splitting or joining it. You may be
required to split the selected source data even further during the transformation.
For example, if the whole address is stored in a single large text field in the source system, the DW
system may require the address to be split into separate fields such as city, state, and zip code. This
makes indexing and analysis based on each individual component easier.
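The address-splitting step can be sketched in a few lines; the address value and the target field names are hypothetical:

```python
# Hypothetical single-field address, as stored in the source system.
raw_address = "221B Baker Street, London, NW1 6XE"

# Split the one large text field into the separate DW fields.
street, city, zip_code = [part.strip() for part in raw_address.split(",")]
record = {"street": street, "city": city, "zip": zip_code}
print(record["city"])  # London
```

Real address parsing is messier (missing commas, extra components), so production pipelines usually validate the split and route malformed rows to a reject table.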
Joining/merging data from two or more columns is widely used during the transformation phase in the DW
system. This does not necessarily mean merging two fields into a single field.
For example, if information about a particular entity comes from multiple data sources, then gathering
that information into a single entity can be called joining/merging the data.
#3) Conversion: The extracted source systems data could be in different formats for each data type, hence all
the extracted data should be converted into a standardized format during the transformation phase. The same
kind of format is easy to understand and easy to use for business decisions.
#4) Summarization: In some situations, the DW will look for summarized data rather than low-level
detailed data from the source systems, because low-level data is not best suited for analysis and
querying by business users.
For example, sales data for every checkout may not be required by the DW system; daily sales by product
(or) daily sales by store is more useful. Hence, summarization of data can be performed during the
transformation phase as per the business requirements.
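The daily-sales-by-product summarization can be sketched as a grouped aggregation; the checkout rows are hypothetical:

```python
from collections import defaultdict

# Hypothetical checkout-level sales rows from the source system.
checkouts = [
    {"date": "2007-06-03", "product": "pen", "amount": 2.0},
    {"date": "2007-06-03", "product": "pen", "amount": 3.0},
    {"date": "2007-06-03", "product": "book", "amount": 10.0},
]

# Summarize low-level checkout data into daily sales by product.
daily_sales = defaultdict(float)
for row in checkouts:
    daily_sales[(row["date"], row["product"])] += row["amount"]

print(daily_sales[("2007-06-03", "pen")])  # 5.0
```

In SQL terms this is a GROUP BY on date and product with SUM(amount); the summarized rows are what get loaded into the warehouse.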
#5) Enrichment: When a DW column is formed by combining one or more columns from multiple records,
data enrichment re-arranges the fields for a better view of the data in the DW system.
#6) Format revisions: Format revisions happen most frequently during the transformation phase. The data
type and its length are revised for each column.
For example, a column in one source system may be numeric and the same column in another source system
may be a text. To standardize this, during the transformation phase the data type for this column is changed to
text.
#7) Decoding of fields: When you are extracting data from multiple source systems, the same data may be
encoded differently in each system.
For example, one source system may represent customer status as AC, IN, and SU. Another system may
represent the same status as 1, 0 and -1.
During the data transformation phase, you need to decode such codes into proper values that are
understandable by the business users. Hence, the above codes can be changed to Active, Inactive and
Suspended.
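The decoding step for the two status encodings above can be sketched with simple lookup tables; the function name and the dispatch-by-type rule are illustrative choices:

```python
# Code mappings for the two source systems described above.
system_a = {"AC": "Active", "IN": "Inactive", "SU": "Suspended"}
system_b = {1: "Active", 0: "Inactive", -1: "Suspended"}

def decode_status(value):
    # Decode either system's code into a business-readable value.
    if isinstance(value, str):
        return system_a[value]
    return system_b[value]

print(decode_status("AC"), decode_status(-1))  # Active Suspended
```

Keeping the mappings in a table (or a reference/lookup table in the DW) rather than scattered if-statements makes them easy to audit and extend.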
#8) Calculated and derived values: By considering the source system data, DW can store additional column
data for the calculations. You have to do the calculations based on the business logic before storing it into DW.
#9) Date/Time conversion: This is one of the key data types to concentrate on. The date/time format may be
different in multiple source systems.
For example, one source may store the date as November 10, 1997. Another source may store the same date
in 11/10/1997 format. Hence, during the data transformation, all the date/time values should be converted into
a standard format.
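The date standardization step can be sketched by trying each known source format in turn; the two formats match the "November 10, 1997" and "11/10/1997" examples above, and the ISO target format is an assumption:

```python
from datetime import datetime

def to_standard_date(value):
    # Try each known source format until one parses successfully.
    for fmt in ("%B %d, %Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unknown date format: {value!r}")

print(to_standard_date("November 10, 1997"))  # 1997-11-10
print(to_standard_date("11/10/1997"))         # 1997-11-10
```

Rows whose dates match none of the known formats raise an error and would typically be routed to a reject file, as described earlier.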
#10) De-duplication: If the source system has duplicate records, ensure that only one record is loaded
into the DW system.
Transformation Flow Diagram:
How To Implement Transformation?
Depending on the complexity of data transformations you can use manual methods, transformation tools (or)
combination of both whichever is effective.
The maintenance cost may become high due to the changes that occur in business rules (or) due to the
chances of getting errors with the increase in the volumes of data. You should take care of metadata initially
and also with every change that occurs in the transformation rules.
In practice, a complete transformation with tools alone is not possible without manual intervention. But
the data transformed by the tools is certainly efficient and accurate.
To achieve this, we should enter proper parameters, data definitions, and rules to the transformation tool as
input. From the inputs given, the tool itself will record the metadata and this metadata gets added to the overall
DW metadata.
If there are any changes in the business rules, then just enter those changes to the tool, the rest of the
transformation modifications will be taken care of by the tool itself. Hence a combination of both methods is
efficient to use.
Data Loading
Extracted and transformed data gets loaded into the target DW tables during the Load phase of the ETL
process. The business decides how the loading process should happen for each table.
Look at the below example, for a better understanding of the loading process in ETL:
#1) During the initial load, the data which is sold on 3rd June 2007 gets loaded into the DW target table because
it is the initial data from the above table.
#2) During the Incremental load, we need to load the data which was sold after 3rd June 2007. We should
consider all the records with a sold date greater than (>) the previous date. Hence, on 4th June 2007,
fetch all the records with sold date > 3rd June 2007 by using queries and load only those two records
from the above table.
On 5th June 2007, fetch all the records with sold date > 4th June 2007 and load only one record from the above
table.
#3) During Full refresh, all the above table data gets loaded into the DW tables at a time irrespective of the sold
date.
The loaded data is stored in the respective dimension (or) fact tables. The data can be loaded,
appended, or merged to the DW tables as follows:
#4) Load: The data gets loaded into the target table if it is empty. If the table already contains data,
the existing data is removed and then the new data gets loaded.
For example,
Existing Table Data

Employee Name    Role
John             Manager
Revanth          Lead
Ronald           Developer

Changed Data

Employee Name    Role
John             Manager
Rohan            Director
Chetan           AVP
Das              VP

Table Data After Load

Employee Name    Role
John             Manager
Rohan            Director
Chetan           AVP
Das              VP
#5) Append: Append is an extension of the above load, as it works on tables that already contain data.
Append adds more data to the existing data in the target tables. If any duplicate record is found in the
input data, it may be appended as a duplicate (or) it may be rejected.
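The Load (replace) and Append behaviors can be sketched against a small table using Python's built-in sqlite3 module; the employees table mirrors the example above, and the appended "Ravi" row is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, role TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("John", "Manager"), ("Revanth", "Lead"), ("Ronald", "Developer")])

changed = [("John", "Manager"), ("Rohan", "Director"), ("Chetan", "AVP"), ("Das", "VP")]

# Load: remove the existing data, then load the new data in its place.
conn.execute("DELETE FROM employees")
conn.executemany("INSERT INTO employees VALUES (?, ?)", changed)
count_after_load = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# Append: add more data on top of what is already there.
conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Ravi", "Analyst")])
count_after_append = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(count_after_load, count_after_append)  # 4 5
```

The DELETE-then-INSERT pair is the "replace" semantics of Load; Append is a plain INSERT that leaves existing rows untouched.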
What is a Star Schema?
A Star Schema is a data warehouse schema in which the center of the star has one fact table and
a number of associated dimension tables. It is known as a star schema because its structure
resembles a star. The star schema data model is the simplest type of data warehouse
schema. It is also known as the Star Join Schema and is optimized for querying large data sets.
In the following Star Schema example, the fact table is at the center which contains keys to
every dimension table like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID & other
attributes like Units sold and revenue.
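The fact-plus-dimensions layout can be sketched as tables; this is a minimal two-dimension version of the example (product and dealer only), built with Python's sqlite3 module, and the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension tables surrounding the fact table.
conn.execute("CREATE TABLE dim_product (Product_ID INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dim_dealer (Dealer_ID INTEGER PRIMARY KEY, name TEXT)")
# Fact table at the center: keys to each dimension plus measures.
conn.execute("""
    CREATE TABLE fact_sales (
        Product_ID INTEGER REFERENCES dim_product(Product_ID),
        Dealer_ID  INTEGER REFERENCES dim_dealer(Dealer_ID),
        units_sold INTEGER,
        revenue    REAL
    )
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Sedan')")
conn.execute("INSERT INTO dim_dealer VALUES (10, 'City Motors')")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 3, 75000.0)")

# A typical star-schema query joins the fact table to its dimensions.
row = conn.execute("""
    SELECT p.name, d.name, f.revenue
    FROM fact_sales f
    JOIN dim_product p ON p.Product_ID = f.Product_ID
    JOIN dim_dealer  d ON d.Dealer_ID  = f.Dealer_ID
""").fetchone()
print(row)  # ('Sedan', 'City Motors', 75000.0)
```

Every query joins the fact table directly to each dimension it needs (one hop per dimension), which is why the star shape is optimized for large analytical scans.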
Example of a Star Schema diagram
In the following Snowflake Schema example, Country is further normalized into an individual
table.
Star Schema vs. Snowflake Schema:
Star Schema: denormalized data structure, so queries run faster; a single dimension table contains
aggregated data.
Snowflake Schema: normalized data structure; data is split into different dimension tables.
As you can see in the above example, there are two fact tables:
1. Revenue
2. Product
Figure 1 shows horizontal partitioning or sharding. In this example, product inventory data is divided into
shards based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G
and H-Z), organized alphabetically. Sharding spreads the load over more computers, which reduces
contention and improves performance.
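The shard routing in Figure 1 can be sketched as a range lookup on the first letter of the product key; the key values are hypothetical, and `bisect` is used here just to show range-based routing:

```python
import bisect

# Shard boundary matching the two ranges in Figure 1:
# product keys starting A-G go to shard 0, H-Z to shard 1.
boundaries = ["H"]  # keys whose first letter sorts before "H" land in shard 0

def shard_for(product_key):
    # bisect_right finds which contiguous key range the key falls into.
    return bisect.bisect_right(boundaries, product_key[0].upper())

shards = {0: [], 1: []}
for key in ["Apple", "Grape", "Hammer", "Torch"]:
    shards[shard_for(key)].append(key)

print(shards)  # {0: ['Apple', 'Grape'], 1: ['Hammer', 'Torch']}
```

Adding a shard is just adding another boundary; each shard then holds one contiguous range of keys, which keeps related rows together while spreading the overall load.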
Figure 1 - Horizontally partitioning (sharding) data based on a partition key.
Vertical partitioning
The most common use for vertical partitioning is to reduce the I/O and performance costs associated with
fetching items that are frequently accessed. Figure 2 shows an example of vertical partitioning. In this
example, different properties of an item are stored in different partitions. One partition holds data that is
accessed more frequently, including product name, description, and price. Another partition holds
inventory data: the stock count and last-ordered date.
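The split in Figure 2 can be sketched by dividing one item's properties into two partitions; the item and its field names are hypothetical, and the shared "id" key is what would link the partitions back together:

```python
# A hypothetical product item with all of its properties together.
item = {
    "id": 42,
    "name": "Widget",
    "description": "A useful widget",
    "price": 9.99,
    "stock_count": 120,
    "last_ordered": "2020-09-23",
}

# Vertical partitioning: frequently accessed properties in one partition,
# inventory data in another; both keep the id for re-joining.
hot_partition = {k: item[k] for k in ("id", "name", "description", "price")}
cold_partition = {k: item[k] for k in ("id", "stock_count", "last_ordered")}

# A price lookup now touches only the smaller, frequently read partition.
print(hot_partition["price"])  # 9.99
```

Reads that only need name, description, and price fetch fewer bytes per item, which is the I/O saving the text describes.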