Data Is A Collection of Facts, Such As Numbers, Words, Measurements, Observations or Just Descriptions of
Prachi
Syllabus:
UNIT-I
Data Warehouse: Need for data warehouse, Definition, Goals of data Warehouse , Challenges faced
during Warehouse Construction, Advantages, Types of Warehouse: Data Mart, Virtual Warehouse and
Enterprise Warehouse. Components of Warehouse: Fact data, Dimension data, Fact table and
Dimension table, Designing fact tables. Pre-requisite Phases: Extract, Transform and load process.
Warehouse Schema for multidimensional data: star, snowflake and galaxy schemas
-----------------------------------------------------------------------------------------------------------------------------------------
DATA
Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of
things. Data is
• Raw
• unorganized facts
• unprocessed
• not clean
INFORMATION
Information is data that has been processed, organized, structured, or presented in a given context so as to make it useful.
Unit 1: Data warehouse and Data Mining Ms. Prachi
WAREHOUSE
Definition:
A large building where raw materials or manufactured goods may be stored prior to their distribution
for sale
A data warehouse (DW) is a repository of information collected from multiple heterogeneous sources (flat files,
databases, tables, records, reports, online transactions, etc.) and managed to support meaningful and useful decisions.
Definition of a data warehouse according to William H. Inmon, a leading architect in the construction of data warehouse
systems:
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process”
1. Subject oriented
Data warehouse is organized around major subject areas like
▪ customer
▪ supplier
▪ product
▪ sales
▪ policy
▪ claims
It does not focus on day-to-day transactions; rather, it helps decision makers extract useful and valuable
information about their subject.
2. Integrated
It integrates multiple heterogeneous sources such as databases, flat files, and online transaction records. To make the data
consistent, data cleaning and data integration techniques are applied to naming conventions, attributes, etc.
3. Time variant
Historical data is stored in the data warehouse to provide information from a historical perspective. E.g.,
data from the past 5-10 years is collected over time for time series analysis.
4. Non-volatile
Data in the warehouse is permanent and cannot be erased or deleted when new data is inserted. It does not
require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations
in data accessing: initial loading of data and access of data.
Advantages / Need of Data warehouse
• Improved Business intelligence: Data Warehouse helps to integrate many sources of data to
make business decisions based on the complete data
• Time to access: quick to access critical information as all data is in centralized location
• Analysis & Reporting: Data warehouse helps to reduce total turnaround time for analysis and
reporting
• Historical intelligence: Data warehouse stores a large amount of historical data. This helps
users to analyze different time periods and trends to make future predictions
1. Enables Historical Insight: No business can survive without a large and accurate
storehouse of historical data, from sales and inventory data to personnel and intellectual
property records. If a business executive suddenly needs to know the sales of a key
product 24 months ago, the rich historical data provided by a data warehouse make this
possible. Also important, a data warehouse can add context to this historical data by
listing all the key performance trends that surround this retrospective research. This kind
of efficiency cannot be matched by a legacy database.
2. Enhances Conformity And Quality Of Data: Your business generates data in myriad
different forms, including structured and unstructured data, data from social media, and
data from sales campaigns. A data warehouse converts this data into the consistent
formats required by your analytics platforms. Moreover, by ensuring this conformity, a data
warehouse ensures that the data produced by different business divisions is at the same
quality and standard – allowing a more efficient feed for analytics.
3. Boosts Efficiency: It’s very time consuming for a business user or a data scientist to have
to gather data from multiple sources. It’s far more advantageous for this data to be
gathered in one place, hence the benefit of a data warehouse. Additionally, if for instance
your data scientist needs data to run a fast report, they don’t need to get the assistance
from tech support to perform this task. A data warehouse makes this data readily
available – in the correct format – improving efficiency of the entire process.
4. Increases The Power And Speed Of Data Analytics: Business intelligence and data
analytics are the opposite of instinct and intuition. BI and analytics require high quality,
standardized data – on time and available for rapid data mining. A data warehouse
enables this power and speed, allowing competitive advantage in key business sectors,
ranging from CRM to HR to sales success to quarterly reporting.
5. Drives Revenue: A tech pundit opined that “data is the new oil,” referring to the high dollar
value of data in today’s world. Creating more standardized and better quality data is the
key strength of a data warehouse, and this key strength translates clearly to significant
revenue gains. The data warehouse formula works like this: Better business intelligence
helps with better decisions, and in turn better decisions create a higher return on
investment across any sector of your business. Most important, these revenue gains
build on themselves over time, as better decisions strengthen the business. In short, a
high quality, fully scalable data warehouse can be seen as less of a cost and more of an
investment – one that adds exponential value like few other investments that businesses
make.
6. Scalability: The top key word in the cloud era is “scalable” and a data warehouse is a
critical component in driving this scale. A topflight data warehouse is itself scalable, and
also enables greater scalability in the business overall. That is, today’s sophisticated data
warehouses are built to scale, handling ever more queries as the business grows (though
this will require more supporting hardware). Additionally, the efficiency in data flow
enabled by a data warehouse greatly boosts a business’s growth – this growth is the
core of business scalability.
7. Interoperates With On-Premises And Cloud: Unlike the legacy databases of yesteryear,
today’s data warehouses are built with multicloud and hybrid cloud in mind. Many data
warehouses are now fully cloud-based, and even those that are built for on-premises use
typically will interoperate well with the cloud-based portion of a company’s infrastructure.
As an additional important side point: this cloud-based focus also means that mobile
users are better able to access the data warehouse – this is beneficial for sales reps in
particular.
8. Data Security: A number of key advances in data warehouses have enhanced their security,
which enhances the overall security of company data. Among these advances are
techniques like a “slave read only” set up, which blocks malicious SQL code, and
encrypted columns, which protects confidential data. Some businesses set up custom
user groups on their data warehouses, which can include or exclude various data pools,
and even give permission on a row by row basis.
9. Much Higher Query Performance And Insight: The constant business intelligence queries
that are part of today’s business can put a major strain on an analytics infrastructure,
from the legacy databases to the data marts. Having a data warehouse to more
effectively handle queries removes some of the pressure on the system. Furthermore,
since a data warehouse is specifically geared to handle massive levels of data and
myriad complex queries, it’s the high functioning core of any business’s data analytics
practice.
10. Provides Major Competitive Advantage: This is absolutely the bottom line benefit of a
data warehouse: it allows a business to more effectively strategize and execute against
other vendors in its sector. With the quality, speed and historical context provided by a
data warehouse, the greater insight in data mining can drive decisions that create more
sales, more targeted products, and faster response times. In short, a data warehouse
improves business decision making, which in turn gives any business a key competitive
advantage.
Many organizations use this information to support business decision-making activities, including:
(1) Increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference,
buying time, budget cycles, and appetites for spending)
(2) Repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year,
and by geographic regions in order to fine-tune production strategies
(3) Analyzing operations and looking for sources of profit
(4) Managing customer relationships
● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing
● Airline:
● In the airline industry, it is used for operational purposes such as crew assignment, analysis
of route profitability, frequent flyer program promotions, etc.
● Banking:
● It is widely used in the banking sector to manage the resources available on desk
effectively. A few banks also use it for market research and performance analysis of
products and operations.
● Healthcare:
● The healthcare sector also uses data warehouses to strategize and predict outcomes,
generate patients’ treatment reports, share data with tie-in insurance companies,
medical aid services, etc.
● Public sector:
● In the public sector, a data warehouse is used for intelligence gathering. It helps
government agencies maintain and analyze tax records and health policy records for
every individual.
● Investment and Insurance sector:
● In this sector, the warehouses are primarily used to analyze data patterns, customer
trends, and to track market movements.
● Retail chain:
● In retail chains, a data warehouse is widely used for distribution and marketing. It also
helps to track items, customer buying patterns, and promotions, and is used for
determining pricing policy.
● Telecommunication:
● A data warehouse is used in this sector for product promotions, sales decisions and to
make distribution decisions.
● Hospitality Industry
● This Industry utilizes warehouse services to design as well as estimate their
advertising and promotion campaigns where they want to target clients based on their
feedback and travel patterns.
Enterprise warehouse
Data Mart
1. A DATA MART is focused on a single functional area of an organization and contains a subset of data stored in a Data
Warehouse.
2. Designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR or finance
3. Collects information from small number of sources
4. Small in size and more flexible compared to warehouse
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The
scope is confined to specific selected subjects. For example, a marketing data mart may confine its
subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data
marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows
based. The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years. However, it may involve complex integration in the long run if its design and planning
were not enterprise-wide. Depending on the source of data, data marts can be categorized as
independent or dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally within a
particular department or geographic area. Dependent data marts are sourced directly from enterprise
data warehouses.
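As a rough illustration of a dependent data mart, the sketch below derives a small marketing mart from an in-memory stand-in for the enterprise warehouse. The row layout, the column names, and the `build_marketing_mart` helper are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical warehouse rows; a real warehouse would hold far more attributes.
warehouse_sales = [
    {"customer": "C1", "item": "TV", "supplier": "S1", "dollars_sold": 500.0},
    {"customer": "C1", "item": "TV", "supplier": "S2", "dollars_sold": 450.0},
    {"customer": "C2", "item": "Phone", "supplier": "S1", "dollars_sold": 300.0},
]

def build_marketing_mart(rows):
    """Keep only the customer and item subjects and summarize the sales."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["customer"], row["item"])] += row["dollars_sold"]
    return dict(totals)

marketing_mart = build_marketing_mart(warehouse_sales)
print(marketing_mart)
```

The mart drops the supplier detail and stores summarized totals, which mirrors the point that data in a mart is a subset of the warehouse and tends to be summarized.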
This model creates a virtual view of databases, allowing the creation of a “virtual
warehouse” as opposed to a physical warehouse. In a virtual warehouse, you have a
logical description of all the databases and their structures; individuals who want to get
information from those databases do not have to know anything about them.
This approach creates a single “virtual database” from all the resources. The data
resources can be local or remote. In this type of data warehouse, the data is not
moved from the sources: Instead, the users are given direct access to the data. Direct
access to the data is sometimes through simple SQL queries, view definition, or data-
access middleware.
With this approach, it is possible to access remote data sources including major
RDBMSs. The virtual data warehouse scheme lets a client application access data
distributed across multiple data sources through a single SQL statement, a single
interface. All data sources are accessed as though they are local; users and their
applications do not even need to know the physical location of the data.
There is a great benefit in starting with a virtual warehouse since many organizations
do not want to replicate information in the physical data warehouse. Some
organizations decide to provide both by creating a data warehouse containing
summary-level data with access to legacy data for transaction details.
A virtual database is easy and fast, but it is not without problems. Since the queries
must compete with the production data transactions, their performance can be
considerably degraded. Since there is no metadata, no summary data, and no history, all
the queries must be repeated, creating an additional burden on the system. Above
all, there is no cleaning and refreshing process, which causes the queries to
become very complex.
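One simple way to picture a virtual warehouse is a SQL view defined over the source tables. The sketch below uses Python's built-in `sqlite3` module with two hypothetical operational tables (`store_orders` and `web_orders`); in practice the sources would usually be separate systems reached through middleware, so this is only an approximation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical operational tables standing in for separate source systems.
conn.executescript("""
    CREATE TABLE store_orders (order_id INTEGER, amount REAL);
    CREATE TABLE web_orders   (order_id INTEGER, amount REAL);
    INSERT INTO store_orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO web_orders   VALUES (3, 75.0);

    -- The virtual warehouse is only a view: no data is copied or moved.
    CREATE VIEW all_orders AS
        SELECT 'store' AS channel, order_id, amount FROM store_orders
        UNION ALL
        SELECT 'web' AS channel, order_id, amount FROM web_orders;
""")

# A single SQL statement answers a question that spans both sources.
total = conn.execute("SELECT SUM(amount) FROM all_orders").fetchone()[0]
print(total)
```

Because the view is evaluated against the live source tables on every query, it also shows why such queries compete with production transactions.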
Pre-requisite Phases:
Extract, Transform and load process
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and
utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse
format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
What is ETL?
ETL is a process that extracts the data from different source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally
loads the data into the Data Warehouse system. Full form of ETL is Extract,
Transform and Load.
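The three phases can be sketched end to end in a few lines. This is a minimal illustration only: the CSV text stands in for a flat-file source, the `sales` table stands in for the warehouse, and the function names `extract`, `transform`, and `load` are chosen for this example.

```python
import csv
import io
import sqlite3

# Hypothetical source: CSV text standing in for a flat-file export.
SOURCE_CSV = "item,units,unit_price\nTV,2,500\nPhone,1,300\n"

def extract(text):
    """Read raw rows from the source; every value arrives as a string."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Convert the raw strings into the types the warehouse expects."""
    return [(r["item"], int(r["units"]), float(r["unit_price"])) for r in rows]

def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, units INTEGER, unit_price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT item, units, unit_price FROM sales ORDER BY item").fetchall()
print(loaded)
```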
It’s tempting to think that creating a data warehouse is simply extracting data from
multiple sources and loading it into the warehouse database. This is far from
the truth: it requires a complex ETL process. The ETL process requires active
input from various stakeholders, including developers, analysts, testers, and top
executives, and is technically challenging.
In order to maintain its value as a tool for decision-makers, the data warehouse
system needs to change with business changes. ETL is a recurring activity (daily,
weekly, monthly) of a data warehouse system and needs to be agile, automated,
and well documented.
Step 1) Extraction
In this step of the ETL architecture, data is extracted from the source system into the staging
area. Transformations, if any, are done in the staging area so that the performance of the source
system is not degraded. Also, if corrupted data is copied directly from the source into the
data warehouse database, rollback will be a challenge. The staging area gives an
opportunity to validate extracted data before it moves into the data warehouse.
Data warehouse needs to integrate systems that have different
Step 2) Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore it
needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL
process adds value and changes data such that insightful BI reports can be generated.
It is one of the important ETL concepts, where you apply a set of functions to the extracted
data. Data that does not require any transformation is called direct move or pass-through
data.
In the transformation step, you can perform customized operations on data. For instance, the
user may want sum-of-sales revenue, which is not in the database, or the first name
and the last name in a table may be in different columns; it is possible to compute or concatenate them
before loading.
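The two customized operations mentioned above can be sketched as follows. The staged rows and field names are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical rows as they might arrive in the staging area.
staged = [
    {"first_name": "Asha", "last_name": "Rao", "units": 2, "unit_price": 10.0},
    {"first_name": "John", "last_name": "Lee", "units": 1, "unit_price": 25.0},
]

def transform(row):
    """Derive fields that are not stored in the source system."""
    return {
        "full_name": f"{row['first_name']} {row['last_name']}",  # concatenation
        "revenue": row["units"] * row["unit_price"],             # calculation
    }

transformed = [transform(r) for r in staged]
sum_of_sales = sum(r["revenue"] for r in transformed)
print(transformed, sum_of_sales)
```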
Data Integration Issues
The following are common data integrity problems:
1. Different spellings of the same person's name, like Jon and John.
2. Multiple ways to denote a company name, like Google and Google Inc.
3. Use of different names for the same place, like Cleaveland and Cleveland.
4. Different account numbers generated by various
applications for the same customer.
5. Required fields left blank in some data.
6. Invalid product data collected at POS, as manual entry can lead to mistakes.
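Problems such as variant spellings are often handled with lookup tables that map each known variant to one canonical value. The sketch below is a minimal version of that idea; the lookup contents and the `standardize` helper are assumptions for illustration.

```python
# Hypothetical lookup tables mapping known variants to one canonical value.
CITY_LOOKUP = {"cleaveland": "Cleveland", "cleveland": "Cleveland"}
COMPANY_LOOKUP = {"google": "Google", "google inc.": "Google"}

def standardize(value, lookup):
    """Return the canonical form, or the trimmed input if no rule matches."""
    key = value.strip().lower()
    return lookup.get(key, value.strip())

print(standardize("Cleaveland", CITY_LOOKUP))
print(standardize("Google Inc.", COMPANY_LOOKUP))
```

Values with no matching rule pass through unchanged, so unknown variants still need to be reviewed and added to the lookup table over time.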
Validations are done during this stage
Filtering – Select only certain columns to load
Using rules and lookup tables for Data standardization
Character Set Conversion and encoding handling
Conversion of Units of Measurements like Date Time Conversion, currency
conversions, numerical conversions, etc.
Data threshold validation check. For example, age cannot be more than two
digits.
Data flow validation from the staging area to the intermediate tables.
Required fields should not be left blank.
Cleaning (for example, mapping NULL to 0, or Gender Male to “M” and Female to
“F”, etc.)
Splitting a column into multiple columns and merging multiple columns into a single column.
Transposing rows and columns.
Using lookups to merge data.
Applying any complex data validation (e.g., if the first two columns in a row are
empty, the row is automatically rejected from processing).
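A few of the validations listed above can be combined into one routine, sketched below. The field names (`purchases`, `gender`, `age`) and the `clean_and_validate` helper are hypothetical, and a real pipeline would typically log rejected rows rather than silently drop them.

```python
GENDER_CODES = {"male": "M", "female": "F"}

def clean_and_validate(row):
    """Return a cleaned copy of row, or None if the row must be rejected."""
    # Complex rule: reject the row when its first two columns are both empty.
    first_two = list(row.values())[:2]
    if all(v in (None, "") for v in first_two):
        return None
    cleaned = dict(row)
    # Cleaning: map NULL to 0 and spelled-out genders to one-letter codes.
    if cleaned.get("purchases") is None:
        cleaned["purchases"] = 0
    gender = str(cleaned.get("gender", "")).lower()
    cleaned["gender"] = GENDER_CODES.get(gender, cleaned.get("gender"))
    # Required field and threshold check: age must be present and at most two digits.
    age = cleaned.get("age")
    if age is None or not 0 <= age <= 99:
        return None
    return cleaned

good = clean_and_validate(
    {"name": "Asha", "customer_id": "C1", "gender": "Female", "age": 34, "purchases": None}
)
print(good)
```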
Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL process.
In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short
period (nights). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the
point of failure without loss of data integrity. Data warehouse admins need to monitor,
resume, and cancel loads as per prevailing server performance.
Types of Loading:
Initial Load: populating all the data warehouse tables.
Incremental Load: applying ongoing changes periodically, as and when needed.
Full Refresh: erasing the contents of one or more tables and reloading with
fresh data.
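The three load types can be contrasted with a small `sqlite3` sketch. The `dim_item` table and the helper function names are assumptions for this example; `INSERT OR REPLACE` is SQLite-specific syntax, and other databases offer their own upsert or merge statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT)")

def initial_load(rows):
    """Populate the empty table for the first time."""
    conn.executemany("INSERT INTO dim_item VALUES (?, ?)", rows)

def incremental_load(rows):
    """Apply only new or changed rows; existing keys are overwritten."""
    conn.executemany("INSERT OR REPLACE INTO dim_item VALUES (?, ?)", rows)

def full_refresh(rows):
    """Erase the table contents and reload with fresh data."""
    conn.execute("DELETE FROM dim_item")
    conn.executemany("INSERT INTO dim_item VALUES (?, ?)", rows)

initial_load([(1, "TV"), (2, "Phone")])
incremental_load([(2, "Smartphone"), (3, "Laptop")])
after_incremental = conn.execute("SELECT * FROM dim_item ORDER BY item_key").fetchall()
full_refresh([(1, "TV")])
after_refresh = conn.execute("SELECT * FROM dim_item ORDER BY item_key").fetchall()
print(after_incremental, after_refresh)
```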
Load verification
Ensure that the key field data is neither missing nor null.
Test modeling views based on the target tables.
Check combined values and calculated measures.
Data checks in dimension table as well as history table.
Check the BI reports on the loaded fact and dimension table.
ETL Tools
Many data warehousing tools are available in the market. Here are some of the
most prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier and
faster using an array of enterprise features. It can query different types of data like
documents, relationships, and metadata.
https://www.marklogic.com/product/getting-started/
2. Oracle:
Oracle is the industry-leading database. It offers a wide range of choice of Data
Warehouse solutions for both on-premises and in the cloud. It helps to optimize
customer experiences by increasing operational efficiency.
https://www.oracle.com/index.html
3. Amazon RedShift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool to analyze
all types of data using standard SQL and existing BI tools. It also allows running complex
queries against petabytes of structured data.
https://aws.amazon.com/redshift/?nc2=h_m1
2. Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
3. The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
4. Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging
rules, and security
5. Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles.
6. Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
The major task of online operational database systems is to perform online transaction and query processing. They cover
most day-to-day operations such as transactions, purchases, inventory, etc.
Components of Warehouse:
A data cube allows the data in the data warehouse to be modeled in multiple
dimensions. Data in the data warehouse is multidimensional, having measures and
dimensions. Measures are the values that can be quantified and aggregated, e.g., with SUM() or
AVG().
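To make the distinction concrete, the sketch below aggregates one measure along one dimension. The sample cells and the names `sum_by_item` and `avg_by_item` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cells: item and branch are dimension values, units_sold is the measure.
sales = [
    ("TV", "B1", 4),
    ("TV", "B2", 6),
    ("Phone", "B1", 10),
]

by_item = defaultdict(list)
for item, branch, units_sold in sales:
    by_item[item].append(units_sold)

sum_by_item = {item: sum(v) for item, v in by_item.items()}   # like SUM()
avg_by_item = {item: mean(v) for item, v in by_item.items()}  # like AVG()
print(sum_by_item, avg_by_item)
```

Grouping by `branch` instead of `item` would give a different view of the same measure, which is the sense in which the data is multidimensional.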
Fact data/facts
A fact is a quantitative piece of information - such as a sale or a download by which one can analyze the
relationships between dimensions. Facts are stored in fact tables, and have a foreign key relationship with a
number of dimension tables.
Dimensions
Dimensions provide descriptions of the objects in the fact table. They give the context needed to analyse the numerical
content of the facts. A dimension stores data about the ways in which the data in the fact table can be
analyzed.
These are the entities with respect to which an organization wants to keep records.
Example
Suppose an electronics store has created a sales data warehouse. It wants to maintain
records based on the dimensions item, branch, and location.
These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations
at which the items were sold. Each dimension may have a table associated with it, called a dimension table, which
further describes the dimension. E.g., the dimension table for item is given below.
Item
Item_Id
Item_name
Brand
Type
Cost
Fact Table
It is a collection of facts, measurements, and metrics. Facts for the sales data warehouse include
dollars_sold, units_sold, and amount_budget. The fact table contains
the names of the facts, or measures, as well as keys to each of the related dimension tables. Let
if we add the 4th dimension supplier
Difference between Fact Table and
Dimension Table
A fact table record is a combination of attributes from different
dimension tables. The fact table helps the user analyze the business
dimensions, which supports decision making to improve the business.
Dimension tables, on the other hand, help the fact table gather the
dimensions along which the measures are taken.
The main difference between the fact table and the dimension table is that the
dimension table contains the attributes along which the measures in the fact table are taken.
Difference between Fact Table and Dimension Table:
S.NO Fact Table Dimension Table
5. The attribute format of fact table While the attribute format of dimension table
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph
resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact
table.
Example 4.1 Star schema. A star schema for AllElectronics sales is shown in Figure 4.6. Sales are
considered along four dimensions: time, item, branch, and location. The schema contains a central
fact table for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key
and item key) are system-generated identifiers. Notice that in the star schema, each dimension is
represented by only one table, and each table contains a set of attributes. For example, the location
dimension table contains the attribute set {location key, street, city, province or state, country}. This
constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are both cities in
the state of Illinois, USA. Entries for such cities in the location dimension table will create redundancy
among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL,
USA). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a
lattice (partial order).
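A star schema of this kind can be sketched with `sqlite3`. To keep the example short, only two of the four dimensions (item and location) are modeled, attribute names use underscores, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT,
                           province_or_state TEXT, country TEXT);
    -- Central fact table: keys to the dimensions plus the two measures.
    CREATE TABLE sales (
        item_key     INTEGER REFERENCES item(item_key),
        location_key INTEGER REFERENCES location(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );
    INSERT INTO item     VALUES (1, 'TV', 'BrandA');
    INSERT INTO location VALUES (1, 'Urbana', 'IL', 'USA'), (2, 'Chicago', 'IL', 'USA');
    INSERT INTO sales    VALUES (1, 1, 500.0, 1), (1, 2, 1000.0, 2);
""")

# One join from the fact table to a dimension table answers a per-state question.
rows = conn.execute("""
    SELECT l.province_or_state, SUM(s.dollars_sold)
    FROM sales s JOIN location l ON s.location_key = l.location_key
    GROUP BY l.province_or_state
""").fetchall()
print(rows)
```

Note that 'IL' and 'USA' are stored once per city in the location table, which is the redundancy described above.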
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The resulting
schema graph forms a shape similar to a snowflake. The major difference between the snowflake and
star schema models is that the dimension tables of the snowflake model may be kept in normalized
form to reduce redundancies. Such a table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude of the fact table. Furthermore,
the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to
execute a query. Consequently, the system performance may be adversely impacted. Hence,
although the snowflake schema reduces redundancy, it is not as popular as the star schema in data
warehouse design
Example 4.2 Snowflake schema. A snowflake schema for AllElectronics sales is given in Figure 4.7.
Here, the sales fact table is identical to that of the star schema in Figure 4.6. The main difference
between the two schemas is in the definition of dimension tables. The single dimension table for item
in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name, brand, type,
and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key
and supplier type information. Similarly, the single dimension table for location in the star schema can
be normalized into two new tables: location and city. The city key in the new location table links to the
city dimension. Notice that, when desirable, further normalization can be performed on province or
state and country in the snowflake schema shown in Figure 4.7.
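The effect of normalizing the item dimension can be sketched as follows. The tables are reduced to a few columns and the sample rows are invented; the point is the extra join needed to reach supplier information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The item dimension is split: supplier details move to their own table.
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       supplier_key INTEGER REFERENCES supplier(supplier_key));
    CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key), dollars_sold REAL);
    INSERT INTO supplier VALUES (10, 'wholesale');
    INSERT INTO item     VALUES (1, 'TV', 10), (2, 'Phone', 10);
    INSERT INTO sales    VALUES (1, 500.0), (2, 300.0);
""")

# Reaching supplier_type now takes two joins instead of one.
rows = conn.execute("""
    SELECT sp.supplier_type, SUM(s.dollars_sold)
    FROM sales s
    JOIN item i      ON s.item_key = i.item_key
    JOIN supplier sp ON i.supplier_key = sp.supplier_key
    GROUP BY sp.supplier_type
""").fetchall()
print(rows)
```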
GALAXY
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and
hence is called a galaxy schema or a fact constellation.
Example 4.3 Fact constellation. A fact constellation schema is shown in Figure 4.8. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star
schema (Figure 4.6). The shipping table has five dimensions, or keys—item key, time key, shipper
key, from location, and to location—and two measures—dollars cost
and units shipped. A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimensions tables for time, item, and location are shared between the sales
and shipping fact tables.
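A fact constellation can be sketched as two fact tables that reference the same dimension table. Only the shared item dimension is modeled here, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    -- Two fact tables share the item dimension.
    CREATE TABLE sales    (item_key INTEGER REFERENCES item(item_key),
                           dollars_sold REAL, units_sold INTEGER);
    CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key),
                           dollars_cost REAL, units_shipped INTEGER);
    INSERT INTO item     VALUES (1, 'TV');
    INSERT INTO sales    VALUES (1, 500.0, 1), (1, 1000.0, 2);
    INSERT INTO shipping VALUES (1, 40.0, 3);
""")

# Each fact table is aggregated separately, then lined up on the shared dimension.
rows = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold)    FROM sales s     WHERE s.item_key  = i.item_key),
           (SELECT SUM(units_shipped) FROM shipping sh WHERE sh.item_key = i.item_key)
    FROM item i
    ORDER BY i.item_key
""").fetchall()
print(rows)
```

Aggregating each fact table in its own subquery avoids double counting that a direct join between the two fact tables could cause.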