Unit 1 DWDM Complete

The document provides an overview of data warehousing and data mining, defining a data warehouse as a centralized repository of integrated data from multiple sources designed to support business intelligence and decision-making. It discusses the historical context, need, characteristics, and benefits of data warehouses, as well as the differences between data warehouses, databases, data marts, and data lakes. Additionally, it outlines the steps involved in implementing a data mart, emphasizing the importance of data quality, historical insight, and improved business analytics.


Data Warehouse &

Data Mining
BCA VI
DATA WAREHOUSING
A Data Warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually
residing at a single site.
W. H. Inmon is considered the father of the data warehouse.
A data warehouse is a subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of management's decisions.
Data warehouses generalize and consolidate data in
multidimensional space. The construction of data
warehouses involves data cleaning, data integration,
and data transformation, and can be viewed as an
important preprocessing step for data mining.

Data warehouses are important assets that help organizations maintain:
• efficiency
• profitability
• competitive advantage

Goals of Data Warehousing


• Support reporting as well as analysis
• Maintain the organization's historical information
• Serve as the foundation for decision making
History of Data Warehouse
The idea of data warehousing dates to the late 1980s,
when IBM researchers Barry Devlin and Paul Murphy
established the "Business Data Warehouse."
In essence, the data warehousing idea was planned to
support an architectural model for the flow of
information from operational systems to decision
support environments. The concept attempted to address
the various problems associated with this flow, mainly
the high costs associated with it.
In the absence of a data warehousing architecture, a
vast amount of space was required to support multiple
decision support environments. In large corporations, it
was common for various decision support environments
to operate independently.
Need of Data Warehousing
Better business analytics: A data warehouse plays an
important role in every business by storing all of the
company's past data and records, which further improves
the understanding and analysis of data for the company.
Faster queries: The data warehouse is designed to handle
large queries, which is why it runs queries faster than an
operational database.
Improved data quality: In the data warehouse, the data
gathered from different sources is stored and analyzed
without being altered, so the quality of the data is
maintained; if any data quality issue arises, the data
warehouse team will resolve it.
Historical insight: The warehouse stores all your historical
data, which contains details about the business, so that one
can analyze it at any time and extract insights from it.
Characteristics of Data
Warehouse
What is a database?
A database is an aggregation of ordered, electronically
recorded data that has been structured and organized. Data
that follows a pre-established format is referred to as
structured data and is easier to access.
Structured data follows a tabular structure with
relationships between the various rows and columns.
Why use a database?
• Databases can effectively store vast numbers of records.
• It is incredibly simple and quick to locate data.
• It's simple to add new data and modify or delete available data.
• Data can be easily searched in a database using techniques like indexing,
binary searching, etc.
• Data can be quickly and easily sorted in a database.
• Data can be imported into other applications easily.
• A database is multi-access, meaning that more than one person can use
the same database at the same time.
• As the database offers extra security patterns for authorized access, the
security of data there is higher than that of physical paper files.
• Databases are also used for transaction administration: they maintain
accuracy and uniformity throughout transactions, which are groups of
programs used for a logical process.
What is a data warehouse?
• The purpose of a data warehouse is to support and facilitate
business intelligence (BI) activities, especially analytics.
• Because they are designed for queries and analysis, data
warehouses frequently contain significant amounts of historical
data.
• Transaction applications and application log files are two
common sources of the data found in data warehouses.
• Many data sets from different sources are centralised and
compiled in a data warehouse.
• With the help of its analytical capabilities, businesses can get more out
of their data and make better decisions by gaining useful business
insights.
Benefits of data warehouses
• Data warehouses offer companies a variety of advantages. The
following are a few of the most widespread advantages:
• Offer a reliable, central location for storing large amounts of historical data
• Enhance company procedures and decision-making with actionable insights
• Increase a business's overall return on investment (ROI)
• Boost data integrity
• Improve BI performance and capabilities by drawing on numerous sources
• Give all company users access to historical data
• Enhance company analytics with AI and machine learning
Difference between Data Warehousing and Database

Parameter | Data Warehouse | Database
Workloads | Analytical | Transactional and operational
Characteristics | Subject-focused: it provides information on a certain topic rather than on a company's current activities. | Removes redundancy and offers security. It allows for numerous data views.
Data Type | Stores both historical and current data; the data may be out of date and might not be updated. | The data in the database is kept updated.
Orientation | Depends on the frequency of ETL processes. | Real-time
Purpose | Designed to analyze | Designed to record
Tables and Joins | Tables and joins are straightforward since they are denormalized. | A database's tables and joins are complicated because they are normalized.
Availability | Data is refreshed from source systems when needed. | It is available in real-time.
Technique | Analyze data | Capture data
Query Type | Complex queries are utilized for analytical reasons. | Simple transaction queries are implemented.
Schema Flexibility | Fixed and pre-defined schema definition for ingest. | Flexible or rigid schema based on the type of database.
Users | Data scientists and business analysts | Application developers
Processing Method | Uses OLAP (Online Analytical Processing). | Uses OLTP (Online Transaction Processing).
Storage Limit | Data from any number of applications is stored. | Generally confined to a particular application.
Usage | Data modeling approaches are employed for designing. It permits you to analyze your enterprise. | ER modeling approaches are employed for designing. It aids the execution of basic business procedures.
Applications | Healthcare sector, airline, retail chain, insurance sector, banking, and telecommunication | Banking, universities, airlines, finance, telecommunication, manufacturing, sales and production, and HR
Need of Data Warehouse and Data
Mining
1) Business User: Business users require a data warehouse to
view summarized data from the past. Since these people are
non-technical, the data may be presented to them in an
elementary form.
2) Store historical data: A data warehouse is required to store
time-variant data from the past. This data is then used for
various purposes.
3) Make strategic decisions: Some strategies may depend
upon the data in the data warehouse, so the data warehouse
contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from
different sources to a common place, the user can effectively
ensure uniformity and consistency in the data.
5) High response time: Data warehouse has to be ready for
somewhat unexpected loads and types of queries, which
demands a significant degree of flexibility and quick response
time.
Building Blocks of Data
Warehouse
Source Data Component
Production Data: This type of data comes from the different operational
systems of the enterprise. Based on the data requirements in the data
warehouse, we choose segments of the data from the various operational
systems.
Internal Data: In each organization, clients keep their "private"
spreadsheets, reports, customer profiles, and sometimes even departmental
databases. This is the internal data, part of which could be useful in a data
warehouse.
Archived Data: Operational systems are mainly intended to run the current
business. In every operational system, we periodically take the old data and
store it in archived files.
External Data: Most executives depend on information from external
sources for a large percentage of the information they use. They use
statistics relating to their industry produced by external agencies.
Data Staging Component
1) Data Extraction: This method has to deal with numerous data
sources. We have to employ the appropriate techniques for each
data source.
2) Data Transformation: As we know, data for a data warehouse
comes from many different sources. If data extraction for a data
warehouse poses big challenges, data transformation presents
even greater challenges. We perform several individual tasks as
part of data transformation.
3) Data Loading: Two distinct categories of tasks form the data
loading function. When we complete the structure and
construction of the data warehouse and go live for the first time,
we do the initial loading of the information into the data
warehouse storage. The initial load moves high volumes of data,
consuming a substantial amount of time.
Data Storage Components
Data storage for the data warehouse is a separate repository.
The data repositories for the operational systems generally
include only the current data. Also, these
data repositories hold the data in a highly normalized structure
for fast and efficient processing.
Information Delivery Component
The information delivery component is used to enable the process of
subscribing for data warehouse files and having them transferred to
one or more destinations according to some customer-specified
scheduling algorithm.
Metadata Component and Data Marts
Metadata: Metadata in a data warehouse is similar to the data
dictionary or the data catalog in a database management system. In
the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the
indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to particular selected
subjects. Data in a data warehouse should be fairly current, but not
necessarily up to the minute, although developments in the data warehouse
industry have made standard and incremental data dumps more
achievable. Data marts are smaller than data warehouses and usually
serve a single department or business unit. The current trend in data
warehousing is to develop a data warehouse with several smaller related
data marts for particular kinds of queries and reports.
Management and Control
Component
• The management and control elements coordinate the
services and functions within the data warehouse.
• These components control the data transformation and
the data transfer into the data warehouse storage.
• They moderate the data delivery to the clients.
• They work with the database management systems and
ensure that data is correctly saved in the repositories.
• They monitor the movement of information into the
staging area and from there into the data warehouse
storage itself.
The difference between data marts,
data lakes, and data warehouses
A data warehouse is a data management system designed to support
business intelligence and analytics for an entire organization. Data
warehouses often contain large amounts of data, including historical
data. The data within a data warehouse usually is derived from a wide
range of sources, such as application log files and transactional
applications. A data warehouse stores structured data, whose purpose
is usually well-defined.
A data lake allows organizations to store large amounts of structured
and unstructured data (for example, from social media or clickstream
data), and to immediately make it available for real-time analytics,
data science, and machine learning use cases. With a data lake, data is
ingested in its original form, without alteration.
The key difference between a data lake and a data warehouse is that
data lakes store vast amounts of raw data, without a predefined
structure. Organizations do not need to know in advance how the data
will be used.
Data Mart
A data mart is a simple form of
a data warehouse that is focused
on a single subject or line of
business, such as sales, finance,
or marketing.
Given their focus, data marts
draw data from fewer sources
than data warehouses.
Data mart sources can include:
• internal operational systems,
• central data warehouse, and
• external data
Types of Data Marts
• Dependent Data Marts
• Independent Data Marts
Dependent Data Marts
A dependent data mart is a logical subset or a physical subset of a
larger data warehouse. In this technique, the data marts are
treated as subsets of a data warehouse.
First, a data warehouse is created, from which further
data marts can be created. These data marts are dependent on the
data warehouse and extract the essential records from it.
Because the data warehouse creates the data marts,
there is no need for data mart integration. This is also known as the
top-down approach.
Independent Data Marts

The second approach is independent data marts (IDM).
Here, independent data marts are created first, and
then a data warehouse is designed using these
multiple independent data marts.
In this approach, as all the data marts are designed
independently, the integration of the data marts
is required.
It is also termed the bottom-up approach, as the data
marts are integrated to develop the data warehouse.
Hybrid Data Marts

A hybrid data mart allows us to combine input from sources other
than a data warehouse. This can be helpful in many
situations, especially when ad hoc integrations are
needed, such as after a new group or product is added
to the organization.
The benefits of a data mart
1. A single source of truth. The centralized nature of a data mart
helps ensure that everyone in a department or organization
makes decisions based on the same data.
2. Quicker access to data. Specific business teams and users can
rapidly access the subset of data they need from the enterprise
data warehouse and combine it with data from various other
sources.
3. Faster insights leading to faster decision making. While a data
warehouse enables enterprise-level decision-making, a data mart
allows data analytics at the department level.
4. Simpler and faster implementation. A data mart, in contrast to a
data warehouse, is focused on serving the needs of specific business
teams, requiring access to fewer data sets. It is therefore much simpler
and faster to implement.
5. Creating agile and scalable data management. Data marts
provide an agile data management system that works in
tandem with business needs, including being able to use
information gathered in past projects to help with current
tasks.
6. Transient analysis. Some data analytics projects are short-
lived; for example, completing a specific analysis of
online sales for a two-week promotion prior to a team
meeting. Teams can rapidly set up a data mart to
accomplish such a project.
Steps in Implementing a Data Mart
The significant steps in implementing a data mart are to design
the schema, construct the physical storage, populate the data
mart with data from source systems, access it to make
informed decisions and manage it over time. So, the steps are:
1. Designing
2. Constructing
3. Populating
4. Accessing
5. Managing
Designing
The design step is the first in the data mart process. This phase
covers all of the functions from initiating the request for a data
mart through gathering data about the requirements and
developing the logical and physical design of the data mart.
It involves the following tasks:
1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data
mart.
Constructing
This step contains creating the physical database and logical
structures associated with the data mart to provide fast and
efficient access to the data.
It involves the following tasks:
1. Creating the physical database and logical structures, such
as tablespaces, associated with the data mart.
2. Creating the schema objects, such as tables and indexes,
described in the design step.
3. Determining how best to set up the tables and access
structures.
Populating
This step includes all of the tasks related to getting data
from the source, cleaning it up, modifying it to the right format
and level of detail, and moving it into the data mart.
It involves the following tasks:
1. Mapping data sources to target data structures
2. Extracting data
3. Cleansing and transforming the data
4. Loading data into the data mart
5. Creating and storing metadata
Accessing
This step involves putting the data to use:
• querying the data,
• analyzing it,
• creating reports,
• charts and
• graphs and publishing.
It involves the following tasks:
Setting up an intermediate layer (meta layer) for the front-end tool to
use. This layer translates database operations and object names into
business terms so that end users can interact with the data
mart using words that relate to the business functions.
Setting up and managing database structures, like summarized tables,
that help queries submitted through the front-end tools execute rapidly
and efficiently.
Managing
This step contains managing the data mart over its lifetime. In
this step, management functions are performed as:
1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.
Difference between Data Warehouse
and Data Mart
Data Warehouse | Data Mart
A data warehouse is a vast repository of information collected from various organizations or departments within a corporation. | A data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.
It may hold multiple subject areas. | It holds only one subject area, for example, Finance or Sales.
It holds very detailed information. | It may hold more summarized data.
Works to integrate all data sources. | Concentrates on integrating data from a given subject area or set of source systems.
In data warehousing, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used.
It is a centralized system. | It is a decentralized system.
Data warehousing is data-oriented. | A data mart is project-oriented.

Three-Tier Data Warehouse
Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:
• Bottom Tier (Data Warehouse Server)
• Middle Tier (OLAP Server)
• Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is almost
always an RDBMS. It may include several specialized data marts and a
metadata repository.
Data from operational databases and external sources (such as user profile data
provided by external consultants) are extracted using application program
interfaces known as gateways. A gateway is provided by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE
DB (Object Linking and Embedding for Databases) by Microsoft,
and JDBC (Java Database Connectivity).
Middle-Tier Data Warehouse
Architecture
A middle-tier which consists of an OLAP server for fast querying of
the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational
DBMS that maps functions on multidimensional data to standard
relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-
purpose server that directly implements multidimensional data
and operations.
A top-tier that contains front-end tools for displaying results provided
by OLAP, as well as additional tools for data mining of the OLAP-
generated data.
Metadata Repository
The metadata repository stores information that defines DW objects. It
includes the following parameters and information for the middle and the
top-tier applications:
A description of the DW structure, including the warehouse schema,
dimension, hierarchies, data mart locations, and contents, etc.
Operational metadata, which usually describes the currency level of the
stored data, i.e., active, archived or purged, and warehouse monitoring
information, i.e., usage statistics, error reports, audit, etc.
System performance data, which includes indices, used to improve data
access and retrieval performance.
Information about the mapping from operational databases, which provides
source RDBMSs and their contents, cleaning and transformation rules, etc.
The algorithms used for summarization, predefined queries and reports,
and business metadata, which include business terms and definitions,
ownership information, etc.
Principles of Data Warehousing
1. Load Performance
Data warehouses require incremental loading of new data on a periodic basis
within narrow time windows; performance of the load process should be
measured in hundreds of millions of rows and gigabytes per hour and must
not artificially constrain the volume of data required by the business.
2. Load Processing
Many steps must be taken to load new or updated data into the data
warehouse, including data conversion, filtering, reformatting, indexing,
and metadata update.
3. Data Quality Management
Fact-based management demands the highest data quality. The warehouse
must ensure local consistency, global consistency, and referential integrity
despite "dirty" sources and massive database size.
4. Query Performance
Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must be completed in
seconds, not days.
5. Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range
from a few gigabytes to hundreds of gigabytes, and terabyte-sized data
warehouses are increasingly common.
What is Meta Data?
Metadata is data about the data or documentation about the information which is
required by the users. In data warehousing, metadata is one of the essential
aspects. Metadata includes the following:
• The location and descriptions of warehouse systems and components.
• Names, definitions, structures, and content of data-warehouse and end-users
views.
• Identification of authoritative data sources.
• Integration and transformation rules used to populate data.
• Integration and transformation rules used to deliver information to end-user
analytical tools.
• Subscription information for information delivery to analysis subscribers.
• Metrics used to analyze warehouses usage and performance.
• Security authorizations, access control list, etc.
Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content of the
warehouse and find the data they need.
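As a concrete illustration, the kinds of metadata listed above can be pictured as simple records. The following Python sketch is only an assumption about shape: the table name `sales_fact`, its columns, and the `describe` helper are hypothetical, not part of any real metadata standard.

```python
# A minimal sketch of warehouse metadata records as Python dicts.
# All names, columns, and rules below are illustrative assumptions.
metadata_repository = {
    "sales_fact": {
        "location": "warehouse_db.sales_fact",   # where the object lives
        "description": "Daily sales facts by product and store",
        "columns": {"date_key": "INT", "store_key": "INT",
                    "product_key": "INT", "amount": "DECIMAL(10,2)"},
        "source": "orders OLTP system",          # authoritative data source
        "transformation": "amounts converted to USD; null keys dropped",
        "currency_level": "active",              # operational metadata
    },
}

def describe(table):
    """Build a human-readable summary from the metadata record."""
    m = metadata_repository[table]
    return f"{table} ({m['location']}): {m['description']}"

print(describe("sales_fact"))
```

A real metadata repository would of course live in the warehouse itself, but the idea is the same: data describing the location, structure, lineage, and currency of the warehouse data.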
Why is metadata necessary in a
data warehouses?
First, it acts as the glue that links all parts of the data warehouses.
Next, it provides information about the contents and structures to the
developers.
Finally, it opens the doors to the end-users and makes the contents
recognizable in their terms.
Types of Metadata

Metadata in a data warehouse fall into three major parts:


• Operational Metadata: contains all of the
information about the operational data sources.
• Extraction and Transformation Metadata: contains
information about all the data transformations that take place
in the data staging area.
• End-User Metadata: allows end users to
use their own business terminology and look for information in the
ways in which they usually think of the business.
ETL (Extract,
Transform, and
Load) Process

The mechanism of
extracting information from
source systems and bringing
it into the data warehouse is
commonly called ETL,
which stands
for Extraction,
Transformation and
Loading.
The ETL process requires
active inputs from various
stakeholders, including
developers, analysts, testers,
top executives and is
technically challenging.
Extraction

● Extraction is the operation of extracting information from a source
system for further use in a data warehouse environment. This is the first
stage of the ETL process.
● The extraction process is often one of the most time-consuming tasks in
ETL.
● The source systems might be complicated and poorly documented, and
thus determining which data needs to be extracted can be difficult.
● The data has to be extracted several times in a periodic manner to supply
all changed data to the warehouse and keep it up to date.

Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.

The following examples show the importance of data cleansing:

● If an enterprise wishes to contact its users or its suppliers, a complete, accurate
and up-to-date list of contact addresses, email addresses and telephone numbers
must be available.
● If a client or supplier calls, the staff responding should be able to quickly find the
person in the enterprise database, but this requires that the caller's name or his/her
company name is listed in the database.
● If a user appears in the database with two or more slightly different names or
different account numbers, it becomes difficult to update the customer's
information.
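A minimal sketch of the rectification and homogenization idea in Python; the typo dictionary, synonym table, and sample records are invented for illustration, not taken from any real ETL tool:

```python
# Dictionary-based cleansing: rectify typing mistakes, then
# homogenize synonyms to one canonical form. All data is made up.
TYPO_DICTIONARY = {"Nwe York": "New York", "Lodnon": "London"}
SYNONYMS = {"NY": "New York", "NYC": "New York"}

def cleanse_city(raw):
    city = raw.strip()
    city = TYPO_DICTIONARY.get(city, city)  # rectification
    return SYNONYMS.get(city, city)         # homogenization

records = [
    {"name": "A. Smith", "city": "NYC"},
    {"name": "A. Smith", "city": "Nwe York"},  # same person, typo in city
]
cleaned = [{**r, "city": cleanse_city(r["city"])} for r in records]

# After cleansing, both rows agree, so the duplicate can be detected.
unique = {(r["name"], r["city"]) for r in cleaned}
print(unique)   # {('A. Smith', 'New York')}
```

This directly illustrates the third example above: without cleansing, the two slightly different spellings would be treated as two different customers.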
Transformation
Transformation is the core of the reconciliation phase. It converts records from
their operational source format into a particular data warehouse format. If we
implement a three-layer architecture, this phase outputs our reconciled data layer.

The following points must be rectified in this phase:

● Loose text may hide valuable information. For example, the string "XYZ PVT Ltd"
does not explicitly show that it refers to a private limited company.
● Different formats can be used for individual data items. For example, a date can
be saved as a string or as three integers.

Transformation processes include:

● Conversion and normalization, which operate on both storage formats and units
of measure to make data uniform.
● Matching, which associates equivalent fields in different sources.
● Selection, which reduces the number of source fields and records.
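The format-conversion points above (a date stored as a string versus as three integers, and differing units of measure) can be sketched as follows; the field names and conversion factors are illustrative assumptions:

```python
# Two small transformations: convert a date string into three
# integers, and normalize weights to one unit so data is uniform.
def split_date(date_str):
    """'2024-03-15' -> (2024, 3, 15): one string becomes three integers."""
    year, month, day = (int(part) for part in date_str.split("-"))
    return year, month, day

GRAMS_PER_UNIT = {"kg": 1000, "g": 1, "lb": 453.59237}

def to_grams(value, unit):
    """Normalize a weight to grams, whatever unit the source used."""
    return value * GRAMS_PER_UNIT[unit]

print(split_date("2024-03-15"))   # (2024, 3, 15)
print(to_grams(2, "kg"))          # 2000
```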
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.

Loading can be carried out in two ways:

1. Refresh: The data warehouse data is completely rewritten, meaning that the
older data is replaced. Refresh is usually used in combination with static extraction
to populate a data warehouse initially.

2. Update: Only the changes applied to the source information are added to the
data warehouse. An update is typically carried out without deleting or modifying
pre-existing data. This method is used in combination with incremental extraction
to update data warehouses regularly.
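The two load modes can be contrasted in a small Python sketch, using plain lists to stand in for warehouse tables (the row shapes are invented for illustration):

```python
# Refresh: rewrite the target completely, replacing older data.
# Update: append only the changes, leaving pre-existing rows intact.
warehouse = [{"id": 1, "amount": 100}]

def refresh(target, source_rows):
    """Refresh load: the older content is replaced entirely."""
    target.clear()
    target.extend(source_rows)

def update(target, changed_rows):
    """Update load: add only new rows; existing rows stay untouched."""
    existing_ids = {row["id"] for row in target}
    target.extend(r for r in changed_rows if r["id"] not in existing_ids)

refresh(warehouse, [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}])
update(warehouse, [{"id": 3, "amount": 75}])
print([row["id"] for row in warehouse])   # [1, 2, 3]
```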


Business requirements

● First, you need to understand business and client
objectives. You need to define what your client wants
(which many times even they do not know themselves).
● Take stock of the current data mining scenario. Factor
resources, assumptions, constraints, and other
significant factors into your assessment.
● Using the business objectives and the current scenario, define
your data mining goals.
● A good data mining plan is very detailed and should be
developed to accomplish both business and data mining
goals.
Contents of business requirements
● Purpose, In Scope, Out of Scope, Target
Audiences
● Use Case diagrams
● Data Requirements
● Non Functional Requirements
● Interface Requirements
● Limitations
● Risks
● Assumptions
● Reporting Requirements
● Checklists
Dimensional Modeling

● Dimensional Modeling (DM) is a data structure technique
optimized for data storage in a data warehouse.
● The purpose of dimensional modeling is to optimize the
database for faster retrieval of data.
● The concept of dimensional modeling was developed by
Ralph Kimball and consists of "fact" and "dimension"
tables.
A data warehouse is usually modelled by a
multidimensional data structure, called a data cube,
in which each dimension corresponds to an
attribute or a set of attributes in the schema, and
each cell stores the value of some aggregate measure
such as count or sum(sales_amount).
A data cube provides a multidimensional view of data
and allows the precomputation and fast access of
summarized data.
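The cube idea can be sketched in plain Python: each cuboid is one group-by over a chosen subset of dimensions, and the full cube is the set of all such group-bys. The sales rows and dimension names below are made-up sample data:

```python
# Each cell of a cuboid stores an aggregate (here sum of sales_amount)
# for one combination of dimension values. Data is illustrative.
from collections import defaultdict
from itertools import combinations

sales = [  # (location, item, quarter, sales_amount)
    ("Delhi", "TV", "Q1", 400),
    ("Delhi", "TV", "Q2", 350),
    ("Mumbai", "TV", "Q1", 500),
    ("Mumbai", "Phone", "Q1", 300),
]
DIMENSIONS = ("location", "item", "quarter")

def cuboid(dims):
    """Aggregate sum(sales_amount) grouped by the chosen dimensions."""
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    cells = defaultdict(int)
    for row in sales:
        key = tuple(row[index[d]] for d in dims)
        cells[key] += row[3]
    return dict(cells)

# Precomputing every cuboid (the full cube) means one group-by per
# subset of dimensions: 2^3 = 8 cuboids for three dimensions.
cube = {dims: cuboid(dims) for n in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, n)}

print(cuboid(("location",)))   # {('Delhi',): 750, ('Mumbai',): 800}
print(cube[()][()])            # 1550 (the apex cuboid: total sales)
```

The empty-tuple cuboid is the apex of the cube (total over everything); the full-dimension cuboid is the base, at the finest level of detail.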
Elements of Dimensional Data
Model
Fact
Facts are the measurements/metrics or facts from your business
process. For a Sales business process, a measurement would be
quarterly sales number
Dimension
Dimension provides the context surrounding a business process event.
In simple terms, they give who, what, where of a fact. In the Sales
business process, for the fact quarterly sales number, dimensions
would be
● Who – Customer Names
● Where – Location
● What – Product Name
Elements of Dimensional Data Model
Attributes
The Attributes are the various characteristics of the dimension in dimensional
data modeling.
In the Location dimension, the attributes can be
● State
● Country
● Zipcode etc.
Attributes are used to search, filter, or classify facts. Dimension Tables
contain Attributes
Fact Table
A fact table is the primary table in a dimensional model.
A Fact Table contains
1. Measurements/facts
2. Foreign key to dimension table
Elements of Dimensional Data Model

Dimension Table
● A dimension table contains the dimensions of a fact.
● They are joined to the fact table via a foreign key.
● Dimension tables are de-normalized tables.
● The dimension attributes are the various columns in a dimension table.
● Dimensions offer descriptive characteristics of the facts with the help of their attributes.
● There is no set limit on the number of dimensions.
● A dimension can also contain one or more hierarchical relationships.
Types of Dimensions in Data Warehouse
Following are the types of dimensions in a data warehouse:
● Conformed Dimension
● Outrigger Dimension
● Shrunken Dimension
● Role-playing Dimension
● Dimension to Dimension Table
● Junk Dimension
● Degenerate Dimension
● Swappable Dimension
● Step Dimension
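One of these types, the role-playing dimension, is easy to demonstrate: a single physical date dimension "plays" several roles in the fact table (order date and ship date) by being joined twice under different aliases. The schema below is a hypothetical sketch using Python's built-in sqlite3, not a schema from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One physical date dimension ...
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT)")
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(1, "2023-01-05"), (2, "2023-01-09")])

# ... referenced twice by the fact table, once per role
cur.execute("""CREATE TABLE fact_orders (
    order_date_key INTEGER REFERENCES dim_date,
    ship_date_key  INTEGER REFERENCES dim_date,
    amount REAL)""")
cur.execute("INSERT INTO fact_orders VALUES (1, 2, 500.0)")

# The dimension plays both roles via table aliases in the join
cur.execute("""SELECT od.full_date AS ordered, sd.full_date AS shipped, f.amount
               FROM fact_orders f
               JOIN dim_date od ON f.order_date_key = od.date_key
               JOIN dim_date sd ON f.ship_date_key  = sd.date_key""")
print(cur.fetchall())  # [('2023-01-05', '2023-01-09', 500.0)]
```

The same pattern applies to any dimension reused under multiple roles, such as a single employee dimension serving as both "salesperson" and "manager".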
Steps of Dimensional Modelling
Identify the Business Process
Identifying the actual business process a data warehouse
should cover. This could be Marketing, Sales, HR, etc. as
per the data analysis needs of the organization.
The selection of the Business process also depends on the
quality of data available for that process.
It is the most important step of the Data Modelling process,
and a failure here would have cascading and irreparable
defects.
To describe the business process, you can use plain text or
use basic Business Process Modelling Notation (BPMN) or
Unified Modelling Language (UML).
Identify the Grain

The grain describes the level of detail for the business problem/solution. It is the process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales data for every day, then it has daily granularity. If a table contains total sales data for each month, then it has monthly granularity.
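The effect of choosing a grain can be sketched in a few lines (the daily sales figures are hypothetical): a daily-grain table keeps one row per day, while rolling up to monthly grain keeps one row per month and discards the daily detail.

```python
from collections import defaultdict

# Hypothetical daily-grain fact rows: (date, sales)
daily = [
    ("2023-01-01", 100),
    ("2023-01-02", 150),
    ("2023-02-01", 200),
]

# Rolling up to monthly grain: one row per month; the daily detail is gone
monthly = defaultdict(int)
for date, sales in daily:
    month = date[:7]            # "YYYY-MM"
    monthly[month] += sales

print(dict(monthly))  # {'2023-01': 250, '2023-02': 200}
```

This is why the grain decision matters: you can always aggregate upward from a fine grain, but you can never recover daily numbers from a table stored at monthly grain.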
Identify the Dimensions

Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data should be stored. For example, the date dimension may contain data like year, month, and weekday.
Identify the Fact

This step is co-associated with the business users of the system because this is where they get access to data stored in the data warehouse. Most of the fact table rows are numerical values like price or cost per unit, etc.
Build Schema
In this step, you implement the Dimension Model. A schema is nothing but the database structure (arrangement of tables). There are two popular schemas:
1. Star Schema
The star schema architecture is easy to design. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star is the fact table, and the points of the star are the dimension tables.
The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.
2. Snowflake Schema
The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to additional dimension tables.
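A minimal star schema can be sketched with Python's built-in sqlite3 (the table and column names are invented for illustration): de-normalized dimension tables at the points of the star, and a fact table at the center holding the measures plus foreign keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: de-normalized, all attributes in one table each
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, "
            "name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, "
            "city TEXT, country TEXT)")

# Fact table: measurements plus foreign keys to the dimensions (the star center)
cur.execute("""CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product,
    location_key INTEGER REFERENCES dim_location,
    quantity INTEGER,
    amount   REAL)""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
cur.execute("INSERT INTO dim_location VALUES (1, 'Kathmandu', 'Nepal')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 2, 2400.0)")

# A typical star-join query: facts grouped by dimension attributes
cur.execute("""SELECT p.category, l.country, SUM(f.amount)
               FROM fact_sales f
               JOIN dim_product  p ON f.product_key  = p.product_key
               JOIN dim_location l ON f.location_key = l.location_key
               GROUP BY p.category, l.country""")
print(cur.fetchall())  # [('Electronics', 'Nepal', 2400.0)]
```

In a snowflake variant, dim_product would be split further (e.g. a separate category table referenced by a category_key), trading the simpler one-hop join for reduced redundancy.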
Information Packages

It is a new idea for determining and recording information requirements for a data warehouse.
● Determining requirements for a data warehouse is based on business dimensions.
● The relevant dimensions and the measurements in those dimensions are captured and kept in the data warehouse.
● This creates an information package for a specific subject.
Benefits of Information Packages
1. Define the common subject areas.
2. Design key business metrics.
3. Decide how data must be presented.
4. Determine how users will aggregate or roll up.
5. Decide the data quantity for user analysis or query.
6. Decide how data will be accessed.
7. Establish data granularity.
8. Estimate data warehouse size.
9. Determine the frequency for data refreshing.
10. Ascertain how information must be packaged.
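One way to record an information package in machine-readable form is a simple mapping of the subject's business dimensions (with their hierarchies) to the measured facts. This is a sketch; the subject, dimensions, and measures shown are invented for illustration.

```python
# A hypothetical information package for the subject "Sales Analysis"
info_package = {
    "subject": "Sales Analysis",
    "dimensions": {
        # each dimension lists its hierarchy, top level first
        "Time":     ["Year", "Quarter", "Month", "Day"],
        "Location": ["Country", "State", "City"],
        "Product":  ["Category", "Brand", "Product Name"],
    },
    "measures": ["sales_amount", "units_sold", "cost"],
}

# The package answers design questions directly; for example, the data
# granularity is the lowest level of each dimension hierarchy:
grain = {dim: levels[-1] for dim, levels in info_package["dimensions"].items()}
print(grain)  # {'Time': 'Day', 'Location': 'City', 'Product': 'Product Name'}
```

Captured this way, a package maps almost one-to-one onto the later dimensional model: each dimension becomes a dimension table and each measure becomes a fact table column.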
An Information Package
Business dimensions

In the requirements collection phase, the end users can provide the measurements which are important to their department.
They can also give insights into combining the various pieces of information for strategic decision making.
Managers think of business in terms of business dimensions.
The managers try to evaluate the business in different dimensions.
Requirements Definition Document Outline
1. Introduction. State the purpose and scope of the project. Include broad project
justification. Provide an executive summary of each subsequent section.
2. General requirements descriptions. Describe the source systems reviewed.
Include interview summaries. Broadly state what types of information requirements are
needed in the data warehouse.
3. Specific requirements. Include details of source data needed. List the data
transformation and storage requirements. Describe the types of information delivery
methods needed by the users.
4. Information packages. Provide as much detail as possible for each information
package. Include in the form of package diagrams.
5. Other requirements. Cover miscellaneous requirements such as data extract
frequencies, data loading methods, and locations to which information must be
delivered.
6. User expectations. State the expectations in terms of problems and opportunities.
Indicate how the users expect to use the data warehouse.
7. User participation and sign-off. List the tasks and activities in which the users are
expected to participate throughout the development life cycle.
8. General implementation plan. At this stage, give a high-level plan for
implementation.
Data Gathering Methods

The key guiding question to ask when selecting methods is therefore: “How can I most effectively and efficiently get the data I need to make good decisions?”
The 5 most common methods for data gathering are:
(a) Document reviews
(b) Interviews
(c) Focus groups
(d) Surveys
(e) Observation or testing
Basic Principles

Here are some basic principles to keep in mind when selecting methods:
1. Consider the characteristics of your target population.
2. Use more than one method.
3. Consider time, costs, and other constraints.
Surveys / questionnaire
Surveys are often used to collect numerical data. For example, when
respondents select numbers on a scale to answer questions. They can
also include open questions that require writing an answer. Which to use
depends on different conditions, like level of education.
Respondents with a limited education may not be able to write longer
answers. Interviews can be used to select questions for a survey. After conducting a few interviews, typical answers or comments are identified and converted into questions for a survey.
Example: during some interviews, employees indicate that they don’t have the tools they need to perform well. This information could become the following survey question: “Select a number on a scale of 1 to 5 that best reflects how you feel about your work tools (1 = Not available; 5 = We have all that we need).”
Advantages and Disadvantages
Advantages
(a) Can reach a large population.
(b) Many possible variations in their design and use.
(c) Can be completed anonymously.
(d) Can be made easy to complete.
(e) Can be used to gather quantitative and qualitative data.
Disadvantages
(a) Can be difficult and time consuming to develop.
(b) Influenced by education (reading level) and culture.
(c) Can become annoying when not focused or too long.
(d) Requires follow up to get a good response rate.
(e) Often can’t check incomplete or problematic answers.
(f) Often ignored when overused.
Knowledge and skills testing
Observation or testing is appropriate when there is a clear best way to perform a job that can be measured specifically (an existing standard).
For example, when a job requires following the same procedure or
applying similar thinking.
They can be difficult to use when jobs require adaptive behavior
(adjusting performance depending on conditions). Tests must be
checked and validated before they are used. This means trying them
with some participants, making adjustments and then using them
with all other participants. Tests must be short and specific, with
only as many questions as needed to assess the whole performance.
They must also reflect real job conditions.
Advantages:
1. Relatively easy to prepare when a good job task analysis is
available.
2. Focused on job performance.
3. Can take advantage of available subject matter expertise.
4. Results are more easily quantifiable and comparable.
Disadvantages:
1. Tests are often intimidating or stressful, and can affect
performance.
2. Good tests require more work to develop than often
expected.
3. Essential knowledge and skills must be clearly reflected to
be valid.
4. Can disrupt normal activities.
5. Tests measure the ability to perform, not the motivation to
do so.
Observation

Observations are about visually confirming what’s going on. They can be highly structured and use checklists, for example, to rate what is observed or to confirm the procedure followed or the tools used.
Observations can also be open and rely on notes from experts
familiar with the job and with doing observations.
Good notes are essential: the quality of note taking directly
affects the quality of the data gathered. As a rule of thumb, use
formal observation (with check lists, for example) to check job
performance and less formal ones to investigate general behavior
(to assess motivation, for example).
Advantages
1. Allows investigating work under real conditions.
2. Can be discreet and conducted without disrupting work.
3. Allows seeing actual performance rather than what is reported.
4. Allows uncovering unexpected issues that must be addressed.
5. Takes interaction, collaboration or teamwork into account.

Disadvantages
1. Time consuming with larger groups.
2. Influenced by the quality of observations and note taking. Observers
must be trained and use good instruments to record what they observe.
3. The results of one observation cannot be generalized to other
observations (individual performances). More observations are therefore
needed to confirm how more employees perform.
4. Can be difficult to set up (e.g. permissions or scheduling).
5. Being observed can change how some perform so that what is
observed does not reflect typical performance.
6. Some may refuse to be observed or be uncomfortable and resistant.
Document review / critical incident analysis
This method involves finding and reviewing documents ranging
from letters of complaint, industry reports, policy documents or
more strategic ones, to better understand the problem.
The critical incident report typically describes an important event
that is a problem or that otherwise negatively affects the
organization. For example, reports about accidents or
emergencies.
Unless documents can’t be found or are clearly not useful, every
TNA should identify and review relevant documents. Whenever
possible, the information obtained from documents should be
validated and any serious difference between what different
sources report should be reconciled.
Advantages
1. Uses existing information.
2. Less influenced by changes or unforeseen circumstances.
3. Unobtrusive: no need to disrupt work underway.
4. Allows identifying job performance standards that can be applied locally.
5. Can provide leads to explore (people to interview, for example).
6. Can provide a historical perspective to better understand current events.
7. Considers both internal and external documents.
Disadvantages
1. Available documents are not always good sources of information. Better
documents may not be available (or shared).
2. Can be time consuming to review all documents.
Critical incident
Advantages
1. Can provide insight into the causes of problems.
2. Reports real events.
Disadvantages
1. Must be well reported to be useful. Bad reports can be misleading.
2. Can be difficult to analyze and understand after the fact.
3. May require consulting experts to confirm findings.
Interviews
Interviews are one-on-one conversations to explore ideas, opinions, values
or other points of view. Some interviews can be quite structured, use
specific questions and record answers in non-equivocal terms (like yes/no).
Other interviews are more open and allow exploring issues as they arise.
Regardless of the approach used, it is essential to take good notes that truly
reflect the interview.
Interviews are particularly useful to,
• Investigate issues in depth.
• Explore ideas, opinions and attitudes.
• Explore sensitive topics that some may not want to discuss in public.
Interviews alone are not always effective to explore issues that affect larger
groups. Samples can be used instead if they represent the population well.
For example, interviewing 5 employees that represent well the
characteristics of a larger group of 20 may be enough to identify important
issues.
Advantages
1. Allows for face-to-face contact and observing behavior.
2. Allows exploring and clarifying opinions, or dealing with the
unexpected.
3. Helps engage participants in the TNA (Training Need Analysis)
process.
4. Helps explore / confirm other data / information (for example, the
information obtained from documents).
Disadvantages
1. Can be time consuming and depend on the availability of individuals.
2. Individuals can’t always identify or express true needs.
3. Some may use this opportunity to vent frustrations or discuss other
issues.
4. Interviewers must be skilled and well prepared.
5. Interviewing many can be time consuming and expensive.
6. Requires careful sampling when dealing with a large population.
7. Interviewers sometimes ‘take over’ and negatively affect the interview.
Focus groups

Focus groups are essentially group interviews. They are structured and led differently than interviews, but yield similar data. Focus groups are particularly useful to,
(a) Engage a group in generating, discussing and refining ideas.
(b) Confirm group opinions, values and tendencies.
(c) Explore topics more deeply than can be done during individual interviews.
(d) Create ownership of problems and solutions.
(e) Reach more participants than possible with interviews alone.
Advantages
1. Allows interviewing more individuals within a limited amount of time.
2. Allows participants to discuss important issues with their peers.
3. Helps with team building by shifting the focus from the individual to the group.
4. Allows comparing and sifting through ideas towards consensus.
Disadvantages
1. Time consuming and subject to the availability of
individuals.
2. Can lead to conflict (if not well facilitated) or affected by
existing conflicts between individuals or groups.
3. Not everyone wants to discuss issues with others (or share information).
4. Requires a skilled group leader to manage group dynamics and achieve good results.
Joint application development (JAD)

Joint application development (JAD) techniques were successfully utilized to gather requirements for operational systems in the 1980s. Users of computer systems had grown to be more computer-savvy, and their direct participation in the development of applications proved to be very useful.
● JAD is a joint process, with all the concerned groups getting together for a well-defined purpose.
● It is a methodology for developing computer applications jointly by the users and the IT professionals in a well-structured manner.
● JAD centers around discussion workshops lasting a certain number of days under the direction of a facilitator.
● Under suitable conditions, the JAD approach may be adapted for building a data warehouse.
JAD consists of a five-phased approach:
1. Project Definition
● Complete high-level interviews
● Conduct management interviews
● Prepare management definition guide
2. Research
● Become familiar with the business area and systems
● Document user information requirements
● Document business processes
● Gather preliminary information
● Prepare agenda for the sessions
3. Preparation
● Create working document from previous phase
● Train the scribes
● Prepare visual aids
● Conduct pre-session meetings
● Set up a venue for the sessions
● Prepare checklist for objectives
4. JAD Sessions
● Open with review of agenda and purpose
● Review assumptions
● Review data requirements
● Review business metrics and dimensions
● Discuss dimension hierarchies and roll-ups
● Resolve all open issues
● Close sessions with lists of action items
5. Final Document
● Convert the working document
● Map the gathered information
● List all data sources
● Identify all business metrics
● List all business dimensions and hierarchies
● Assemble and edit the document
● Conduct review sessions
● Get final approvals
● Establish procedure to change requirements
Advantages of Joint Application Development:
These are some of the key benefits of JAD:
1. Produces a design from the customer’s perspective.
2. The teamwork between company and client helps to remove all risks.
3. Due to the close interactions, progress is faster.
4. JAD helps to accelerate design and also to enhance quality.
5. JAD encourages the team to push each other, which leads them to work faster and also to deliver on time.
There are many key stakeholders involved in the JAD process. These are:
1. Execution Process: This role is from the customer’s side and includes the Project Manager, CIO, CEO or CISO, who has the power to make decisions regarding the project.
2. Facilitator: This individual is responsible for creating, managing and executing the JAD activities, minimizing disagreements, encouraging end-user involvement, and maintaining focus and an unbiased approach.
3. IT Representatives: This individual is responsible for giving technical advice, helping the team develop technical models, and building the prototype of the end result. They must approach and support the customers in turning their visualizations into models as per the requirements, develop an understanding of the end-user business goals, represent IT functions, and render end solutions which are affordable in nature.
4. End-User: This concerned person is usually the main focus of JAD. They offer proper business knowledge and strategy, illustrate all key user groups who are affected by development, and represent multiple levels within the organization.
5. Scribe: This person is responsible for documenting the JAD process and JAD sessions precisely and effectively. They generally serve as a partner to the facilitator in each JAD session and provide a reference for the review.

6. Observer: The observer will observe each JAD session, gather knowledge of end-user needs and of JAD session decisions, and interact with JAD participants outside the JAD sessions.