DWDM Unit 1
Data Mining
BCA VI SEM
UNIT 1
Syllabus
Database
A database is an organized collection of structured data that is stored
electronically. Structured data follows a pre-established format, which
makes it easier to query and assess. In a relational database, structured
data is held in tables, with defined relationships between the rows and
columns.
Many dynamic webpages on the Internet today use databases to store their
content. In this way, data can be handled, updated, controlled, and
organized effectively. Most databases use Structured Query Language
(SQL) for both creating and retrieving data. Consider Facebook: it must be
able to store, modify, and display information about users, their contacts,
member actions, communications, ads, and a variety of other things. In
such cases, databases are crucial for storing data efficiently.
DataBase System
A database system is the traditional way of storing and retrieving
data. The major task of a database system is to perform query
processing. These systems are generally referred to as online transaction
processing (OLTP) systems. They are used for the day-to-day operations of
any organization.
Introduction to Data Warehousing
➔ The term "Data Warehouse" was first coined by Bill
Inmon in 1990.
➔ According to Inmon, a data warehouse is a
subject-oriented, integrated, time-variant, and nonvolatile
collection of data.
➔ A data warehouse refers to a data repository that is
maintained separately from an organization’s
operational databases.
➔ A Data Warehouse (DW) is a relational database that is
designed for query and analysis rather than transaction
processing. It includes historical data derived from
transaction data from one or more sources.
Database System vs. Data Warehouse
THE COMPELLING NEED FOR DATA
WAREHOUSING
In the 1990s, as businesses grew more complex,
corporations spread globally, and competition became
fiercer, business executives became desperate for
information to stay competitive and improve the
bottom line. The operational computer systems did
provide information to run the day-to-day operations, but
what the executives needed were different kinds of
information that could be readily used to make strategic
decisions.
Organizations achieve competitive
advantage:
DATA WAREHOUSING—THE ONLY
VIABLE SOLUTION
● The type of information needed for strategic
decision making is different from that available
from operational systems.
● A DW is a subject-oriented, integrated,
time-variant and non-volatile collection of data
in support of management’s decision making
process.
Characteristics/features of DW
Data Warehouse-Subject-Oriented
➔ Organized around major subjects, such as customer,
product, sales.
➔ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
➔ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
Data Warehouse—Integrated
➔ Constructed by integrating multiple, heterogeneous data sources
◆ Relational databases, flat files, on-line transaction records
➔ Data cleaning and data integration techniques are applied.
◆ Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
● E.g., hotel price: currency, tax, breakfast covered, etc.
◆ When data is moved to the warehouse, it is converted.
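The hotel price example can be sketched in Python as follows. This is only an illustration under stated assumptions: the two source layouts, the field names, the exchange rate, and the tax rate are all invented for the example and do not come from any real system.

```python
from dataclasses import dataclass

# Assumed, illustrative figures (not real rates).
USD_PER_EUR = 1.10
TAX_RATE = 0.10


@dataclass
class HotelPrice:
    """Warehouse format: price in USD, tax included, breakfast flag explicit."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(rec: dict) -> HotelPrice:
    # Hypothetical source A: USD, tax already included, breakfast as "Y"/"N".
    return HotelPrice(rec["name"], round(rec["price"], 2), rec["bkfst"] == "Y")


def from_source_b(rec: dict) -> HotelPrice:
    # Hypothetical source B: EUR, tax excluded, breakfast as a boolean.
    usd = rec["rate_eur"] * USD_PER_EUR * (1 + TAX_RATE)
    return HotelPrice(rec["hotel_name"], round(usd, 2), rec["breakfast"])


integrated = [
    from_source_a({"name": "Sea View", "price": 120.0, "bkfst": "Y"}),
    from_source_b({"hotel_name": "Alpine Inn", "rate_eur": 100.0, "breakfast": False}),
]
for row in integrated:
    print(row)
```

After conversion, both records use the same currency, tax treatment, and breakfast encoding, so they can be compared directly.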
Data Warehouse—Time Variant
➔ The time horizon for the data warehouse is
significantly longer than that of operational systems
◆ Operational database: current value data
◆ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
➔ Every key structure in the data warehouse
◆ Contains an element of time, explicitly or implicitly
◆ But the key of operational data may or may not
contain a "time element"
Data Warehouse-Nonvolatile
➔ A physically separate store of data transformed from the
operational environment
➔ Operational update of data does not occur in the data
warehouse environment
◆ Does not require transaction processing, recovery, and
concurrency control mechanisms
◆ Requires only two operations in data accessing: initial
loading of data and access of data
Data Granularity
Granularity is one of the main elements in the modeling of
DW data.
Granularity of data refers to its level of detail. Multiple levels of
detail may be available depending on the requirements. Many data
warehouses have at least two levels of granularity.
The relation between detail and granularity is important to
understand. Fine (low-level) granularity means greater detail and
less summarization. Coarse (high-level) granularity means less
detail and greater summarization. Operational data is stored at the
lowest level of granularity, that is, in the greatest detail.
Example of Data Granularity
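A small Python sketch of two granularity levels, assuming a hypothetical list of individual sales: the detail rows are the fine-grained level, and the monthly totals derived from them are the coarser, summarized level. The data and field layout are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fine-grained data: one row per sale (date, product, amount).
daily_sales = [
    ("2024-01-05", "pen", 10.0),
    ("2024-01-20", "pen", 15.0),
    ("2024-02-03", "pen", 20.0),
    ("2024-02-03", "ink", 5.0),
]

# Coarser granularity: roll the detail rows up to one total per (month, product).
monthly_sales = defaultdict(float)
for date, product, amount in daily_sales:
    month = date[:7]  # "YYYY-MM"
    monthly_sales[(month, product)] += amount

for key in sorted(monthly_sales):
    print(key, monthly_sales[key])
```

The summary can always be rebuilt from the detail rows, but the detail rows cannot be recovered from the summary, which is why the level of granularity has to be chosen at design time.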
DW and Data Marts
A Data Mart contains a subset of corporate-wide data that is of
value to a specific group of users. The scope is confined to
specific selected subjects, e.g., a marketing data mart may
confine its subjects to customer, item, and sales. The data
contained in data marts tends to be summarized.
In 1998 Bill Inmon stated, “the most important issue
facing the IT manager is whether to build the data
warehouse first or the data mart first”.
Approaches for designing DW/DM
○ Top-Down Approach (Dependent Data Marts): The data
warehouse is designed first, and then data marts are built on top
of the data warehouse.
https://www.geeksforgeeks.org/data-warehouse-architecture/
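A minimal sketch of a dependent data mart, using Python's built-in sqlite3 module. The table names (dw_sales, marketing_mart), columns, and rows are hypothetical; the point is only that the mart is a summarized subset derived from an already existing warehouse table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table covering every department.
conn.execute("CREATE TABLE dw_sales (dept TEXT, customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?, ?)",
    [("marketing", "Alice", "pen", 10.0),
     ("marketing", "Alice", "ink", 5.0),
     ("finance", "Bob", "ledger", 40.0)],
)

# Dependent data mart: a summarized subset built from the warehouse.
conn.execute(
    """CREATE TABLE marketing_mart AS
       SELECT customer, SUM(amount) AS total_amount
       FROM dw_sales WHERE dept = 'marketing'
       GROUP BY customer"""
)
mart_rows = conn.execute("SELECT customer, total_amount FROM marketing_mart").fetchall()
print(mart_rows)
```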
Bottom-Up Approach
Advantages / Disadvantages of Bottom-up
Approach
The advantages of this approach are:
● Faster and easier implementation of manageable pieces
● Favorable return on investment and proof of concept
● Less risk of failure
● Inherently incremental; can schedule important data marts first
● Allows project team to learn and grow
The disadvantages are:
● Each data mart has its own narrow view of data
● Permeates redundant data in every data mart
● Perpetuates inconsistent and irreconcilable data
● Proliferates unmanageable interfaces
ETL (Extraction, Transformation, Loading)
DW systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the
following functions:
Data Extraction: gathers data from multiple sources
Data Cleaning: detects and rectifies errors
Data Transformation: converts data to warehouse format
Load: sorts, summarizes, consolidates, computes views,
checks integrity, and builds indices and partitions.
Refresh: propagates the updates from the data sources to
the warehouse.
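The functions listed above can be sketched as a tiny pipeline in Python. This is a simplified illustration, not how a production ETL tool works: the CSV text, the column names, and the sales table are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real one would come from an operational system.
SOURCE_CSV = """customer,amount,currency
alice,100,USD
BOB,,USD
alice,100,USD
carol,250,usd
"""


def extract(text: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows: list[dict]) -> list[dict]:
    seen, out = set(), []
    for r in rows:
        if not r["amount"]:       # drop rows with a missing amount
            continue
        key = tuple(r.values())
        if key in seen:           # drop exact duplicates
            continue
        seen.add(key)
        out.append(r)
    return out


def transform(rows: list[dict]) -> list[tuple]:
    # Convert to the (assumed) warehouse format: lower-case names, upper-case codes.
    return [(r["customer"].lower(), float(r["amount"]), r["currency"].upper()) for r in rows]


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(SOURCE_CSV))))
loaded = conn.execute("SELECT customer, amount, currency FROM sales ORDER BY customer").fetchall()
print(loaded)
```

Keeping each function separate mirrors the list above: each step can be tested and rerun on its own, and a refresh would simply run the same chain on newly changed source rows.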
Introduction to Data Warehousing
⚫ A Data Warehouse is used for reporting and analysis
of information and stores both historical and current
data.
⚫ The data in a DW system is used for analytical
reporting, which is later used by business analysts,
sales managers, or knowledge workers for
decision-making.
Information from Data Warehousing
1. Increasing customer focus, which includes the
assessment of customer buying patterns.
2. Repositioning products and managing product
portfolios by comparing the performance of sales by
quarter, by year, and by geographical region.
3. Analyzing operations and looking for sources of
profit.
4. Managing customer relationships, making
environmental corrections, and managing the cost of
corporate assets.
Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that
includes:
➢ Bottom Tier (Data Warehouse Server)
➢ Middle Tier (OLAP Server)
➢ Top Tier (Front end Tools).
➢ A bottom-tier that consists of the Data Warehouse server,
which is almost always an RDBMS. It may include several
specialized data marts and a metadata repository.
➢ Data from operational databases and external sources (such
as user profile data provided by external consultants) are
extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be
executed at a server.
➢ Examples of gateways include ODBC (Open Database
Connectivity) and OLE DB (Object Linking and Embedding,
Database), both by Microsoft, and JDBC (Java Database
Connectivity).
⚫ A middle tier, which consists of an OLAP server for fast
querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended
relational DBMS that maps operations on multidimensional
data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a
special-purpose server that directly implements
multidimensional data and operations.
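The ROLAP idea, mapping a multidimensional operation onto ordinary relational operations, can be illustrated with a roll-up expressed as a SQL GROUP BY. The sketch below uses Python's sqlite3 module; the sales table, its two dimensions (region, quarter), and the roll_up helper are hypothetical and are not part of any OLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Q1", 100.0), ("North", "Q2", 150.0),
     ("South", "Q1", 80.0), ("South", "Q2", 120.0)],
)


def roll_up(conn: sqlite3.Connection, dimension: str) -> list[tuple]:
    """Aggregate the amount measure along one dimension using plain SQL."""
    if dimension not in {"region", "quarter"}:  # whitelist: column names cannot be bound
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (f"SELECT {dimension}, SUM(amount) FROM sales "
           f"GROUP BY {dimension} ORDER BY {dimension}")
    return conn.execute(sql).fetchall()


by_region = roll_up(conn, "region")
by_quarter = roll_up(conn, "quarter")
print(by_region)
print(by_quarter)
```

A MOLAP server would instead keep these aggregates in a multidimensional array structure rather than computing them from relational tables at query time.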
Database vs. Data Warehouse (comparison rows 2 and 7):
2. Database: The tables and joins are complicated since they are
   normalized for the RDBMS. This is done to reduce redundant data
   and to save storage space.
   Data Warehouse: The tables and joins are simple since they are
   de-normalized. This is done to minimize the response time for
   analytical queries.
7. Database: The database is the place where data is stored and
   managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where
   application data is managed for analysis and reporting purposes.
Data warehouse – The building Blocks :
Source Data Component
Source data coming into the data warehouses may be grouped into
four broad categories:
1. Production Data: This type of data comes from the different
operational systems of the enterprise. Based on the data
requirements in the data warehouse, we choose segments of the
data from the various operational systems.
2. Internal Data: In each organization, users keep their
"private" spreadsheets, reports, customer profiles, and
sometimes even departmental databases. This is the internal data,
part of which could be useful in a data warehouse.
3. Archived Data: Operational systems are mainly intended to run
the current business. In every operational system, we
periodically take the old data and store it in archived files.
4. External Data: Most executives depend on information from
external sources for a large percentage of the information they
use. They use statistics relating to their industry produced by
external agencies.
Data warehouse – The building Blocks:
Data Staging Component
⚫ After we have extracted data from various operational systems
and external sources, we have to prepare the data for storing in
the data warehouse. The extracted data coming from several
different sources needs to be changed, converted, and made ready
in a format that is suitable for querying and analysis.
⚫ We will now discuss the three primary functions that take place
in the staging area.
⚫ Data Extraction: This function has to deal with numerous data
sources. We have to employ the appropriate techniques for each
data source.
Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even greater challenges. We perform several individual tasks as part of data
transformation.
➔ First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
➔ Standardization of data components forms a large part of data transformation. Data
transformation involves many forms of combining pieces of data from different sources.
We combine data from a single source record or related data parts from many source
records.
➔ On the other hand, data transformation also involves purging source data that is not
useful and separating out source records into new combinations. Sorting and merging of
data take place on a large scale in the data staging area. When the data transformation
function ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
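Three of the staging tasks described above (default values for missing elements, elimination of duplicates across sources, and sorting/merging) can be sketched in a few lines of Python. The two source extracts, the cust_id key, and the "UNKNOWN" default are assumptions made for the illustration.

```python
# Hypothetical extracts from two source systems with an overlapping customer.
crm_rows = [
    {"cust_id": 2, "name": "Bob", "city": None},
    {"cust_id": 1, "name": "Alice", "city": "Pune"},
]
billing_rows = [
    {"cust_id": 1, "name": "Alice", "city": "Pune"},
    {"cust_id": 3, "name": "Carol", "city": "Delhi"},
]

DEFAULT_CITY = "UNKNOWN"  # assumed default for a missing data element

merged = {}
for row in crm_rows + billing_rows:
    cleaned = dict(row, city=row["city"] or DEFAULT_CITY)  # fill missing values
    merged.setdefault(cleaned["cust_id"], cleaned)         # keep first copy, drop duplicates

staged = sorted(merged.values(), key=lambda r: r["cust_id"])  # sort before loading
for r in staged:
    print(r)
```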
Data Loading: Two distinct categories of tasks form data loading functions. When we
complete the structure and construction of the data warehouse and go live for the first time,
we do the initial loading of the information into the data warehouse storage. The initial load
moves high volumes of data using up a substantial amount of time.
Data warehouse – The building Blocks :
Data Storage Component
● Data storage for the data warehouse is a separate
repository.
● The data repositories for the operational systems
generally include only the current data.
● Also, these data repositories hold data structured in
highly normalized form for fast and efficient
processing.
Data warehouse – The building Blocks :
Information Delivery Component
The information delivery component enables the process of
subscribing to data warehouse files and having them transferred
to one or more destinations according to some customer-specified
scheduling algorithm.
Data warehouse – The building Blocks :
Metadata Component
⚫ Metadata in a data warehouse is similar to the data
dictionary or the data catalog in a database
management system.
⚫ In the data dictionary, we keep the data about the
logical data structures, the data about the records and
addresses, the information about the indexes, and so
on.
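To make the data dictionary idea concrete, the sketch below reads SQLite's own catalog from Python. SQLite is used only as a convenient stand-in for a DBMS; the customer table and its index are hypothetical. The sqlite_master table and the PRAGMA table_info command are SQLite's built-in catalog facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")

# The catalog lists every table and index the database knows about.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
print(catalog)

# Column-level metadata for one table: (cid, name, type, notnull, default, pk).
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
column_types = [(c[1], c[2]) for c in columns]
print(column_types)
```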
Data warehouse – The building Blocks :
Data Marts
⚫ A data mart includes a subset of corporate-wide data that is of
value to a specific group of users.
⚫ The scope is confined to particular selected subjects. Data
in a data warehouse should be fairly current, but not
necessarily up to the minute, although developments in the
data warehouse industry have made standard and
incremental data dumps more achievable.
⚫ Data marts are smaller than data warehouses and usually
contain data for one part of the organization.
⚫ The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for
particular kinds of queries and reports.
Data warehouse – The building Blocks :
Management and Control Component
⚫ The management and control elements coordinate the
services and functions within the data warehouse.
⚫ These components control the data transformation and
the data transfer into the data warehouse storage.
⚫ It also moderates the data delivery to the clients.
⚫ It works with the database management systems and
ensures that data is correctly saved in the repositories.
⚫ It monitors the movement of information into the
staging area and from there into the data warehouse
storage itself.
What is Data Mart?
A Data Mart is a subset of an organizational information store,
generally oriented to a specific purpose or primary data subject,
which may be distributed to support business needs. Data
Marts are analytical record stores designed to focus on
particular business functions for a specific community within
an organization. Data marts are usually derived from subsets of
data in a data warehouse, though in the bottom-up data warehouse
design methodology, the data warehouse is created from the
union of organizational data marts.
⚫ The fundamental use of a data mart is for Business
Intelligence (BI) applications. BI is used to gather, store,
access, and analyze records. A data mart can be used by smaller
businesses to utilize the data they have accumulated, since it
is less expensive than implementing a data warehouse.
Reasons for creating a data mart
➔ Creates a collective view of data for a group of users
➔ Easy access to frequently needed data
➔ Ease of creation
➔ Improves end-user response time
➔ Lower cost than implementing a complete data
warehouse
➔ Potential clients are more clearly defined than in a
comprehensive data warehouse
➔ It contains only essential business data and is less
cluttered.
Types of Data Marts
Data Warehouse vs. Data Mart (comparison rows 8 and 13 to 15):
8. Data Warehouse: Has a long life.
   Data Mart: Has a shorter life than a data warehouse.
13. Data Warehouse: It uses a lot of data and has comprehensive
    operational data.
    Data Mart: Operational data are not present in a data mart.
14. Data Warehouse: It collects data from various data sources.
    Data Mart: It generally stores data from a data warehouse.
15. Data Warehouse: Long time for processing the data because of
    the large volume of data.
    Data Mart: Less time for processing the data because it handles
    only a small amount of data.
⚫ ETL consists of three separate phases: extraction,
transformation, and loading.
ETL (Extract, Transform, and Load) Process
Extraction
● Extraction is the operation of extracting information from
a source system for further use in a data warehouse
environment. This is the first stage of the ETL process.
● Extraction process is often one of the most
time-consuming tasks in the ETL.
● The source systems might be complicated and poorly
documented, and thus determining which data needs to
be extracted can be difficult.
● The data has to be extracted several times in a periodic
manner to supply all changed data to the warehouse and
keep it up-to-date.
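Periodic extraction of changed data is often done by remembering when the previous run happened and pulling only rows modified since then. The sketch below assumes each source row carries an updated_at timestamp; that field, the sample rows, and the extract_changes helper are hypothetical, and real source systems may need other change-detection techniques.

```python
from datetime import datetime

# Hypothetical source rows, each carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]


def extract_changes(rows, last_run: datetime):
    """Return rows changed after the previous extraction, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_run]
    watermark = max((r["updated_at"] for r in changed), default=last_run)
    return changed, watermark


changed, watermark = extract_changes(source_rows, datetime(2024, 1, 1, 12, 0))
changed_ids = [r["id"] for r in changed]
print(changed_ids, watermark)
```

The returned watermark would be stored and passed as last_run on the next scheduled extraction.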
ETL (Extract, Transform, and Load) Process
Cleansing
● The cleansing stage is crucial in a data warehouse process because
it is supposed to improve data quality.
● The primary data cleansing features found in ETL tools are
rectification and homogenization.
● They use specific dictionaries to rectify typing mistakes and to
recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between
values.
● The following examples show why data cleaning is essential:
○ If an enterprise wishes to contact its users or its suppliers, a
complete, accurate, and up-to-date list of contact addresses, email
addresses, and telephone numbers must be available.
○ If a client or supplier calls, the staff responding should be quickly
able to find the person in the enterprise database, but this requires
that the caller's name or his/her company name is listed in the database.
○ If a customer appears in the databases with two or more slightly
different names or different account numbers, it becomes difficult to
update the customer's information.
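Dictionary-based rectification and rule-based cleansing can be sketched as below. The typo and synonym dictionaries and the 10-digit phone rule are assumptions chosen for the example; ETL tools ship far larger dictionaries and configurable rules.

```python
import re

# Assumed lookup dictionaries for the illustration.
TYPO_FIXES = {"mumbay": "mumbai", "dehli": "delhi"}
SYNONYMS = {"bombay": "mumbai", "st.": "street", "st": "street"}


def rectify(text: str) -> str:
    """Dictionary-based cleansing: fix known typos and map synonyms to one form."""
    words = text.lower().split()
    words = [TYPO_FIXES.get(w, w) for w in words]
    words = [SYNONYMS.get(w, w) for w in words]
    return " ".join(words)


def valid_phone(phone: str) -> bool:
    """Rule-based cleansing: an assumed domain rule of exactly 10 digits."""
    return re.fullmatch(r"\d{10}", phone) is not None


print(rectify("12 Main St. Bombay"))
print(rectify("12 main street Mumbay"))
print(valid_phone("9876543210"), valid_phone("98765"))
```

Both address variants are reduced to the same string, which is what makes it possible to recognize them as the same customer address later on.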
ETL (Extract, Transform, and Load) Process
Transformation
● Transformation is the core of the reconciliation phase. It
converts records from their operational source format into a
particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled
data layer.
● The following points must be rectified in this phase:
○ Loose texts may hide valuable information. For example,
XYZ PVT Ltd does not explicitly show that this is a private
limited company.
○ Different formats can be used for individual data. For
example, a date can be saved as a string or as three integers.
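As a small illustration of the format problem, the Python sketch below accepts a date stored either as a string or as three integers and converts both to one representation. The "YYYY-MM-DD" string layout and the (year, month, day) ordering are assumptions; actual sources may use other layouts.

```python
from datetime import date, datetime


def to_date(value) -> date:
    """Accept a 'YYYY-MM-DD' string or a (year, month, day) tuple of integers."""
    if isinstance(value, str):
        return datetime.strptime(value, "%Y-%m-%d").date()
    year, month, day = value
    return date(year, month, day)


from_string = to_date("2024-03-15")
from_integers = to_date((2024, 3, 15))
print(from_string, from_integers)
```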
ETL (Extract, Transform, and Load) Process
Following are the main transformation processes aimed
at populating the reconciled data layer:
● Conversion and normalization that operate on both
storage formats and units of measure to make data
uniform.
● Matching that associates equivalent fields in different
sources.
● Selection that reduces the number of source fields and
records.
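The three processes above can be sketched together in Python. The source layouts, the FIELD_MAP dictionary, and the reconcile helper are hypothetical; the pound-to-kilogram factor is the standard definition of the international pound.

```python
# Hypothetical source records with different field names and units.
source_a = [{"cust": "Alice", "weight_lb": 10.0, "fax": "n/a"}]
source_b = [{"customer_name": "Bob", "weight_kg": 2.0, "notes": "-"}]

# Matching: associate equivalent source fields with one warehouse field.
FIELD_MAP = {"cust": "customer", "customer_name": "customer"}
KG_PER_LB = 0.45359237  # definition of the international pound


def reconcile(rec: dict) -> dict:
    out = {}
    for field, value in rec.items():
        if field in FIELD_MAP:                # matching
            out[FIELD_MAP[field]] = value
        elif field == "weight_lb":            # conversion to a common unit
            out["weight_kg"] = round(value * KG_PER_LB, 3)
        elif field == "weight_kg":
            out["weight_kg"] = value
        # selection: any other field (fax, notes) is dropped
    return out


reconciled = [reconcile(r) for r in source_a + source_b]
print(reconciled)
```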
Compute-intensive transformations; small amount of data
Difference between ETL and ELT
(Table columns: Basics, ETL, ELT)
Analysis
Defining the business requirements:
Select the business process for which the dimensional model will be
designed. Based on the selection, the requirements for the business
process are gathered. A business process may require more than one
dimensional model. To select a single business process (out of all
of the possible processes that exist in a company), you must prioritize the
business processes according to certain criteria. Criteria might include
business process significance, quality of data in the source systems, and
the feasibility and complexity of the business processes.
When you identify the business processes of a dimensional model, you
collect the following metadata:
● Business requirements for the selected business process for which
you will design the dimensional model
● Business processes
● Owners
● Source systems that will be used
● Data quality issues
● Common terms used across business processes
● Other business-related metadata
Dimensional analysis
One approach to data warehouse design is to develop and
implement a dimensional model. This has given rise to
dimensional analysis (sometimes generalized as
multi-dimensional analysis).
[Figure: parts of the requirements definition document: specific
requirements, information packages, other requirements, user
expectations, user participation and sign-off, general
implementation plan]
Requirements Definition Document Outline
1. Introduction. State the purpose and scope of the project. Include
broad project justification. Provide an executive summary of each
subsequent section.
2. General requirements descriptions. Describe the source systems
reviewed. Include interview summaries. Broadly state what types of
information requirements are needed in the data warehouse.
3. Specific requirements. Include details of source data needed. List the
data transformation and storage requirements. Describe the types of
information delivery methods needed by the users.
4. Information packages. Provide as much detail as possible for each
information package. Include in the form of package diagrams.
5. Other requirements. Cover miscellaneous requirements such as data
extract frequencies, data loading methods, and locations to which
information must be delivered.
6. User expectations. State the expectations in terms of problems and
opportunities. Indicate how the users expect to use the data
warehouse.
7. User participation and sign-off. List the tasks and activities in which
the users are expected to participate throughout the development life
cycle.
8. General implementation plan. At this stage, give a high-level plan
for implementation.