

DATA WAREHOUSING & DATA MINING


UNIT I
Data Warehousing, Business Analysis and On-Line Analytical Processing (OLAP): Basic Concepts,
Data Warehousing Components, Building a Data Warehouse, Database Architectures for Parallel
Processing, Parallel DBMS Vendors, Multidimensional Data Model, Data Warehouse Schemas for
Decision Support, Concept Hierarchies, Characteristics of OLAP Systems, Typical OLAP
Operations, OLAP and OLTP.

Data Warehouse Introduction


A data warehouse is a collection of data marts representing historical data from the different operations in a company, stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports and queries across it give consistent results.
A data warehouse can also be viewed as a database for historical data from different functions
within a company. The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following way: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as
follows:
 Subject Oriented: Data that gives information about a particular subject instead of about a
company's ongoing operations.
 Integrated: Data that is gathered into the data warehouse from a variety of sources and merged
into a coherent whole.
 Time-variant: All data in the data warehouse is identified with a particular time period.
 Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business. A data warehouse is thus a single, complete and consistent store of data, obtained from a variety of sources, made available to end users in a form they can understand and use in a business context. It can be used for decision support, to manage and control the business, and by managers and end users to understand the business and make judgments.

Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.

Other important terminology


 Enterprise Data warehouse: It collects all information about subjects (customers, products, sales,
assets, personnel) that span the entire organization
 Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a
data warehouse that can provide data for reporting and analysis on a section, unit, department or
operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.
 Decision Support System (DSS): Information technology to help the knowledge worker (executive, manager, and analyst) make faster and better decisions
 Drill-down: Traversing the summarization levels from highly summarized data to the underlying
current or old detail
 Metadata: Data about data. Containing location and description of warehouse system
components: names, definition, structure…

Benefits of data warehousing


 Data warehouses are designed to perform well with aggregate queries running on large amounts
of data.
 The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike relational databases, which are primarily designed to handle lots of transactions.
 Data warehouses enable queries that cut across different segments of a company's operation. E.g.
production data could be compared against inventory data even if they were originally stored in
different databases with different structures.
 Queries that would be complex in highly normalized databases can be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
 Data warehousing is an efficient way to manage and report on data that comes from a variety of
sources and is non-uniform and scattered throughout a company.
 Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
 Data warehousing provides the capability to analyze large amounts of historical data for nuggets
of wisdom that can provide an organization with competitive advantage.
Operational and informational Data
Operational Data:
 Focusing on transactional function such as bank card withdrawals and deposits
 Detailed
 Updateable
 Reflects current data


Informational Data:
 Focusing on providing answers to problems posed by decision makers
 Summarized
 Non updateable

Data Warehouse Characteristics


• A data warehouse can be viewed as an information system with the following attributes:
– It is a database designed for analytical tasks
– Its content is periodically updated
– It contains current and historical data to provide a historical perspective of information

Operational data store (ODS)


• ODS is an architectural concept to support day-to-day operational decision support and contains current-value data propagated from operational applications
• ODS is subject-oriented, similar to a classic definition of a Data warehouse
• ODS is integrated

Data warehouse Architecture and its seven components


1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system


A data warehouse is an environment, not a product. It is based on a relational database management system that functions as the central repository for informational data. The central repository is surrounded by a number of key components designed to make the environment functional, manageable and accessible.

The data for the data warehouse comes from operational applications. Data entered into the warehouse is transformed into an integrated structure and format. The transformation process involves conversion, summarization, filtering and condensation. The data warehouse must be capable of holding and managing large volumes of data as well as different data structures over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item 2 in the architecture above). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean up, and Transformation Tools
These are item 1 in the architecture above. They perform conversion, summarization, key changes, structural changes and condensation. The data transformation is required so that the information can be used by decision support tools. The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code etc., to move the data into the data warehouse from multiple operational systems.


The functionalities of these tools are listed below:


 To remove unwanted data from operational DB
 Converting to common data names and attributes
 Calculating summaries and derived data
 Establishing defaults for missing data
 Accommodating source data definition change.

Issues to be considered while data sourcing, cleanup, extract and transformation:


Data heterogeneity: This refers to the differing nature of DBMSs: the sources may use different data models, different access languages, and different data navigation methods, operations, concurrency, integrity and recovery processes.

3. Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two types:
1. Technical Meta data: It contains information about warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes,
 Info about data stores
 Transformation descriptions, i.e. mapping methods from the operational databases to the warehouse database
 Warehouse Object and data structure definitions for target data
 The rules used to perform clean up, and data enhancement
 Data mapping operations
 Access authorization, backup history, archive history, info delivery history, data acquisition history, data access etc.
2. Business Meta data: It contains information that describes the warehouse contents from the users' perspective. It includes,
 Subject areas and info object types, including queries, reports, images, video, audio clips etc.
 Internet home pages
 Info related to info delivery system
 Data warehouse operational info such as ownerships, audit trails etc.,
Meta data helps the users to understand content and find the data. Meta data is stored in a separate data store known as the information directory or Meta data repository, which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
 It is the gateway to the data warehouse environment
 It supports easy distribution and replication of content for high performance and availability
 It should be searchable by business oriented key words
 It should act as a launch platform for end user to access data and analysis tools
 It should support the sharing of info

 It should support scheduling options for requests
 It should support and provide interface to other applications
 It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Its purpose is to provide info to business users for decision making. There are five categories:
 Data query and reporting tools
 Application development tools
 Executive info system tools (EIS)
 OLAP tools
 Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools. They are:
 Production reporting tools, used to generate regular operational reports
 Desktop report writers, inexpensive desktop tools designed for end users
Managed query tools: used to generate SQL queries. They use meta-layer software between users and databases, which offers point-and-click creation of SQL statements. These tools are a preferred choice of users performing segment identification, demographic analysis, territory management, preparation of customer mailing lists etc.
Application development tools: This is a graphical data access environment which integrates OLAP tools with the data warehouse and can be used to access all DB systems.
OLAP tools: used to analyze the data in multidimensional and complex views. To enable multidimensional properties they use an MDDB (multidimensional database) or an MRDB (multirelational database).
Data mining tools: used to discover knowledge from data warehouse data; they can also be used for data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group, for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following situations:
 Extremely urgent user requirements
 The absence of a budget for a full-scale data warehouse strategy
 The decentralization of business needs
 The attraction of easy-to-use tools and a mid-sized project
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while designing it the organization has to pay attention to system scalability, consistency and manageability issues.
2. Data integration
6. Data warehouse admin and management
The management of data warehouse includes,


 Security and priority management


 Monitoring updates from multiple sources
 Data quality checks
 Managing and updating meta data
 Auditing and reporting data warehouse usage and status
 Purging data
 Replicating, sub setting and distributing data
 Backup and recovery
 Data warehouse storage management, which includes capacity planning, hierarchical storage management and purging of aged data etc.
7. Information delivery system
• It is used to enable the process of subscribing for data warehouse info and delivering it to one or more destinations according to a specified scheduling algorithm.

Building a Data warehouse:


There are two reasons why organizations consider data warehousing a critical need; in other words, two kinds of factors drive organizations to build and use a data warehouse:
Business factors:
 Business users want to make decisions quickly and correctly using all available data.
Technological factors:
 To address the incompatibility of operational data stores
 IT infrastructure is changing rapidly: its capacity is increasing and its cost is decreasing, so that building a data warehouse is feasible
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following two approaches:
 Top - Down Approach (Suggested by Bill Inmon)
 Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized repository to house
corporate wide business data. This repository is called Enterprise Data Warehouse (EDW). The data in the
EDW is stored in a normalized form in order to avoid redundancy. The central repository for corporate
wide data helps us maintain one version of the truth. The data in the EDW is stored at the most detailed level. The reason to build the EDW at the most detailed level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.

Once the EDW is implemented we start building subject-area-specific data marts, which contain data in de-normalized form, also called a star schema. The data in the marts is usually summarized based on the end users' analytical requirements. The reason to de-normalize the data in the mart is to provide faster access for end-user analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than queries on the de-normalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be reliable,
consistent across subject areas and for reconciliation in case of data related contention between
subject areas.

The disadvantage of using the Top Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented and the data marts to be built on top of it before they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to build a data
warehouse. Here we build the data marts separately at different points of time as and when the specific
subject area requirements are clear. The data marts are integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension or a conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts; it means exactly the same thing with every fact table it is joined to.
A Conformed fact has the same definition of measures, same dimensions joined to it and at the
same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when their requirements are clear. We do not have to wait to know the overall requirements of the warehouse.
We should implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity on only one data mart.
The advantage of using the Bottom Up approach is that it does not require a high initial cost and has a faster implementation time; hence the business can start using the marts much earlier than with the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in de-normalized form, so space usage for detailed data is high, and there is a tendency not to keep detailed data in this approach, losing the advantage of having detail data, i.e. the flexibility to easily cater to future requirements. The Bottom Up approach is more realistic, but the complexity of the integration may become a serious obstacle.

Design considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
 Are based on a dimensional model
 Contain historical and current data
 Include both detailed and summarized data
 Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build for the following reasons:
 Heterogeneity of data sources
 Use of historical data
 Growing nature of data base
The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse
framework. The data warehouse data must be a detailed data. It must be formatted, cleaned up and
transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to
find definitions or subject areas. In other words, it must provide decision support oriented pointers
to warehouse data and thus provides a logical link between warehouse data and decision support
applications.

Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow, so it becomes necessary to decide how the data should be divided across multiple servers and which users should get access to which types of data. The data can be distributed based on subject area, location (geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation of
the data warehouse. All selected tools must be compatible with the given data warehouse environment
and with each other. All tools must be able to use a common Meta data repository.


Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
 The hardware platform that would house the data warehouse
 The DBMS that supports the warehouse data
 The communication infrastructure that connects data marts, operational systems and end users
 The hardware and software to support meta data repository
 The systems management framework that enables admin of the entire environment
Implementation considerations
The following logical steps needed to implement a data warehouse:
 Collect and analyze business requirements
 Create a data model and a physical design
 Define data sources
 Choose the DB tech and platform
 Extract the data from operational DB, transform it, clean it up and load it into the
warehouse
 Choose DB access and reporting tools
 Choose DB connectivity software
 Choose data analysis and presentation s/w
 Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose a tool is based on the types of data that can be selected using it and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:
 Simple tabular form data
 Ranking data
 Multivariable data
 Time series data


 Graphing, charting and pivoting data


 Complex textual search data
 Statistical analysis data
 Data for testing of hypothesis, trends and patterns
 Predefined repeatable queries
 Ad hoc user specified queries
 Reporting and analysis data
 Complex queries with multiple joins, multi level sub queries and sophisticated search criteria
Data extraction, clean up, transformation and migration
Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture. When implementing a data warehouse, the following selection criteria, which affect the ability to transform, consolidate, integrate and repair the data, should be considered:
 Timeliness of data delivery to the warehouse
 The tool must have the ability to identify the particular data so that it can be read by the conversion tool
 The tool must support flat files and indexed files, since much corporate data is still stored in these formats
 The tool must have the capability to merge data from multiple data stores
 The tool should have specification interface to indicate the data to be extracted
 The tool should have the ability to read data from data dictionary
 The code generated by the tool should be completely maintainable
 The tool should permit the user to extract the required data
 The tool must have the facility to perform data type and character set translation
 The tool must have the capability to create summarization, aggregation and derivation of records
 The data warehouse database system must be able to load data directly from these tools

Data placement strategies


– As a data warehouse grows, there are at least two options for data placement. One is to put some of the data in the warehouse onto another storage medium.
– The second option is to distribute the data in the data warehouse across multiple servers.

User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the
warehouse.
There are three classes of users:
Casual users: are most comfortable retrieving info from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools for building standard or ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc reports. These users can engage in drill-down operations and may have experience of using reporting and query tools.


Expert users: These users tend to create their own complex queries and perform standard analysis on the info they retrieve. These users are knowledgeable about the use of query and report tools.

Benefits of data warehousing


Data warehouse usage includes,
– Locating the right info
– Presentation of info
– Testing of hypothesis
– Discovery of info
– Sharing the analysis

The benefits can be classified into two:


 Tangible benefits (quantifiable / measurable): These include,
– Improvement in product inventory
– Decrement in production cost
– Improvement in selection of target markets
– Enhancement in asset and liability management
 Intangible benefits (not easy to quantify): These include,
– Improvement in productivity by keeping all data in a single location and eliminating rekeying of data
– Reduced redundant processing
– Enhanced customer relation

Mapping the data warehouse architecture to Multiprocessor architecture


The functions of a data warehouse are based on relational database technology, which is implemented in a parallel manner. There are two advantages of having parallel relational database technology for a data warehouse:
 Linear speed-up: refers to the ability to increase the number of processors in order to reduce response time
 Linear scale-up: refers to the ability to provide the same performance on the same request as the database size increases
Types of parallelism
There are two types of parallelism:
 Inter query Parallelism: In which different server threads or processes handle multiple requests at
the same time.
 Intra query Parallelism: This form of parallelism decomposes the serial SQL query into lower
level operations such as scan, join, sort etc. Then these lower level operations are executed
concurrently in parallel.

Intra query parallelism can be done in either of two ways:



 Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data
 Vertical parallelism: This occurs among different tasks. All query components such as scan, join,
sort etc are executed in parallel in a pipelined fashion. In other words, an output from one task
becomes an input into another task.
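As a minimal sketch of horizontal parallelism (not from the source text): the same operation, here a filtered scan, runs concurrently on different partitions of invented toy data.

```python
from multiprocessing import Pool

def scan_partition(rows):
    # The "task" performed against one partition: filter and count.
    return sum(1 for r in rows if r % 2 == 0)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = 4
    # Round-robin partitioning: record i goes to partition i mod n.
    partitions = [data[i::n] for i in range(n)]
    with Pool(n) as pool:
        partial_counts = pool.map(scan_partition, partitions)
    # Combining the partial results gives the same answer as a serial scan.
    print(sum(partial_counts))
```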

Data partitioning:
Data partitioning is the key to effective parallel execution of database operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.
The various intelligent partitioning schemes include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.
Key range partitioning: Rows are placed in partitions according to the value of the partitioning key; for example, all rows with key values from A to K are in partition 1, L to T in partition 2, and so on.
Schema partitioning: An entire table is placed on one disk, another table on a different disk, etc. This is useful for small reference tables.
User-defined partitioning: Allows a table to be partitioned on the basis of a user-defined expression. (These schemes are sketched in code below.)
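A hedged sketch of the partitioning strategies above; the partition counts and key ranges are invented for illustration.

```python
def hash_partition(key, n_partitions):
    # Hash partitioning: partition number derived from the key's hash.
    # (Python's str hash is salted per process; a real DBMS would use a
    # stable hash function.)
    return hash(key) % n_partitions

def round_robin_partition(record_index, n_partitions):
    # Round-robin: each successive record goes to the next partition.
    return record_index % n_partitions

def key_range_partition(key):
    # Key-range: rows with keys A-K go to partition 1, L-T to partition 2,
    # everything else to partition 3 (mirroring the example above).
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    elif "L" <= first <= "T":
        return 2
    return 3

print(hash_partition("customer_42", 4))
print(round_robin_partition(10, 4))   # -> 2
print(key_range_partition("Miller"))  # -> 2
```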

Database architectures for parallel processing


There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture

Shared Memory Architecture


Tightly coupled shared memory systems have the following characteristics:
 Multiple PUs share memory.
 Each PU has full access to all shared memory through a common bus.
 Communication between nodes occurs via shared memory.
 Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can
be used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the
multiple PUs, and is accessible by all the PUs through a memory bus. Examples of tightly coupled
systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These
include various system components such as the memory bandwidth, PU to PU communication
bandwidth, the memory available on the system, the I/O bandwidth, and the bandwidth of the common
bus.

Parallel processing advantages of shared memory systems are these:


 Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
 Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
 Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems have the following characteristics:
 Each node consists of one or more PUs and associated memory.

 Memory is not shared between nodes.


 Communication occurs over a common high-speed bus.
 Each node has access to the same disks and other resources.
 A node can be an SMP if the hardware supports it.
 Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.

Such a cluster is composed of multiple tightly coupled nodes, and a Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX clusters and Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained to
ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance impact depends on the hardware and software components, such as the bandwidth of the high-speed bus through which the nodes communicate, and on DLM performance.
Parallel processing advantages of shared disk systems are as follows:
 Shared disk systems permit high availability. All data is accessible even if one node dies.
 These systems have the concept of one database, which is an advantage over shared nothing
systems.
 Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
 Inter-node synchronization is required, involving DLM overhead and greater
dependency on high-speed interconnect.


 If the workload is not partitioned well there may be high synchronization overhead.
 There is operating system overhead of running shared disk software.

Shared Nothing Architecture


Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the PU which owns it.

Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more PUs and disks can improve scale up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.

Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
 Shared nothing systems provide for incremental growth.
 System growth is practically unlimited.
 MPPs are good for read-only databases and decision support applications.
 Failure is local: if one node fails, the others stay up.
Disadvantages
 More coordination is required.
 More overhead is required for a process working on a disk belonging to another node.
 If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may
be worthwhile to consider data-dependent routing to alleviate contention.

Parallel DBMS features


 Scope and techniques of parallel DBMS operations
 Optimizer implementation
 Application transparency
 Parallel environment which allows the DBMS server to take full advantage of the existing
facilities on a very low level
 DBMS management tools help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS
 Price / Performance: The parallel RDBMS can demonstrate a non-linear speed-up and scale-up at reasonable costs.

Parallel DBMS vendors


1. Oracle:
 Oracle supports parallel database processing with its add-on Oracle Parallel Server Option
(OPS) and Parallel Query Option (PQO).
 OPS is designed for loosely coupled clusters of shared-disk systems.
 The PQO is optimized to run on SMPs or with the OPS. It supports all Oracle platforms
except NetWare and OS/2.
 Architecture: Central to the Oracle design is the notion of a virtual shared disk capability. PQO uses a shared-disk architecture which assumes that each processor node has access to all disks.
PQO supports parallel execution of queries and operations such as index build, database
load, backup and recovery.
 Data Partition: Oracle supports dynamic data repartitioning, which is done in memory using
Key range, hash, round robin.
 Parallel operations: Oracle PQO can parallelize most SQL operations, including joins, scans,
sorts, aggregates, and groupings. And also, Oracle can parallelize the creation of indexes,
database load, backup, and recovery. PQO supports both horizontal and vertical parallelism.
2. Informix:
 Informix runs on a variety of UNIX platforms. Informix OnLine release 8, also known as XPS (eXtended Parallel Server), supports MPP hardware platforms that include the IBM SP, AT&T 3600, Sun, HP, ICL Goldrush, Sequent and Siemens/Pyramid.
 Architecture: Informix developed its Dynamic Scalable Architecture (DSA) to support
shared-memory, shared-disk, and shared nothing models.
 Data partition: Informix Online 7 supports round robin, hash, schema, key range and user defined partitioning.
 Parallel operations: Informix Online 7 executes queries, INSERT, UPDATE and DELETE in parallel.
3. IBM: IBM's parallel client/server database product is DB2 Parallel Edition (DB2 PE).
 DB2 PE is based on the DB2/6000 server architecture.
 Architecture: DB2 PE is a Shared nothing model, in which all data is partitioned across
processor nodes.
 All database operations and utilities are fully parallelized.
 DB2 PE can run on LAN-based clusters.
 Data partition: DB2 PE supports hash partitioning and node groups that allow a table to
span multiple nodes. The DBA can choose to partition a table on a table-by-table basis.
 Parallel operations: All database operations (query processing, INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization) are fully parallelized.
4. SYBASE:
 SYBASE has implemented its parallel DBMS functionality in a product called SYBASE
MPP. It was jointly developed by Sybase and NCR.
 Architecture: SYBASE MPP is designed to make multiple distributed SQL Servers look
like a single server to the user. It is a Shared nothing system that partitions data across
multiple SQL servers and supports both function shipping and data repartitioning.
SYBASE MPP is an Open Server application that operates on top of existing SQL
Server.
 SYBASE MPP consists of specialized servers: Data server, DBA Server, Administrative
Server.
 Data partition: It supports hash, key range, and Schema partitioning.
 Parallel operations: All SQL statements and utilities are executed in parallel across the SQL Servers. SYBASE MPP supports horizontal and vertical parallelism.

Discretization and concept hierarchy generation


Discretization
 Three types of attributes:
 Nominal — values from an unordered set, e.g., color, profession
 Ordinal — values from an ordered set, e.g., military or academic rank
 Continuous — real numbers, e.g., temperature or income
 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis
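As a minimal sketch of the simplest unsupervised discretization (equal-width binning, an assumption-laden illustration rather than the source's method): the range of a continuous attribute is divided into k intervals and each value is replaced by its interval label.

```python
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        # Index of the interval this value falls into (last bin is closed).
        i = min(int((v - lo) / width), k - 1)
        bins.append(f"[{lo + i * width:.1f}, {lo + (i + 1) * width:.1f})")
    return bins

ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 52, 70]
print(equal_width_bins(ages, 3))
```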


Discretization and Concept Hierarchy


 Discretization
 Reduce the number of values for a given continuous attribute by dividing the range of the
attribute into intervals
 Interval labels can then be used to replace actual data values
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data


 Typical methods: All the methods can be applied recursively
 Binning (covered above)
 Top-down split, unsupervised,
 Histogram analysis (covered above)
 Top-down split, unsupervised
 Clustering analysis (covered above)
 Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split
 Interval merging by 2 Analysis: unsupervised, bottom-up merge
 Segmentation by natural partitioning: top-down split, unsupervised

Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$

 Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

where $p_i$ is the probability of class i in S1


 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary
discretization
 The process is recursively applied to partitions obtained until some stopping criterion is met
 Such a boundary may reduce data size and improve classification accuracy
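A minimal sketch of one step of this method, with invented (value, class) samples; it evaluates every candidate boundary T and returns the one minimizing the entropy I(S, T) defined above.

```python
import math

def entropy(labels):
    # Entropy of a class distribution: -sum(p_i * log2(p_i)).
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(samples):
    # samples: list of (value, class_label) pairs, e.g. (age, "yes"/"no").
    samples = sorted(samples)
    n = len(samples)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        t = (samples[i - 1][0] + samples[i][0]) / 2  # candidate boundary T
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        # I(S,T) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

data = [(23, "no"), (25, "no"), (31, "yes"), (35, "yes"), (46, "yes")]
print(best_split(data))  # boundary 28.0 separates the classes perfectly
```

The same function would then be applied recursively to each resulting partition until a stopping criterion is met.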

Interval Merge by 2 Analysis


 Merging-based (bottom-up) vs. splitting-based methods
 Merge: Find the best neighboring intervals and merge them to form larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]


 Initially, each distinct value of a numerical attr. A is considered to be one interval


 χ² tests are performed for every pair of adjacent intervals
 Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
 This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
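A hedged sketch of the χ² statistic ChiMerge computes for a pair of adjacent intervals; the class counts are invented for illustration.

```python
def chi_square(interval_a, interval_b):
    # Each interval is a dict: class label -> count of samples in it.
    classes = set(interval_a) | set(interval_b)
    total = sum(interval_a.values()) + sum(interval_b.values())
    chi2 = 0.0
    for interval in (interval_a, interval_b):
        n_interval = sum(interval.values())
        for c in classes:
            # Expected count if the two intervals shared one distribution.
            n_class = interval_a.get(c, 0) + interval_b.get(c, 0)
            expected = n_interval * n_class / total
            observed = interval.get(c, 0)
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Similar class distributions -> low chi-square -> good merge candidates.
print(chi_square({"yes": 4, "no": 1}, {"yes": 5, "no": 1}))
# Different distributions -> high chi-square -> keep the boundary.
print(chi_square({"yes": 5, "no": 0}, {"yes": 0, "no": 5}))
```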

Segmentation by Natural Partitioning


 A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range
into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4
intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5
intervals

Example of the 3-4-5 rule: Suppose profits range from Min = -$351 to Max = $4,700, with the 5th percentile Low = -$159 and the 95th percentile High = $1,838.
Step 1: Work with Low and High (rather than Min and Max) so that extreme values do not distort the partitioning.
Step 2: The most significant digit of the Low/High range gives msd = 1,000; rounding Low down and High up to that digit gives Low' = -$1,000 and High' = $2,000.
Step 3: The range (-$1,000, $2,000] covers 3 distinct values at the msd, so it is partitioned into 3 equi-width intervals: (-$1,000, $0], ($0, $1,000] and ($1,000, $2,000].
Step 4: Adjust the intervals to cover the actual Min and Max: Min = -$351 already falls in the first interval, but Max = $4,700 exceeds $2,000, so a new interval ($2,000, $5,000] is added. Each interval can then be partitioned recursively by the same rule, e.g. ($0, $1,000] into ($0, $200], ($200, $400], ($400, $600], ($600, $800] and ($800, $1,000].
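A minimal sketch of the rule in code (a simplification: it splits 7 values equi-width, whereas the full rule uses a 2-3-2 grouping for 7); endpoints are assumed already rounded to the most significant digit, as in Step 2 above.

```python
def three_four_five_split(low, high, msd):
    # Number of distinct values at the most significant digit over (low, high].
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        return None  # outside the cases covered by the rule
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

# Step 3 of the worked example: (-1000, 2000] covers 3 distinct values at
# msd = 1000, so it is split into 3 equi-width intervals.
print(three_four_five_split(-1000, 2000, 1000))
# -> [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```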

Concept Hierarchy Generation for Categorical Data


 Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others

 Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct
values
 E.g., for a set of attributes: {street, city, state, country}
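As a minimal sketch of an explicitly specified hierarchy for a set of values (the grouping above): generalizing a value means mapping it to its higher-level concept. The "USA" level is an invented extension of the example.

```python
# Explicit data grouping: {Urbana, Champaign, Chicago} < Illinois.
hierarchy = {
    "Urbana": "Illinois",
    "Champaign": "Illinois",
    "Chicago": "Illinois",
    "Illinois": "USA",  # hypothetical next level, for illustration
}

def generalize(value, levels=1):
    # Climb the hierarchy the requested number of levels.
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value

print(generalize("Chicago"))            # -> Illinois
print(generalize("Chicago", levels=2))  # -> USA
```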

Automatic Concept Hierarchy Generation


 Some hierarchies can be automatically generated based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is placed at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
For example, given the distinct-value counts below, the generated hierarchy is street < city < province_or_state < country:
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
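A sketch of this heuristic in code, using the counts quoted above: attributes are ordered by their number of distinct values, with the most distinct placed at the lowest level.

```python
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Lowest hierarchy level = attribute with the most distinct values.
hierarchy = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(hierarchy))
# -> street < city < province_or_state < country
```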

A multidimensional data model:-


Data warehouses and OLAP tools are based on a multidimensional data model. This model
views data in the form of a data cube.
Data cube:- A data cube allows data to be modeled and viewed in multiple dimensions. It
is defined by dimensions and facts.
Dimensions are the attributes with respect to which an organization wants to keep
records. For example, AllElectronics may create a sales data warehouse in order to keep
records with respect to the dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of items, and the branches and
locations at which the items were sold. Each dimension may have a table associated with
it, called a dimension table. Dimension tables can be specified by users or experts, or
automatically generated. A multidimensional data model is typically organized around a
central theme, like sales, for instance. This theme is represented by a fact table. Facts are
numerical measures. Facts are the quantities used to analyze relationships between
dimensions. Examples of facts for a sales data warehouse include dollars sold (sales
amount in dollars), units sold (number of units sold), and amount budgeted. The fact table
contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.
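As a minimal illustration of this model (a sketch, not the source's example data): each fact is keyed by its dimension values, and aggregating a measure over two dimensions yields the 2-D view described next. All records and numbers are invented.

```python
from collections import defaultdict

facts = [
    # (time, item, location, dollars_sold)
    ("Q1", "home entertainment", "Vancouver", 825),
    ("Q1", "computer", "Vancouver", 1400),
    ("Q2", "home entertainment", "Vancouver", 952),
    ("Q2", "computer", "Vancouver", 1500),
]

view = defaultdict(int)
for time, item, location, dollars in facts:
    if location == "Vancouver":          # fix the location dimension
        view[(time, item)] += dollars    # aggregate the measure

for (time, item), dollars in sorted(view.items()):
    print(f"{time:3} {item:20} {dollars}")
```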


Fig below represents a 2-D view of sales details for the city Vancouver with respect to the dimensions time and item.

Fig below represents a 3-D view of sales details with respect to the dimensions time, item and location. 3-D data can be represented as a series of 2-D tables.

Fig below represents a 3-D data cube view of sales details with respect to the dimensions time, item and location.

Fig below represents a 4-D data cube view of sales details with respect to the dimensions time, item, location and supplier.

In data warehousing terminology, each of the cubes above is referred to as a cuboid. A data cube consists of a lattice of cuboids, each showing the data at a different level of summarization; the lattice of cuboids is collectively referred to as the data cube.

Figure shows a lattice of cuboids forming a data cube for the dimensions time, item,
location, and supplier. The cuboid which holds the lowest level of summarization is called
the base cuboid. For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions. A 3-D (non-base) cuboid holds the data for, say, time, item, and location, summarized for all suppliers. The 0-D cuboid which holds the highest
level of summarization is called the apex cuboid. The apex cuboid is typically denoted by
all.
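The lattice is easy to enumerate, since every subset of the dimension set defines one cuboid. A minimal sketch, using the four dimensions of the running example:

```python
from itertools import combinations

dims = ("time", "item", "location", "supplier")
for k in range(len(dims), -1, -1):
    for cuboid in combinations(dims, k):
        label = ", ".join(cuboid) if cuboid else "all (apex cuboid)"
        print(f"{k}-D: {label}")
# A cube with n dimensions has 2**n cuboids: here 2**4 = 16,
# from the 4-D base cuboid down to the 0-D apex cuboid.
```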

DBMS schemas for decision support


Stars, snowflakes, and fact constellations: Schemas for multidimensional databases:-
The entity-relationship model is commonly used in the design of relational databases. It consists of a set of entities or objects and the relationships between them. Such a data model is appropriate for on-line transaction processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis. The most popular data model for data warehouses is the multidimensional model. This model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
 Star schema: The star schema is a modeling paradigm in which the data warehouse
contains (1) a large central table (fact table), and (2) a set of smaller dimension
tables one for each dimension. The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around the central fact table.

An example of a star schema for AllElectronics sales considers sales along four dimensions, namely time, item, branch, and location. The schema contains a central fact table for sales which contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold. In the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, state, country}.
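A sketch of this star schema in code (not taken from the source figure): the tables are created with sqlite3; the column lists follow the description above, while the SQL types and some dimension attributes are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- The central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```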
 Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.


The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage space. However, the snowflake structure can reduce query effectiveness, since more joins are needed to execute a query; this may affect system performance.
Fig:- Snowflake schema of a data warehouse for sales.
A compromise between the star schema and the snowflake schema is to adopt
a mixed schema where only the very large dimension tables are normalized.
Normalizing large dimension tables saves storage space, while keeping small
dimension tables unnormalized may reduce the cost and performance degradation
due to joins on multiple dimension tables. Doing both may lead to an overall
performance gain.
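As a sketch of the snowflake variant (an assumption-laden illustration, continuing the star schema code above): the location dimension is normalized by splitting city/state/country into a separate table, so reassembling a full location now costs an extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)  -- replaces city/state/country
);
""")
# Reassembling the full location requires the extra join:
# SELECT l.street, c.city, c.state, c.country
# FROM location_dim l JOIN city_dim c ON l.city_key = c.city_key;
```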
 Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of stars,
and hence is called a galaxy schema or a fact constellation.

Fig:- Fact constellation schema of a data warehouse for sales and shipping.

An example of a fact constellation schema is shown in the figure. This schema specifies two fact tables, sales and shipping. A fact constellation schema allows
dimension tables to be shared between fact tables. The dimensions tables for time,
item, and location, are shared between both the sales and shipping fact tables.

Typical OLAP Operations

 Roll up (drill-up): summarize data


 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or detailed data, or introducing new
dimensions
 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes

 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
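A hedged pandas sketch of the main operations above, run against an invented sales cube (pandas is assumed to be available; the data and column names are illustrative only).

```python
import pandas as pd

cube = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["computer", "phone", "computer", "phone", "computer", "phone"],
    "location": ["Vancouver", "Vancouver", "Vancouver", "Toronto", "Toronto", "Toronto"],
    "dollars_sold": [1400, 800, 1500, 900, 1200, 950],
})

# Roll-up: summarize by dimension reduction (aggregate location away).
rollup = cube.groupby(["quarter", "item"], as_index=False)["dollars_sold"].sum()

# Slice: select on one dimension.
slice_ = cube[cube["location"] == "Vancouver"]

# Dice: select on two or more dimensions.
dice = cube[(cube["quarter"] == "Q1") & (cube["item"] == "computer")]

# Pivot: reorient the cube into a 2-D (item x quarter) view.
pivot = cube.pivot_table(index="item", columns="quarter",
                         values="dollars_sold", aggfunc="sum")

print(rollup, slice_, dice, pivot, sep="\n\n")
```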

OLAP Server Architectures


 Relational OLAP (ROLAP)
 Use relational or extended-relational DBMS to store and manage warehouse data and OLAP
middle ware
 Include optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Red Brick)
 Specialized support for SQL queries over star/snowflake schemas

OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume
that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing and maintaining data integrity in multi-access environments; effectiveness is measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).

The following summarizes the major differences between OLTP system design (Online Transaction Processing; the operational system) and OLAP system design (Online Analytical Processing; the data warehouse).

- Source of data: OLTP — operational data; OLTPs are the original source of the data. OLAP — consolidated data; OLAP data comes from the various OLTP databases.
- Purpose of data: OLTP — to control and run fundamental business tasks. OLAP — to help with planning, problem solving, and decision support.
- What the data reveals: OLTP — a snapshot of ongoing business processes. OLAP — multi-dimensional views of various kinds of business activities.
- Inserts and updates: OLTP — short and fast inserts and updates initiated by end users. OLAP — periodic long-running batch jobs refresh the data.
- Queries: OLTP — relatively standardized and simple queries returning relatively few records. OLAP — often complex queries involving aggregations.
- Processing speed: OLTP — typically very fast. OLAP — depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
- Space requirements: OLTP — can be relatively small if historical data is archived. OLAP — larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.
- Database design: OLTP — highly normalized with many tables. OLAP — typically de-normalized with fewer tables; uses star and/or snowflake schemas.
- Backup and recovery: OLTP — back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP — instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Data Warehouse vs. Heterogeneous DBMS


 Traditional heterogeneous DB integration: A query driven approach
 Build wrappers/mediators on top of heterogeneous databases
 When a query is posed to a client site, a meta-dictionary is used to translate the query into
queries appropriate for individual heterogeneous sites involved, and the results are integrated
into a global answer set
 Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance and stored in warehouses for
direct query and analysis

Data Warehouse vs. Operational DBMS


 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration,
accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP


                    OLTP                                 OLAP
users               clerk, IT professional               knowledge worker
function            day-to-day operations                decision support
DB design           application-oriented                 subject-oriented
data                current, up-to-date, detailed,       historical, summarized,
                    flat relational, isolated            multidimensional, integrated, consolidated
usage               repetitive                           ad-hoc
access              read/write,                          lots of scans
                    index/hash on primary key
unit of work        short, simple transaction            complex query
# records accessed  tens                                 millions
# users             thousands                            hundreds
DB size             100 MB - GB                          100 GB - TB
metric              transaction throughput               query throughput, response time


A three-tier data warehouse architecture:- Data warehouses often adopt a three-tier architecture. The bottom tier is a warehouse database server, which is almost always a relational database system. The middle tier is an OLAP server, typically implemented using either a Relational OLAP (ROLAP) model or a Multidimensional OLAP (MOLAP) model. The top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining tools.
From the architecture point of view, there are three data warehouse models:
the enterprise warehouse, the data mart, and the virtual warehouse.
 Enterprise warehouse: An enterprise warehouse collects all of the information
about the entire organization. It provides corporate-wide data integration. It
typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise
data warehouse may be implemented on traditional mainframes, UNIX
superservers, etc., It requires extensive business modeling and may take years to
design and build.

Fig:- A three-tier data warehousing architecture.


 Data mart: A data mart contains a subset of corporate-wide data which is useful
for a specific group of users. The scope is confined to specific, selected subjects.

For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost UNIX servers or Windows/NT servers etc. The implementation cycle of a data mart is measured in weeks rather than months or years. Depending on the source of data, data marts can be categorized into the following two classes: Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
 Virtual warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible summary
views may be materialized. A virtual warehouse is easy to build but requires
excess capacity on operational database servers.
