
Unit-2

Data Warehouse:

A data warehouse is a central repository where data from many sources is stored for analysis and mining. It is a high-performance system with exceptionally large data storage capacity. Data from an organization's various systems is copied into the warehouse, where it is cleansed and conformed to remove errors. Complex analytical queries can then be run against the warehouse's store of data.

A data warehouse combines data from numerous sources in a way that ensures data quality, accuracy, and consistency. It also boosts system performance by separating analytics processing from transactional databases. Data flows into the data warehouse from the different source databases and is organized into a schema that describes the format and types of the data; query tools then examine the data tables using that schema.

Multi-Dimensional Data Model

A multidimensional model views data in the form of a data cube. A data cube enables data to be modelled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for item may contain the attributes item_name, brand, and type.

Prepared By: AlpeshLimbachiya

A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts (measures) and keys to each of the related dimension tables.
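
As an illustrative sketch (all table contents and figures below are invented), the fact/dimension split can be modelled directly: dimension tables hold descriptive attributes, and each fact row holds keys into those tables plus a numeric measure.

```python
# Minimal star-schema sketch: dimension tables keyed by surrogate ids,
# and a fact table whose rows reference those ids plus a numeric measure.
item_dim = {
    1: {"item_name": "Laptop", "brand": "Acme", "type": "electronics"},
    2: {"item_name": "Desk",   "brand": "Oak",  "type": "furniture"},
}
time_dim = {1: {"quarter": "Q1"}, 2: {"quarter": "Q2"}}

# Each fact row: (time_key, item_key, rupees_sold_in_thousands)
sales_fact = [(1, 1, 605), (1, 2, 825), (2, 1, 680), (2, 2, 952)]

# Total Q1 sales across all items (joining fact rows to the time dimension)
q1_total = sum(m for t, i, m in sales_fact if time_dim[t]["quarter"] == "Q1")
print(q1_total)  # 1430
```

The query at the end is the kind of aggregation a warehouse query tool would run by joining the fact table to its dimension tables.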

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown with respect to the time dimension (organized in quarters) and the item dimension (classified according to the type of item sold). The fact or measure displayed is rupees sold (in thousands).

Now suppose we want to view the sales data with a third dimension. For example, suppose the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
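
The series-of-2D-tables idea can be sketched in plain Python (the sales figures below are invented): the cube is a nested mapping addressed by (location, quarter, item), and fixing one location yields one 2D table of the series.

```python
from collections import defaultdict

# Flat sales records: (location, quarter, item, rupees_sold_in_thousands)
records = [
    ("Delhi",   "Q1", "phone", 605), ("Delhi",   "Q1", "tv", 825),
    ("Delhi",   "Q2", "phone", 680), ("Chennai", "Q1", "phone", 500),
    ("Chennai", "Q2", "tv",    410), ("Mumbai",  "Q1", "tv",  300),
]

# Build the 3D cube as a nested mapping: cube[location][quarter][item]
cube = defaultdict(lambda: defaultdict(dict))
for loc, quarter, item, sales in records:
    cube[loc][quarter][item] = sales

# Each cube[loc] is one 2D table of the series described in the text.
print(cube["Delhi"]["Q1"]["tv"])  # 825
print(sorted(cube.keys()))        # ['Chennai', 'Delhi', 'Mumbai']
```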

Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support users' ad-hoc data requirements, an activity dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Types of Data Warehouse Architectures

Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

The figure shows that the only layer physically available is the source layer. In this approach, the data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them; in this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the
two-tier architecture for a data warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is originally stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into the data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized repository: the data warehouse. The warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate-information navigators, complex query optimizers, and user-friendly GUIs.
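
The four stages above can be sketched end to end in a few lines of Python; the source formats and field names below are invented for illustration.

```python
# Two heterogeneous sources: a relational-style table and a legacy CSV line.
relational_rows = [{"cust": "Asha", "amt": "1200"}, {"cust": "Ravi", "amt": None}]
legacy_lines = ["Meena,950"]

def extract():
    # Source layer: pull raw records from both sources into one stream.
    for row in relational_rows:
        yield {"customer": row["cust"], "amount": row["amt"]}
    for line in legacy_lines:
        name, amount = line.split(",")
        yield {"customer": name, "amount": amount}

def transform(records):
    # Data staging: cleanse (fill gaps, cast types) into one standard schema.
    for r in records:
        amount = int(r["amount"]) if r["amount"] is not None else 0
        yield {"customer": r["customer"].strip(), "amount": amount}

warehouse = []  # data warehouse layer (a toy stand-in)

def load(records):
    warehouse.extend(records)

load(transform(extract()))
print(len(warehouse), sum(r["amount"] for r in warehouse))  # 3 2150
```

The analysis layer would then query `warehouse`, as the final `sum` hints at.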

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.

Data Warehouse Implementation
Implementing a data warehouse involves the following steps:

1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This step involves consulting senior management as well as the different stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access methods, and indexing.

5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting a suitable ETL tool vendor and purchasing and implementing the tools. This may include customizing the tool to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.

8. User applications: For the data warehouse to be useful, there must be end-user applications. This step involves designing and implementing the applications required by the end users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the applications may be rolled out for the user community to use.

Implementation Guidelines

1. Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created first with one particular project in mind; once it is implemented, several other sections of the enterprise may also want to implement similar systems. An enterprise data warehouse can then be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who can interact with people across the enterprise and actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management.

4. Ensure quality: Only data that has been cleansed and is of a quality accepted by the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with the corporate strategy and business goals. The purpose of the project must be defined before the project begins.

6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only sources of information, undermining the project.

7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.

8. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change.

9. Joint management: The project must be handled by both IT and business professionals in the enterprise. To ensure proper communication with stakeholders and that the project targets assisting the enterprise's business, business professionals must be involved in the project along with technical professionals.

Data Warehousing to Data Mining

Data Warehouse:

A data warehouse, as described above, is a central repository where data from many sources is stored for analysis and mining.

Important Features of Data Warehouse


The Important features of Data Warehouse are given below:

1. Subject Oriented

A data warehouse is subject-oriented. It provides useful data about a subject rather than about the company's ongoing operations; subjects can be customers, suppliers, marketing, product, promotion, etc. A data warehouse usually focuses on the modeling and analysis of data to help the organization make data-driven decisions.

2. Time-Variant

The data present in the data warehouse provides information for specific periods of time.

3. Integrated

A data warehouse is built by integrating data from heterogeneous sources, such as relational databases, flat files, etc.

4. Non-Volatile

Once data has entered the warehouse, it cannot be changed.

Advantages of Data Warehouse:

1. More accurate data access


2. Improved productivity and performance
3. Cost-efficient
4. Consistent and quality data

Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge sets of data that have either been compiled by computer systems or downloaded into the computer. In the data mining process, the computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among data sets.

Data mining aims to enable business organizations to view business behaviors, trends, and relationships that allow them to make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools use AI, statistics, databases, and machine learning systems to discover relationships in the data, and can answer business questions that were traditionally too time-consuming to resolve.

Important features of Data Mining:


The important features of Data Mining are given below:

1. It utilizes automated discovery of patterns.
2. It predicts the expected results.
3. It focuses on large data sets and databases.
4. It creates actionable information.

Advantages of Data Mining:


i. Market Analysis:

Data mining can predict market behavior, which helps a business make decisions. For example, it predicts who is likely to purchase what type of products.

ii. Fraud detection:

Data Mining methods can help to find which cellular phone calls, insurance
claims, credit, or debit card purchases are going to be fraudulent.

iii. Financial Market Analysis:

Data mining techniques are widely used to help model financial markets.

iv. Trend Analysis:

Analyzing current trends in the marketplace is a strategic benefit because it helps in cost reduction and in aligning manufacturing with market demand.

Differences between Data Mining and Data Warehousing:

1. Data mining is the process of determining data patterns, whereas a data warehouse is a database system designed for analytics.

2. Data mining is generally considered the process of extracting useful data from a large set of data, whereas data warehousing is the process of combining all the relevant data.

3. Business entrepreneurs carry out data mining with the help of engineers, whereas data warehousing is carried out entirely by the engineers.

4. In data mining, data is analyzed repeatedly, whereas in data warehousing, data is stored periodically.

5. Data mining uses pattern recognition techniques to identify patterns, whereas data warehousing is the process of extracting and storing data to allow easier reporting.

6. A notable data mining technique is the detection and identification of unwanted errors that occur in the system, whereas an advantage of the data warehouse is its ability to be updated frequently, which is why it is ideal for businesses that want to stay up to date.

7. Data mining techniques are cost-efficient compared to other statistical data applications, whereas the responsibility of the data warehouse is to simplify every type of business data.

8. Data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions, whereas with a data warehouse there is a possibility that the data required for analysis may not have been integrated into the warehouse, which can lead to loss of information.

9. Companies can benefit from data mining as an analytical tool by acquiring suitable and accessible knowledge-based data, whereas a data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.

Data Generalization

Data generalization is the process of summarizing data by replacing relatively low-level values with higher-level concepts. It is a form of descriptive data mining.

There are two basic approaches to data generalization:

1. Data cube approach:

It is also known as the OLAP approach.

It is an efficient approach, as it is helpful for analyses such as charting past sales.

In this approach, computation and results are stored in the data cube.

It uses roll-up and drill-down operations on a data cube.

These operations typically involve aggregate functions, such as count(), sum(), average(), and max().

These materialized views can then be used for decision support, knowledge discovery, and many other applications.
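
A roll-up using the sum() aggregate can be sketched over base cuboid cells (figures invented); drill-down is simply the reverse, returning to the finer-grained cells.

```python
from collections import defaultdict

# Base cuboid cells: (city, quarter) -> rupees_sold_in_thousands
base = {("Delhi", "Q1"): 605, ("Delhi", "Q2"): 680,
        ("Chennai", "Q1"): 500, ("Chennai", "Q2"): 410}

# Roll-up along the location dimension: aggregate city -> all cities.
by_quarter = defaultdict(int)
for (city, quarter), sales in base.items():
    by_quarter[quarter] += sales  # the sum() aggregate function

print(dict(by_quarter))  # {'Q1': 1105, 'Q2': 1090}
```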

2. Attribute-oriented induction:

It is an on-line, query-oriented, generalization-based approach to data analysis.

In this approach, we perform generalization on the basis of different values of each attribute within the relevant data set. After that, identical tuples are merged and their respective counts are accumulated in order to perform aggregation.

Unlike the data cube approach, which performs off-line aggregation before an OLAP or data mining query is submitted for processing, the attribute-oriented induction approach, at least in its initial proposal, is a relational database, query-oriented, generalization-based, on-line data analysis technique.

It is not limited to particular measures or to categorical data.

The attribute-oriented induction approach uses two methods:

(i). Attribute removal.

(ii). Attribute generalization.
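
Both methods can be sketched together (the concept hierarchy below is invented): attribute generalization replaces city values with regions, after which identical generalized tuples are merged and their counts accumulated.

```python
from collections import Counter

# Concept hierarchy for attribute generalization: city -> region
region_of = {"Delhi": "North", "Chandigarh": "North", "Chennai": "South"}

tuples = [("Delhi", "phone"), ("Chandigarh", "phone"),
          ("Chennai", "phone"), ("Delhi", "tv")]

# Generalize the city attribute, then merge identical tuples with counts.
generalized = Counter((region_of[city], item) for city, item in tuples)
print(generalized[("North", "phone")])  # 2
```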

Road Map

Earlier material introduced the basic concepts, techniques, and applications of frequent pattern mining using market basket analysis as an example. Many other kinds of data, user requests, and applications have led to the development of numerous, diverse methods for mining patterns, associations, and correlation relationships. Given the rich literature in this area, it is important to lay out a clear road map to help us get an organized picture of the field and to select the best methods for pattern mining applications.

A general road map of pattern mining research can be outlined as follows. Most studies mainly address three pattern mining aspects: the kinds of patterns mined, mining methodologies, and applications. Some studies, ...

Scalable Frequent Itemset Mining Methods

1. Apriori: A Candidate Generation-and-Test Approach


2. Improving the Efficiency of Apriori
3. FPGrowth: A Frequent Pattern-Growth Approach
4. ECLAT: Frequent Pattern Mining with Vertical Data Format

The Downward Closure Property and Scalable Mining Methods

1. The downward closure property of frequent patterns


1. Any subset of a frequent itemset must be frequent
2. If {beer, diaper, nuts} is frequent, so is {beer, diaper}
3. i.e., every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
2. Scalable mining methods: Three major approaches
1. Apriori (Agrawal & Srikant, VLDB'94)
2. Frequent pattern growth (FPGrowth, by Han, Pei & Yin, SIGMOD'00)
3. Vertical data format approach (Charm, by Zaki & Hsiao, SDM'02)
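
The downward closure property is what makes Apriori's candidate generation-and-test loop feasible; a compact sketch over toy transactions (invented data):

```python
from itertools import combinations

transactions = [{"beer", "diaper", "nuts"}, {"beer", "diaper"},
                {"beer", "cola"}, {"diaper", "nuts"}]
min_sup = 2  # absolute minimum support threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
freq = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
frequent = list(freq)
k = 2
while freq:
    # Generate k-candidates by joining (k-1)-frequent sets; downward
    # closure lets us prune any candidate with an infrequent subset.
    candidates = {a | b for a in freq for b in freq if len(a | b) == k}
    freq = [c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_sup]
    frequent += freq
    k += 1

print(frozenset({"beer", "diaper"}) in frequent)  # True
```

Here {beer, nuts} appears in only one transaction, so it is pruned, and downward closure then rules out {beer, diaper, nuts} without counting it.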

Association Rule
Association rule learning is commonly carried out with three algorithms:

Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the
products that can be bought together. It can also be used in the healthcare
field to find drug reactions for patients.

Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search to find frequent itemsets in a transaction database, and it typically executes faster than the Apriori algorithm.
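
Eclat's vertical data format stores, for each item, the set of transaction ids (a tid-set); the support of an itemset is the size of the intersection of its items' tid-sets. A minimal sketch with invented data:

```python
# Vertical (tid-set) representation of four transactions (ids 0..3)
tids = {"beer": {0, 1, 2}, "diaper": {0, 1, 3}, "nuts": {0, 3}}

# Support of {beer, diaper} = |tids(beer) ∩ tids(diaper)|
support = len(tids["beer"] & tids["diaper"])
print(support)  # 2
```

Depth-first search extends itemsets one item at a time, intersecting tid-sets as it goes, which is why no repeated database scans are needed.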

F-P Growth Algorithm


The FP-Growth algorithm stands for Frequent Pattern Growth; it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are
some popular applications of association rule learning:

1. Market Basket Analysis: It is one of the popular examples and


applications of association rule mining. This technique is commonly used
by big retailers to determine the association between items.
2. Medical Diagnosis: Association rules can support diagnosis by helping to identify the probability of illness for a particular disease.
3. Protein Sequence: Association rules help in determining the synthesis of artificial proteins.
4. It is also used for the Catalog Design and Loss-leader Analysis and many
more other applications.
Association Mining to Correlation Analysis

Most association rule mining algorithms employ a support-confidence framework. Often, many interesting rules can be found only at low support thresholds. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many of the rules so generated are still not interesting to the users.

1)Strong Rules Are Not Necessarily Interesting: An Example

Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another. However, objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of weeding out uninteresting rules from presentation to the user.

The support and confidence measures are insufficient at filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form

A ⇒ B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.
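
One common correlation measure is lift, defined as lift(A, B) = P(A ∪ B) / (P(A) P(B)); values above 1 indicate positive correlation and values below 1 negative correlation. A minimal sketch with invented counts:

```python
n = 1000                          # total transactions (invented figures)
n_a, n_b, n_ab = 600, 750, 400    # counts containing A, B, and both

p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
lift = p_ab / (p_a * p_b)
print(round(lift, 3))  # 0.889 -> below 1, so A and B are negatively correlated
```

A strong rule by support and confidence alone could still have lift below 1, which is exactly the weakness the correlation measure addresses.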
Constraint-Based Association Mining

A data mining process may uncover thousands of rules from a given set of data, most of which end up being unrelated or uninteresting to the users. Often, users have a good sense of which "direction" of mining may lead to interesting patterns and the "form" of the patterns or rules they would like to find. Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining. The constraints can include the following:

1. Metarule-Guided Mining of Association Rules

“How are metarules useful?” Metarules allow users to specify the syntactic
form of rules that they are interested in mining. The rule forms can be used as
constraints to help improve the efficiency of the mining process. Metarules
may be based on the analyst’s experience, expectations, or intuition regarding
the data or may be automatically generated based on the database schema.

Metarule-guided mining: Suppose that as a market analyst for AllElectronics, you have access to the data describing customers (such as customer age, address, and credit rating) as well as the list of customer transactions. You are interested in finding associations between customer traits and the items that customers buy. However, rather than finding all of the association rules reflecting these relationships, you are particularly interested only in determining which pairs of customer traits promote the sale of office software. A metarule can be used to specify this information describing the form of rules you are interested in finding. An example of such a metarule is

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")

where P1 and P2 are predicate variables that are instantiated to attributes from the given database during the mining process, X is a variable representing a customer, and Y and W take on values of the attributes assigned to P1 and P2, respectively. Typically, a user will specify a list of attributes to be considered for instantiation with P1 and P2. Otherwise, a default set may be used.

2. Constraint Pushing: Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initialization of variables, and constraints on aggregate functions. Users typically employ their knowledge of the application or data to specify rule constraints for the mining task. These rule constraints may be used together with, or as an alternative to, metarule-guided mining. In this section, we examine how rule constraints can be used to make the mining process more efficient. Let's study an example where rule constraints are used to mine hybrid-dimensional association rules.

Our association mining query is: "Find the sales of which cheap items (where the sum of the prices is less than $100) may promote the sales of which expensive items (where the minimum price is $500) of the same group for Chicago customers in 2004." This can be expressed in the DMQL data mining query language.
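
Although the DMQL query itself is not reproduced here, the two price constraints it states can be sketched directly in Python (prices invented); note that the sum-of-prices constraint is anti-monotone, so itemsets that violate it can be pruned early during mining.

```python
price = {"pen": 2, "notebook": 5, "printer": 550, "laptop": 900}

cheap_sets = [{"pen", "notebook"}]          # candidate cheap-item groups
expensive_sets = [{"printer"}, {"laptop"}]  # candidate expensive-item groups

# Anti-monotone constraint: sum of prices < 100 (adding items can only
# increase the sum, so any violating set stays violated and can be pruned).
cheap_ok = [s for s in cheap_sets if sum(price[i] for i in s) < 100]
# Constraint on the expensive side of the rule: minimum price >= 500.
exp_ok = [s for s in expensive_sets if min(price[i] for i in s) >= 500]

print(len(cheap_ok), len(exp_ok))  # 1 2
```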
