Unit-2: Multi-Dimensional Data Model
Data Warehouse:
A Data Warehouse refers to a central repository where data is stored for useful mining.
It is a fast computer system with exceptionally large data storage capacity. Data from
an organization's various systems is copied to the warehouse, where it can be fetched
and conformed to remove errors. Complex analytical queries can then be run against
the warehoused data.
A data warehouse combines data from numerous sources, which helps ensure data
quality, accuracy, and consistency. It also boosts system performance by separating
analytics processing from transactional databases. Data flows into a data warehouse
from different databases, and the warehouse organizes it into a schema that describes
the format and types of the data. Query tools then examine the data tables using this
schema.
Consider the data of a shop for items sold per quarter in the city of
Delhi, shown in the table. In this 2-D representation, the sales for
Delhi are shown with respect to the time dimension (organized in
quarters) and the item dimension (classified according to the type of
item sold). The fact, or measure, displayed is rupees sold (in
thousands).
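To make this 2-D layout concrete, here is a small Python sketch (using pandas) that builds such an item-by-quarter view. The item names and sales figures are hypothetical placeholders, not the values from the table referenced above.

import pandas as pd

# Hypothetical transaction-level sales records for Delhi (placeholder values).
records = pd.DataFrame({
    "city": ["Delhi"] * 4,
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item": ["phone", "computer", "phone", "computer"],
    "rupees_sold_thousands": [605, 825, 680, 952],
})

# Pivot into the 2-D view described above: item rows x quarter columns,
# with rupees sold (in thousands) as the fact/measure.
view_2d = records.pivot_table(index="item", columns="quarter",
                              values="rupees_sold_thousands", aggfunc="sum")
print(view_2d)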
As the figure shows, in the single-tier architecture the only layer
physically available is the source layer. In this approach, the data
warehouse is virtual: it is implemented as a multidimensional view of
operational data created by specific middleware, or an intermediate
processing layer.
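As a rough sketch of this idea, the following Python snippet (using the standard sqlite3 module; all table and column names are hypothetical) defines a summary view directly over an operational table. Nothing is materialized: the "warehouse" is just a view computed on demand, which is essentially what the middleware of a virtual warehouse provides.

import sqlite3

conn = sqlite3.connect(":memory:")
# Operational (source-layer) table; the schema is hypothetical.
conn.execute("CREATE TABLE orders (city TEXT, quarter TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [("Delhi", "Q1", "phone", 605.0), ("Delhi", "Q2", "phone", 680.0)])

# The 'virtual warehouse': a multidimensional summary defined as a view,
# computed from operational data on demand rather than stored separately.
conn.execute("""CREATE VIEW sales_by_item_quarter AS
                SELECT city, quarter, item, SUM(amount) AS total_sales
                FROM orders GROUP BY city, quarter, item""")
for row in conn.execute("SELECT * FROM sales_by_item_quarter"):
    print(row)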
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing
multiple source systems), the reconciled layer, and the data
warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and the data
warehouse.
2. Hardware integration: Once the hardware and software have been selected,
they need to be put together by integrating the servers, the storage devices, and the
end-user software tools.
6. ETL: The data from the source systems will need to go through an ETL
(Extract, Transform, Load) phase. Designing and implementing the ETL phase may
involve selecting suitable ETL tool vendors, purchasing and implementing the tools,
and customizing them to suit the needs of the enterprise. (A minimal sketch of this
phase appears after this list.)
7. Populate the data warehouses: Once the ETL tools have been agreed upon,
they will need to be tested, perhaps using a staging area. Once everything
is working adequately, the ETL tools may be used to populate the
warehouses according to the schema and view definitions.
9. Roll-out the warehouses and applications: Once the data warehouse has
been populated and the end-client applications tested, the warehouse system
and the applications may be rolled out for the user community to use.
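Below is a minimal sketch of the ETL phase from step 6, using only the Python standard library. The file name, column names, and target schema are hypothetical; a real ETL tool would add scheduling, logging, error handling, and far richer transformations.

import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (item TEXT, quarter TEXT, amount REAL)")

# Extract: read raw records from a source system export (hypothetical file).
with open("source_sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Transform: drop records with missing amounts and normalize item names.
        if not row["amount"]:
            continue
        item = row["item"].strip().lower()
        # Load: insert the cleaned record into the warehouse fact table.
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     (item, row["quarter"], float(row["amount"])))
conn.commit()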
Implementation Guidelines
4. Ensure quality: Only data that has been cleansed and is of a quality accepted
by the organization should be loaded into the data warehouses.
Characteristics of Data Warehouse:
1. Subject Oriented:
A data warehouse is organized around the major subjects of the enterprise (such as
customers, products, and sales) rather than around its day-to-day operations.
2. Time-Variant:
The data present in the data warehouse provides information for a
specific period.
3. Integrated:
A data warehouse is constructed by integrating data from multiple heterogeneous
sources, using consistent naming conventions, formats, and encodings.
4. Non-Volatile:
Once data enters the data warehouse, it is not updated or deleted; the warehouse
is loaded and then read for analysis.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported
process of analyzing huge sets of data that have either been compiled by
computer systems or downloaded into the computer. In the data
mining process, the computer analyzes the data and extracts useful information
from it. It looks for hidden patterns within the data set and tries to predict
future behavior. Data mining is primarily used to discover and indicate
relationships among data sets.
Data mining can predict market trends, which helps a business make
decisions. For example, it can predict who is likely to purchase what type of
products.
Data mining methods can help find which cellular phone calls, insurance
claims, or credit and debit card purchases are likely to be fraudulent.
Data mining techniques are also widely used to help model financial markets.
Data Mining vs. Data Warehouse:

1. Data Mining: One of the most notable data mining techniques is the detection
   and identification of unwanted errors that occur in a system.
   Data Warehouse: One advantage of the data warehouse is its ability to be
   updated frequently, which is why it is ideal for business owners who want to
   stay up to date.

2. Data Mining: Data mining techniques are cost-efficient compared to other
   statistical data applications.
   Data Warehouse: The responsibility of the data warehouse is to simplify every
   type of business data.

3. Data Mining: Data mining techniques are not 100 percent accurate and may
   lead to serious consequences in certain conditions.
   Data Warehouse: In the data warehouse, there is a high possibility that the
   data required for analysis by the company may not be integrated into the
   warehouse, which can simply lead to loss of data.

4. Data Mining: Companies can benefit from this analytical tool by equipping
   themselves with suitable and accessible knowledge-based data.
   Data Warehouse: A data warehouse stores a huge amount of historical data that
   helps users analyze different periods and trends to make future predictions.
Data generalization
In the data cube approach, computation and results are stored in the data cube.
These materialized views can then be used for decision support, knowledge
discovery, and many other applications.
On the other hand, the attribute-oriented induction approach is, at least in its
initial proposal, a relational database query-oriented, generalization-based,
online data analysis technique.
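The following is a tiny Python sketch of the attribute-oriented induction idea: replace low-level attribute values with higher-level concepts from a concept hierarchy, then merge identical generalized tuples and keep a count. The hierarchy and tuples are hypothetical.

from collections import Counter

# Hypothetical concept hierarchy: city -> country.
city_hierarchy = {"Delhi": "India", "Mumbai": "India", "Chicago": "USA"}

# Task-relevant tuples at the lowest conceptual level: (city, item).
tuples = [("Delhi", "phone"), ("Mumbai", "phone"), ("Chicago", "phone"),
          ("Delhi", "computer")]

# Generalize the city attribute one level up, then merge identical tuples.
generalized = Counter((city_hierarchy.get(city, "other"), item)
                      for city, item in tuples)
for (country, item), count in generalized.items():
    print(country, item, "count =", count)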
Road Map
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. The algorithm
uses a breadth-first search and a hash tree to count candidate itemsets efficiently.
It is mainly used for market basket analysis and helps identify the
products that are likely to be bought together. It can also be used in the
healthcare field to find drug reactions in patients.
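Below is a minimal, unoptimized Python sketch of Apriori-style frequent itemset mining. It performs the level-wise (breadth-first) candidate generation described above but, for brevity, omits the subset-pruning step and the hash-tree counting structure; the sample transactions are hypothetical.

def apriori(transactions, min_support):
    """Return a dict mapping frequent itemsets (frozensets) to their support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # Count each candidate's support with a scan over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-item candidates for the next level.
        prev = list(survivors)
        candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    return frequent

transactions = [frozenset(t) for t in ({"milk", "bread"},
                                       {"milk", "diaper", "beer"},
                                       {"bread", "diaper", "beer"},
                                       {"milk", "bread", "diaper"})]
print(apriori(transactions, min_support=0.5))

With min_support = 0.5, this reports the frequent single items and item pairs for the four sample transactions.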
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. It uses a
depth-first search technique to find frequent itemsets in a transaction
database, and it typically executes faster than the Apriori algorithm.
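Here is a compact Python sketch of the Eclat idea: the database is first converted to a vertical format (item -> set of transaction ids), and the depth-first search then extends an itemset by intersecting tidsets within its equivalence class. The transactions are the same hypothetical ones as in the Apriori sketch, so both sketches report the same frequent itemsets (as counts rather than ratios).

from collections import defaultdict

def eclat(items, min_count, out, prefix=frozenset()):
    """DFS over (item, tidset) pairs; support of an itemset = size of its tidset."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Equivalence class of `itemset`: extensions with intersected tidsets.
        suffix = [(other, tids & otids) for other, otids in items[i + 1:]
                  if len(tids & otids) >= min_count]
        if suffix:
            eclat(suffix, min_count, out, itemset)
    return out

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"bread", "diaper", "beer"}, {"milk", "bread", "diaper"}]

# Vertical format: item -> set of ids of the transactions containing it.
vertical = defaultdict(set)
for tid, t in enumerate(transactions):
    for item in t:
        vertical[item].add(tid)

min_count = 2
frequent_singles = [(item, tids) for item, tids in vertical.items()
                    if len(tids) >= min_count]
print(eclat(frequent_singles, min_count, {}))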
The support-confidence framework can be supplemented with a correlation
measure, leading to correlation rules of the form A => B [support, confidence,
correlation]. That is, a correlation rule is measured not only by its support and
confidence but also by the correlation between itemsets A and B. There are many
different correlation measures from which to choose. In this section, we study
various correlation measures to determine which would be good for mining large
data sets.
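One common such measure is lift. For itemsets A and B,

lift(A, B) = P(A U B) / (P(A) * P(B))

A lift value greater than 1 indicates that the occurrences of A and B are positively correlated, a value less than 1 indicates negative correlation, and a value of exactly 1 indicates that A and B are independent.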
Constraint-Based Association Mining
A data mining process may uncover thousands of rules from a given set of data,
most of which end up being unrelated or uninteresting to the users. Often, users
have a good sense of which
"direction" of mining may lead to interesting patterns and the "form" of the
patterns or rules they would like to find. Thus, a good heuristic is to have the
users specify such intuition or expectations as constraints to confine the search
space. This strategy is known as constraint-based mining. The constraints can
include the following:
“How are metarules useful?” Metarules allow users to specify the syntactic
form of rules that they are interested in mining. The rule forms can be used as
constraints to help improve the efficiency of the mining process. Metarules
may be based on the analyst’s experience, expectations, or intuition regarding
the data or may be automatically generated based on the database schema.
For example, a user may specify a metarule of the form

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”)

where P1 and P2 are predicate variables that are instantiated to attributes from
the given database during the mining process, X is a variable representing a
customer, and Y and W take on values of the attributes assigned to P1 and P2,
respectively.
Our association mining query is to “Find the sales of which cheap items (where
the sum of the prices is less than $100) may promote the sales of which
expensive items (where the minimum price is $500) of the same group for
Chicago customers in 2004.” This can be expressed in the DMQL data mining
query language as follows: