INFO408 Database
With the aid of a diagram, describe the architecture of a data warehouse (identify the components in the
architecture and the flows in the architecture) [15 marks]
A data warehouse architecture is based on a relational database management system (RDBMS) server that functions as the
central repository for informational data. In this architecture, operational data and processing are kept
separate from data warehouse processing. This central information repository is surrounded by several key
components designed to make the entire environment functional, manageable, and accessible by both the
operational systems that source data into the warehouse and by the end-user query and analysis tools.
Usually, a data warehouse adopts a three-tier architecture. The three tiers are described below.
Bottom Tier: The bottom tier of the architecture is the data warehouse database server, which is almost always a
relational database system. Back-end tools and utilities feed data into the bottom tier. These
back-end tools and utilities perform the extract, clean, load, and refresh functions.
Middle Tier: The middle tier of a data warehouse holds the OLAP server, which is implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model. ROLAP is an extended relational database
management system that maps operations on multidimensional data to standard relational operations. MOLAP
directly implements the multidimensional data and operations.
Top Tier: This tier represents the front-end client layer. This layer holds the query and reporting tools,
analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse:
[Diagram: three-tier data warehouse architecture, tiers numbered 1 to 3. Sources: operational system, external data, flat files. Staging area: ETL tools. Data warehouse: metadata, summary data, raw data, with data marts. Data presentation: reporting tools, analysis tools, data mining tools.]
Data Warehouse Components
In the architecture outlined above, some components appear in every data warehouse design, while
others depend on the number of tiers.
ETL Tools
ETL stands for Extract, Transform, and Load. The staging layer uses ETL tools to extract the
needed data from sources in various formats, transform it and check its quality, and then load it
into the data warehouse.
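The three ETL steps can be illustrated with a minimal Python sketch. The source rows, the field names (id, dept, amount) and the quality rule (the amount must be numeric) are all hypothetical and chosen only for illustration; a real ETL tool does far more.

```python
# Hypothetical raw rows from two source systems; field names are illustrative.
operational_rows = [
    {"id": "1", "dept": " Sales ", "amount": "100.50"},
    {"id": "2", "dept": "HR", "amount": "n/a"},  # will fail the quality check
]
flat_file_rows = [
    {"id": "3", "dept": "sales", "amount": "20"},
]

def extract(*sources):
    """Extract: pull rows from every source into one stream."""
    for source in sources:
        yield from source

def transform(rows):
    """Transform: normalise formats and drop rows that fail the quality check."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount is not numeric
        yield {
            "id": int(row["id"]),
            "dept": row["dept"].strip().lower(),
            "amount": amount,
        }

def load(rows, warehouse):
    """Load: append the cleaned rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(operational_rows, flat_file_rows)), [])
```

Here the row with a non-numeric amount is rejected during the transform step, so only the two clean rows reach the warehouse.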
The Database
The most crucial component and the heart of each architecture is the database. The warehouse is
where the data is stored and accessed.
Data
Once the system cleans and organizes the data, it stores it in the data warehouse. The data
warehouse represents the central repository that stores metadata, summary data, and raw data
coming from each source.
Metadata is the information that defines the data. Its primary role is to simplify working with
data instances. It allows data analysts to classify, locate, and direct queries to the required data.
Summary data is generated by the warehouse manager. It updates as new data loads into the
warehouse. This component can include lightly or highly summarized data. Its main role is to
speed up query performance.
Raw data is the actual data loaded into the repository, which has not yet been processed. Having the
data in its raw form makes it accessible for further processing and analysis.
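As a rough illustration of how summary data relates to raw data, the sketch below pre-aggregates a few hypothetical sales rows at two levels of detail. The field names and figures are invented for the example, and the summarise function is an assumed helper, not part of any warehouse product.

```python
from collections import defaultdict

# Hypothetical raw sales facts as loaded into the warehouse.
raw_data = [
    {"dept": "sales", "month": "2024-01", "amount": 100.0},
    {"dept": "sales", "month": "2024-01", "amount": 50.0},
    {"dept": "sales", "month": "2024-02", "amount": 25.0},
    {"dept": "hr", "month": "2024-01", "amount": 10.0},
]

def summarise(rows, keys):
    """Total the amount for each distinct combination of the given keys."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)

# Lightly summarised: one total per department per month.
lightly_summarised = summarise(raw_data, ["dept", "month"])
# Highly summarised: one total per department.
highly_summarised = summarise(raw_data, ["dept"])
```

A query for departmental totals can then read the small highly summarised table instead of scanning every raw row, which is how summary data speeds up queries.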
Access Tools
Users interact with the gathered information through different tools and technologies. They can
analyze the data, gather insight, and create reports.
Some of the tools used include:
Reporting tools. They play a crucial role in understanding how your business is doing and what
should be done next. Reporting tools include visualizations such as graphs and charts showing
how data changes over time.
OLAP tools. Online analytical processing (OLAP) tools allow users to analyze multidimensional
data from multiple perspectives. These tools provide fast processing and valuable analysis. They
extract data from numerous relational data sets and reorganize it into a multidimensional format.
Data mining tools. These examine data sets in the warehouse to find patterns and the correlations
between them. Data mining also helps establish relationships when analyzing multidimensional
data.
Data Marts
Data marts allow you to serve multiple groups within the system by segmenting the data in the
warehouse into categories. Each data mart holds a partition of the data prepared for a particular user group.
For instance, you can use data marts to categorize information by departments within the
company.
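A minimal sketch of this partitioning, assuming hypothetical warehouse rows that carry a dept field, might look like this:

```python
from collections import defaultdict

# Hypothetical warehouse rows; the "dept" field decides which mart a row joins.
warehouse_rows = [
    {"dept": "sales", "item": "laptop", "amount": 900.0},
    {"dept": "finance", "item": "audit", "amount": 300.0},
    {"dept": "sales", "item": "mouse", "amount": 20.0},
]

def build_data_marts(rows, key="dept"):
    """Partition warehouse rows into one data mart per value of the key."""
    marts = defaultdict(list)
    for row in rows:
        marts[row[key]].append(row)
    return dict(marts)

data_marts = build_data_marts(warehouse_rows)
```

Each department then queries only its own, smaller partition, for example data_marts["sales"].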
b) Explain the reasons for creating a data mart from the data warehouse and describe the
architecture of the data mart. [5 marks]
A data mart is the access layer of a data warehouse that is used to provide users with data. Data
marts are often seen as small slices of the data warehouse. Data warehouses typically house
enterprise-wide data, and information stored in a data mart usually belongs to a specific
department or team.
The key objective for data marts is to provide the business user with the data that is most relevant,
in the shortest possible amount of time. This allows users to develop and follow a train of
thought, without needing to wait long periods for queries to complete. Data marts are designed to
meet the demands of a specific group and have a comparatively narrow subject area. However,
narrow in focus doesn’t necessarily mean small in size. Data marts may contain millions of
records and require gigabytes of storage.
QUESTION TWO
a) Define what data mining is and highlight the different styles of data mining that are
available. [5 marks]
Data mining is the process of extracting useful information or knowledge from very large amounts of data
(big data). The different styles of data mining that are available include:
association, classification, clustering analysis, prediction, sequential patterns (pattern
tracking), decision trees, outlier (anomaly) analysis, and neural networks.
b) With the aid of appropriate examples explain the following data mining algorithms:
i) Apriori Algorithm [8 marks]
The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets
and the relevant association rules. It is designed to operate on a database containing many
transactions, for instance, items bought by customers in a store.
It is very important for effective market basket analysis: it helps customers find and purchase
their items more easily, which increases sales. It has also been used in
healthcare to detect adverse drug reactions (ADRs). It produces association rules that
indicate which combinations of medications and patient characteristics lead to ADRs.
Another basic example is grocery shopping: the shop owner can use the Apriori algorithm to analyse
which items customers frequently purchase together. The shopkeeper can then arrange those
frequently bought together items on the same shelf so that they are easy for customers
to find and buy.
Basic principles on which the Apriori algorithm works:
• If an item set occurs frequently, then all the subsets of the item set also occur frequently.
• If an item set occurs infrequently, then all the supersets of the item set also occur
infrequently.
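The two principles above can be sketched as a small, unoptimised Python implementation. The apriori function, the min_count parameter (the minimum number of transactions an item set must appear in) and the sample baskets are assumptions made for illustration, not a reference implementation.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(item set): count} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # Count each candidate and keep only those that meet min_count.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # Start with the frequent single items.
    frequent = count_frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate with any infrequent subset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count_frequent(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical shopping baskets.
baskets = [
    {"sugar", "flour", "eggs"},
    {"sugar", "flour"},
    {"sugar", "eggs"},
    {"flour", "eggs", "milk"},
    {"sugar", "flour", "eggs"},
]
frequent_itemsets = apriori(baskets, min_count=3)
```

With these baskets, milk is infrequent, so no item set containing milk is ever generated as a candidate. That is the second principle at work.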
Applications of Apriori Algorithm
Detecting Adverse Drug Reactions
The Apriori algorithm is used for association analysis on healthcare data such as the drugs taken by
patients, the characteristics of each patient, the adverse effects patients experience, the initial diagnosis,
etc. This analysis produces association rules that help identify the combinations of patient
characteristics and medications that lead to adverse side effects of the drugs.
Market Basket Analysis
Many e-commerce giants like Amazon use Apriori to draw data insights on which products are
likely to be purchased together and which are most responsive to promotion. For example, a
retailer might use Apriori to predict that people who buy sugar and flour are likely to buy eggs to
bake a cake.
Auto-Complete Applications
Auto-complete, as in Google search, is another commonly cited application of Apriori: when the user types a
word, the search engine looks for other associated words that people usually type after that
word.
ii) Frequent Pattern Tree Algorithm [7 marks]
The FP-tree (Frequent Pattern tree) is the data structure used by the FP-growth algorithm for mining
frequent itemsets from a database. FP-growth is an alternative to Apriori-like
algorithms that avoids generating candidate itemsets. The FP-tree is a compact tree structure for storing frequent
patterns.
The algorithm is designed to operate on databases containing transactions, such as customers'
purchase histories on the Amazon website. An item is considered 'frequent' when it appears in at least
a minimum number of transactions (the minimum support). Transactions that contain the same frequent
items share the same branch of the tree, and the branch splits at the point where they differ.
Each node holds a single item together with a count of the transactions that pass through it, and the
links between nodes that hold the same item are called node-links.
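The construction of the tree can be sketched in Python as follows. This is a simplified illustration that builds the tree only and does not mine it. The FPNode class, the build_fp_tree function, the min_count parameter and the sample purchases are all hypothetical, and the node-links are kept as a plain list per item rather than a chain of pointers.

```python
from collections import Counter

class FPNode:
    """One tree node: an item, a count, and child nodes keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions contain each item.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    root = FPNode(None)
    # Header table: for each frequent item, the nodes that hold it (node-links).
    header = {item: [] for item in frequent}

    # Pass 2: insert each transaction, most frequent items first, so that
    # transactions with common frequent items share a path from the root.
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical purchase histories.
purchases = [
    ["chicken", "onion", "rice"],
    ["chicken", "onion"],
    ["chicken", "bread"],
    ["onion", "bread"],
]
root, header = build_fp_tree(purchases, min_count=2)
```

In this example the first two purchases share the path chicken, onion; the third branches off after chicken; and rice is dropped because it appears only once.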
For example, a supermarket sees 200 customers on Friday evening. Out of the 200
customers, 100 bought chicken, and out of the 100 customers who bought chicken, 50 also
bought onions. Thus, the association rule would be: if customers buy chicken, then they buy onions
too, with a support of 50/200 = 25% and a confidence of 50/100 = 50%.
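The arithmetic of this example can be checked with a few lines of Python. The variable names are chosen here for illustration; the figures are the ones given in the example.

```python
total_customers = 200           # all customers on Friday evening
bought_chicken = 100            # customers who bought chicken
bought_chicken_and_onion = 50   # customers who bought both chicken and onions

# Support: share of ALL transactions that contain both items.
support = bought_chicken_and_onion / total_customers
# Confidence: share of the chicken buyers who also bought onions.
confidence = bought_chicken_and_onion / bought_chicken

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Support is measured against all transactions, while confidence is measured only against the transactions that satisfy the "if" part of the rule.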
Another example: in market basket analysis, if the minimum support threshold is 30% (three transactions
out of ten, for example) and bread appears together with eggs and milk at least three times, then
bread, eggs and milk form a frequent itemset.
Frequent pattern mining can be used in a variety of real-world applications. It can be used in
supermarkets for selling, product placement on shelves, and promotion rules, and in text
searching. It can be used in wireless sensor networks, especially in smart homes with sensors
attached to the human body or to household objects, and in other applications that require careful
monitoring of user environments that are subject to critical conditions or hazards such as gas leaks,
fire and explosions. Frequent patterns can also be used to monitor the activities of dementia
patients. This makes frequent pattern mining an important approach for monitoring activities of daily
life in smart environments and for tracking functional decline among dementia patients.