Dmda Mid 1
LAQS
Q.1 Describe the 3-tier Architecture of the Data Warehouse with a neat
sketch
Three-Tier Data Warehouse Architecture
The Three-Tier Data Warehouse Architecture is the most commonly used Data
Warehouse design. It brings together the required Data Warehouse schema model,
the required OLAP server type, and the required front-end tools for reporting or
analysis. As the name suggests, it consists of three tiers, the Bottom Tier, the
Middle Tier and the Top Tier, which are linked in sequence from the Bottom Tier
(data sources and repository), through the Middle Tier (OLAP servers), to the
Top Tier (front-end tools).
Data Warehouse Architecture is the design on which a Data Warehouse is built. It
accommodates the desired type of Data Warehouse schema, user interface
application and database management system for data organization and repository
structure. The type of architecture is chosen based on the requirements provided
by the project team. The three-tier Data Warehouse Architecture is the most
common choice because of the level of detail in its structure. The three tiers
are termed:
Top-Tier
Middle-Tier
Bottom-Tier
Each tier can have different components, based on the requirements set by the
project's decision-makers, but the components must fit the role of their
respective tier.
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of the
Data Repository. The Data Repository is the storage space for the data extracted
from various data sources, which undergoes a series of activities as a part of the
ETL process. ETL stands for Extract, Transform and Load. As a preliminary
step, before the data is loaded into the repository, all the relevant and
required data are identified across the various source systems. This data is
then cleaned to remove duplicate and junk records. The next step is to
transform the data into a single, consistent storage format. The final step of
ETL is to load the data into the repository.
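A minimal ETL sketch in Python is shown below, using pandas with SQLite standing in for the warehouse repository; the table, column and file names are illustrative assumptions only.

import sqlite3
import pandas as pd

# Extract: in practice this would come from source systems (pd.read_csv,
# a database query, an API); a small inline frame stands in for the raw data.
raw = pd.DataFrame({
    "order_id":   [1, 1, 2, None],   # note the duplicate row and the missing key
    "order_date": ["2023-01-05", "2023-01-05", "2023-01-09", "2023-01-10"],
    "amount":     [120.0, 120.0, 75.5, 30.0],
})

# Transform: drop duplicate/junk records and unify the storage format
clean = raw.drop_duplicates().dropna(subset=["order_id"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"])

# Load: write the cleaned data into the warehouse repository
with sqlite3.connect("warehouse.db") as conn:   # hypothetical repository file
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)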
The storage type of the repository can be a relational database management
system or a multidimensional database management system. A relational database
system can hold simple relational data, whereas a multidimensional database
system can hold data with more than one dimension. Whenever the Repository
includes both relational and multidimensional database management systems,
there exists a metadata unit.
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can
have more than one OLAP server, and it can have more than one type of OLAP
server model as well, which depends on the volume of the data to be processed
and the type of data held in the bottom tier. There are three types of OLAP server
models:
ROLAP (Relational Online Analytical Processing)
MOLAP (Multidimensional Online Analytical Processing)
HOLAP (Hybrid Online Analytical Processing)
The Middle Tier acts as an intermediary between the Top Tier and the Bottom Tier
(the data repository). From the user's standpoint, the Middle Tier presents a
conceptual view of the database.
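To illustrate the idea of an OLAP-style multidimensional view, the sketch below builds a pandas pivot table over a small, made-up fact table; a ROLAP server would express the same aggregation as SQL (GROUP BY) against the relational bottom tier.

import pandas as pd

# Hypothetical fact table: sales by product, region and quarter
sales = pd.DataFrame({
    "product": ["TV", "TV", "Phone", "Phone", "TV", "Phone"],
    "region":  ["East", "West", "East", "West", "East", "East"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [200, 150, 300, 250, 180, 320],
})

# A MOLAP-style view: aggregate the measure over two dimensions at once
cube = sales.pivot_table(index="product", columns="region",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)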
3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user to
connect with the database systems. This user interface is usually a tool or an API
call, which is used to fetch the required data for Reporting, Analysis, and Data
Mining purposes. The type of tool depends purely on the form of outcome
expected. It could be a Reporting tool, an Analysis tool, a Query tool or a Data
mining tool.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. The KDD process is iterative and may require multiple
passes through its steps to extract accurate knowledge from the data. The
following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data transformation tools.
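As a rough illustration of cleaning noisy data and detecting discrepancies (points 2 and 3 above), the sketch below smooths a made-up series with a rolling mean and flags out-of-range values; the readings and the plausible range are arbitrary assumptions.

import pandas as pd

# Hypothetical temperature readings with random noise and one discrepant value
readings = pd.DataFrame({"temp": [21.0, 21.5, 21.2, 95.0, 22.0, 22.5]})

# Noisy data: smooth random variance with a rolling (bin) mean
readings["smoothed"] = readings["temp"].rolling(window=3, center=True,
                                                min_periods=1).mean()

# Discrepancy detection: flag values outside a plausible range for review
readings["suspect"] = ~readings["temp"].between(-30, 60)
print(readings)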
Data Integration
Data integration is defined as combining heterogeneous data from multiple
sources into a common store (the Data Warehouse). Data integration is carried
out using Data Migration tools, Data Synchronization tools and the ETL
(Extract-Transform-Load) process.
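A small sketch of data integration, assuming two hypothetical sources (a CRM table and a billing table) that describe the same customers under different key names:

import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2], "total_spent": [5400, 980]})

# Integrate into a single, warehouse-style view by resolving the key names
combined = crm.merge(billing, left_on="cust_id", right_on="customer",
                     how="left").drop(columns="customer")
print(combined)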
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this, we can use Neural
networks, Decision Trees, Naive Bayes, Clustering, and Regression methods.
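A minimal sketch of data selection, assuming a hypothetical orders table in which only the East-region records and two attributes are relevant to the analysis:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region":   ["East", "West", "East", "West"],
    "amount":   [120, 75, 300, 50],
    "comments": ["ok", "late", "ok", "ok"],   # not relevant to this analysis
})

# Keep only the attributes and records relevant to the analysis task
relevant = orders.loc[orders["region"] == "East", ["order_id", "amount"]]
print(relevant)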
Data Transformation
Data Transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data Transformation is a
two-step process:
1. Data Mapping: Assigning elements from the source schema to the destination
schema to capture the required transformations.
2. Code generation: Creation of the actual transformation program.
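The sketch below illustrates both steps under stated assumptions: a hypothetical mapping dictionary plays the role of the data map, and the rename/astype calls stand in for the generated transformation program.

import pandas as pd

source = pd.DataFrame({"CUST_NM": ["Asha", "Ravi"], "AMT_INR": [5400, 980]})

# Data mapping: source elements assigned to the destination schema
mapping = {"CUST_NM": "customer_name", "AMT_INR": "amount"}

# "Code generation": the mapping drives the actual transformation step
destination = source.rename(columns=mapping)
destination["amount"] = destination["amount"].astype(float)
print(destination)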
Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model, using classification or characterization.
Pattern Evaluation
Pattern Evaluation is defined as identifying interesting patterns representing
knowledge, based on given interestingness measures. It computes an
interestingness score for each pattern and uses summarization and visualization
to make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Note: KDD is an iterative process where evaluation measures can be enhanced,
mining can be refined, and new data can be integrated and transformed to get
different and more appropriate results. Preprocessing of databases consists
of Data cleaning and Data Integration.
Q.3: Explain the various Data Mining tasks with appropriate examples.
1. Anomaly Detection:
Task Explanation: Anomaly detection involves identifying data
points that deviate significantly from the norm or expected
behaviour.
Example: In network security, abnormal spikes in network traffic
might indicate a potential security breach or a denial-of-service
attack. Anomaly detection algorithms can flag these unusual patterns
for further investigation.
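A sketch of the network-traffic example using scikit-learn's Isolation Forest; the traffic figures and contamination level are made-up assumptions.

from sklearn.ensemble import IsolationForest
import numpy as np

# Hypothetical hourly traffic volumes (MB); the last value is an abnormal spike
traffic = np.array([[110], [120], [115], [118], [122], [950]])

# Isolation Forest labels easily isolated points as anomalies:
# -1 marks an anomaly, 1 marks normal traffic
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(traffic)
print(labels)   # the spike should be labelled -1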
2. Association Rule Learning:
Task Explanation: Association rule learning aims to discover
interesting relationships or patterns in large datasets.
Example: In a retail setting, if customers who buy sunscreen also
tend to purchase beach towels, a store can use this association to
optimize product placements or create targeted promotions for those
items.
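A small sketch of the sunscreen/beach-towel rule, computing the usual interestingness measures (support, confidence, lift) by hand over a made-up basket table:

import pandas as pd

# One-hot encoded market-basket data (each row is a transaction)
baskets = pd.DataFrame({
    "sunscreen":   [1, 1, 1, 0, 1],
    "beach_towel": [1, 1, 0, 0, 1],
    "milk":        [0, 1, 0, 1, 0],
}).astype(bool)

# Measures for the rule {sunscreen} -> {beach_towel}
support    = (baskets["sunscreen"] & baskets["beach_towel"]).mean()
confidence = support / baskets["sunscreen"].mean()
lift       = confidence / baskets["beach_towel"].mean()
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")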
3. Clustering:
Task Explanation: Clustering involves grouping similar data points
based on certain characteristics, without predefined categories.
Example: In marketing, clustering can be applied to group
customers with similar purchasing behaviour. This can help
businesses tailor marketing strategies to each cluster's preferences,
increasing the effectiveness of targeted campaigns.
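A minimal clustering sketch with scikit-learn's KMeans over made-up customer features (annual spend, number of purchases):

from sklearn.cluster import KMeans
import numpy as np

customers = np.array([[200, 2], [250, 3], [220, 2],
                      [5000, 40], [5200, 45], [4800, 38]])

# Group customers into two segments without predefined labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # e.g. low-spend vs high-spend segments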
4. Classification:
Task Explanation: Classification involves training a model to
categorize new data points into predefined classes based on existing
labelled data.
Example: In healthcare, a classification model can be trained to
predict whether a patient is likely to develop a specific medical
condition based on features such as age, family history, and lifestyle
choices.
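A sketch of the healthcare example using a decision tree classifier; the features and labels below are tiny illustrative values, not real patient data.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled patients: [age, family_history (0/1), smoker (0/1)]
X = [[25, 0, 0], [60, 1, 1], [45, 1, 0], [35, 0, 1], [70, 1, 1], [30, 0, 0]]
y = [0, 1, 1, 0, 1, 0]   # 1 = developed the condition, 0 = did not

# Train on labelled data, then classify a new, unseen patient
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[55, 1, 0]]))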
5. Regression:
Task Explanation: Regression aims to find the relationship between
variables and predict a continuous outcome.
Example: In finance, a regression model can be used to predict the
future value of a stock based on various factors such as historical
prices, market trends, and economic indicators.
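A minimal regression sketch with scikit-learn's LinearRegression over made-up historical prices and a market index; the numbers are purely illustrative.

from sklearn.linear_model import LinearRegression

# Hypothetical features: [previous close, market index level]
X = [[100, 5000], [102, 5050], [101, 5020], [105, 5100], [107, 5150]]
y = [102, 101, 105, 107, 110]    # next-day price (continuous target)

reg = LinearRegression().fit(X, y)
print(reg.predict([[108, 5180]]))   # predicted next-day price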
6. Summarization:
Task Explanation: Summarization involves presenting a condensed
version or visualization of the data to highlight key patterns or
trends.
Example: In social media analytics, summarization might include
generating visualizations that show the overall sentiment of user
comments over time or creating reports that highlight the most
engaging posts based on likes and shares.
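A short summarization sketch with pandas, condensing made-up post-level data into weekly sentiment and engagement figures:

import pandas as pd

posts = pd.DataFrame({
    "week":      [1, 1, 2, 2, 3, 3],
    "sentiment": [0.2, 0.5, -0.1, 0.3, 0.6, 0.4],
    "likes":     [10, 40, 5, 25, 80, 60],
})

# Condense raw post-level data into weekly key figures
summary = posts.groupby("week").agg(avg_sentiment=("sentiment", "mean"),
                                    total_likes=("likes", "sum"))
print(summary)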
These data mining tasks collectively empower organizations to extract
meaningful insights, make informed decisions, and discover hidden patterns
within their data, leading to improved efficiency and strategic decision-
making.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format. It plays a crucial role in ensuring the quality of
the data and the accuracy of the analysis results. The specific steps involved
in data preprocessing may vary depending on the nature of the data and the
analysis goals, but by performing the steps below, the data mining process
becomes more efficient and the results more accurate.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the
missing values manually, with the attribute mean, or with the most probable
value (both strategies are sketched below).
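Both strategies, ignoring tuples and filling missing values, are sketched below on a made-up table; the mean is used for a numeric attribute and the mode (most probable value) for a categorical one.

import pandas as pd

patients = pd.DataFrame({"age": [25, None, 47, 52, None],
                         "city": ["Pune", "Delhi", None, "Delhi", "Delhi"]})

# Option 1: ignore (drop) tuples with missing values - reasonable only
# when the dataset is large and little information is lost
dropped = patients.dropna()

# Option 2: fill missing values, e.g. numeric attributes with the mean
# and categorical attributes with the most probable (modal) value
filled = patients.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
print(filled)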
2. Data Transformation:
This step is taken to transform the data into appropriate forms suitable for
the mining process. This involves the following ways:
1. Normalization:
It is done to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to
1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attributes with interval
levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
(A combined sketch of normalization, discretization and concept hierarchy
generation follows this list.)
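A combined sketch of normalization (min-max scaling to 0.0-1.0), discretization and concept hierarchy generation, using a made-up table with an income attribute and a city attribute:

import pandas as pd

data = pd.DataFrame({"income": [20000, 35000, 50000, 80000, 120000],
                     "city":   ["Pune", "Delhi", "Mumbai", "Pune", "Delhi"]})

# Normalization: min-max scaling into the range 0.0 to 1.0
rng = data["income"].max() - data["income"].min()
data["income_scaled"] = (data["income"] - data["income"].min()) / rng

# Discretization: replace raw numeric values with interval/conceptual levels
data["income_band"] = pd.cut(data["income"], bins=3,
                             labels=["low", "medium", "high"])

# Concept hierarchy generation: roll a lower-level attribute (city)
# up to a higher level (country) via a lookup
city_to_country = {"Pune": "India", "Delhi": "India", "Mumbai": "India"}
data["country"] = data["city"].map(city_to_country)
print(data)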
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This
is done to improve the efficiency of data analysis and to avoid overfitting of the
model. Some common steps involved in data reduction are: