DWM - Viva and Short Question Answers
A data warehouse is a repository of data used for management decision support
systems. It consists of a wide variety of data presenting a high-level picture of business
conditions at a single point in time.
In a single sentence, it is a repository of integrated information, available for queries
and analysis.
Business Intelligence, also known as DSS (Decision Support System), refers to the
technologies, applications, and practices for the collection, integration, and analysis of
business-related information or data. It also helps in presenting and exploring the information itself.
A dimension table is a table which contains the attributes of the measurements stored in fact
tables. This table holds hierarchies, categories, and logic that can be used to traverse the
dimension's nodes.
A fact table contains the measurements of business processes, along with foreign keys to the
dimension tables.
For example, the average number of bricks produced by one person/machine is a measure of the
business process.
Data mining is the process of analyzing data from different dimensions or perspectives
and summarizing it into useful information. The data can then be queried and retrieved from the
database in the desired format.
7. What is OLTP?
OLTP is abbreviated as Online Transaction Processing, and it is a system that manages
transaction-oriented applications, where data is inserted, updated, and deleted in real time
as part of day-to-day operations.
8. What is OLAP?
OLAP is abbreviated as Online Analytical Processing, and it is a system which collects,
manages, and processes multi-dimensional data for analysis and management purposes.
9. What is the difference between OLTP and OLAP?
OLTP - Data is from the original data source.
OLAP - Data is from various data sources.
ODS is abbreviated as Operational Data Store, and it is a repository of real-time operational
data rather than long-term trend data.
A view is a virtual table which takes the output of a query and can be used in
place of tables.
A materialized view provides indirect access to table data by storing the results of a
query in a separate schema.
ETL is abbreviated as Extract, Transform and Load. ETL software reads the
data from a specified data source and extracts a desired subset of it. Next, it transforms the
data using rules and lookup tables to convert it to the desired state.
Then, the load function writes the resulting data to the target database.
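The three ETL stages can be sketched in plain Python. The source rows, field names, and the lookup table below are invented for illustration; a real ETL tool (Informatica, DataStage, etc.) would read from databases or files rather than in-memory lists.

```python
def extract(source_rows):
    """Extract: read rows and keep only the desired subset (sale records)."""
    return [r for r in source_rows if r["type"] == "sale"]

def transform(rows, region_lookup):
    """Transform: apply rules and lookup tables to reach the desired state."""
    out = []
    for r in rows:
        out.append({
            "product": r["product"].strip().upper(),    # cleansing rule
            "region": region_lookup[r["region_code"]],  # lookup table
            "amount": round(float(r["amount"]), 2),     # type conversion
        })
    return out

def load(rows, target_table):
    """Load: append the resulting rows to the target 'database' (a list here)."""
    target_table.extend(rows)

source = [
    {"type": "sale", "product": " widget ", "region_code": "N", "amount": "19.5"},
    {"type": "return", "product": "widget", "region_code": "S", "amount": "19.5"},
]
lookup = {"N": "North", "S": "South"}
warehouse = []
load(transform(extract(source), lookup), warehouse)
print(warehouse)  # [{'product': 'WIDGET', 'region': 'North', 'amount': 19.5}]
```

The return row is filtered out at the extract stage, and the remaining sale row is cleansed, looked up, and converted before being loaded.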
Real-time data warehousing captures business data as it occurs. As soon as a
business activity completes, the data flows through and becomes available for
use instantly.
Aggregate tables contain existing warehouse data that has been
grouped to a certain level of the dimensions. It is easier to retrieve data from aggregated tables
than from the original table, which has far more records.
These tables reduce the load on the database server and improve query performance.
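Such grouping can be sketched as follows: detailed fact rows are rolled up to a month/product level so later queries touch far fewer rows. The row layout is invented for illustration.

```python
from collections import defaultdict

# Detailed fact rows: (date, product, quantity sold)
fact_rows = [
    ("2024-01-05", "widget", 10),
    ("2024-01-20", "widget", 15),
    ("2024-02-02", "gadget", 7),
]

# Build the aggregate table: group to (month, product) and sum the measure.
aggregate = defaultdict(int)
for date, product, qty in fact_rows:
    month = date[:7]                    # roll the date up to month level
    aggregate[(month, product)] += qty  # grouped measure

print(dict(aggregate))  # {('2024-01', 'widget'): 25, ('2024-02', 'gadget'): 7}
```

A query for January widget sales now reads one aggregated row instead of scanning every detailed fact row.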
A factless fact table is a fact table which does not contain any numeric fact columns.
Time dimensions are usually loaded with all possible dates in a year, which can be done
through a program. A hundred years can be represented with one row per day.
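A minimal sketch of such a loading program, generating one time-dimension row per day; the column names (`time_key`, etc.) are illustrative, not a fixed standard.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Generate one row per day from start to end, inclusive."""
    rows, day, key = [], start, 1
    while day <= end:
        rows.append({
            "time_key": key,            # surrogate key for the day
            "date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "day": day.day,
        })
        day += timedelta(days=1)
        key += 1
    return rows

rows = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(rows))  # 366 rows, since 2024 is a leap year
```

Running the same loop over a hundred-year range yields roughly 36,500 rows, which is still a small table.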
Non-additive facts are facts that cannot be summed up over any of the dimensions
present in the fact table. If there are changes in the dimensions, the same facts can still be useful.
A conformed fact is a fact with a standardized definition, so that it can be used across multiple
data marts in combination with multiple fact tables.
An active data warehouse is a data warehouse that enables decision makers within a company or
organization to manage customer relationships effectively and efficiently.
22. What is the difference between Data Warehouse and OLAP?
A data warehouse is the place where the entire data is stored for analysis, whereas OLAP is used
for analyzing that data, managing aggregations, and partitioning information into finer levels
of detail.
24. What are the key columns in Fact and dimension tables?
Foreign keys of dimension tables are the primary keys of entity tables. Foreign keys of fact
tables are the primary keys of the dimension tables.
SCD stands for slowly changing dimensions, and it applies to cases where a record changes
over time.
A BUS schema consists of a suite of conformed dimensions and standardized definitions of
facts.
A star schema is a way of organizing the tables such that results can be
retrieved quickly from the database in a data warehouse environment.
A snowflake schema has a primary dimension table to which one or more further dimension
tables can be joined. The primary dimension table is the only table that can be joined with the
fact table.
30. What is a core dimension?
A core dimension is a dimension table that is dedicated to a single fact table
or data mart.
As the name implies, data cleaning is largely self-explanatory: it covers the cleaning of orphan
records, data breaching business rules, inconsistent data, and missing information in a database.
Metadata is defined as data about data. Metadata contains information such as the number of
columns used, fixed width and limited width, the ordering of fields, and the data types of the
fields.
In data warehousing, loops may exist between tables. If there is a loop between tables,
query generation will take more time and create ambiguity. It is advised to avoid loops
between tables.
Yes, dimension tables can have numeric values, as they are the descriptive elements of the
business.
Cubes are logical representations of multidimensional data. The edges of the cube hold the
dimension members, and the body of the cube contains the data values.
Dimensional modeling is a concept used by data warehouse designers to build their
own data warehouse. The model is stored in two types of tables: fact and dimension
tables.
The fact table holds the facts and measurements of the business, and the dimension table holds
the context of those measurements.
There are three types of Dimensional Modeling and they are as follows:
● Conceptual Modeling
● Logical Modeling
● Physical Modeling
A surrogate key is a substitute for the natural primary key. It is a unique
identifier for each row that can be used as the primary key of a table.
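As a sketch, surrogate keys can be assigned in place of a natural key (a customer email here, an invented example); the running counter plays the role of a database sequence.

```python
from itertools import count

def add_surrogate_keys(records, key_name="customer_sk"):
    """Assign a sequential surrogate key to each record."""
    seq = count(start=1)  # stand-in for a database sequence
    return [{key_name: next(seq), **r} for r in records]

customers = [{"email": "a@x.com"}, {"email": "b@x.com"}]
keyed = add_surrogate_keys(customers)
print(keyed)
# [{'customer_sk': 1, 'email': 'a@x.com'}, {'customer_sk': 2, 'email': 'b@x.com'}]
```

The natural key (email) is kept as an ordinary attribute, while joins to fact tables would use the compact surrogate key.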
ER modeling has logical and physical models, but dimensional modeling has only a
physical model.
ER modeling is used for normalizing the OLTP database design, whereas dimensional modeling
is used for de-normalizing the ROLAP and MOLAP design.
A partial backup is any backup short of a full backup, and it can be taken while the
database is open or shut down.
The goal of the optimizer is to find the most efficient way to execute SQL statements.
An execution plan is the plan used by the optimizer to select the combination of steps.
48. What are the approaches used by Optimizer during execution plan?
1. Rule Based
2. Cost Based
49. What are the tools available for ETL?
● Informatica
● DataStage
● Oracle Warehouse Builder
● Ab Initio
● Data Junction
Metadata is defined as data about data, whereas a data dictionary contains information about the
project, graphs, Ab Initio commands, and server information.
OLAP (On-line Analytical Processing) provides a very good view of what is
happening, but it cannot predict what will happen in the future or why it is happening, whereas
data mining is a group of techniques that find relationships that have not previously been
discovered.
53. What are the types of tasks that are carried out during data mining ?
● Prediction tasks - use some variables to predict unknown or future values of other variables,
e.g. Regression [Predictive]
● Description tasks - find human-interpretable patterns that describe the data, e.g. Clustering
[Descriptive]
● Missing values
● Inconsistent data
2. Attribute/feature construction - New attributes are constructed and added to the tuple
Multi-feature cubes compute complex queries involving multiple dependent
aggregates at multiple granularities. These cubes are very useful in practice. Many complex
data mining queries can be answered by multi-feature cubes without any significant
increase in computational cost, in comparison to cube computation for simple queries with
standard data cubes.
The dimension tables of the snowflake schema model may be kept in normalized
form to reduce redundancies. Such tables are easy to maintain and save storage space.
In data transformation, the data are transformed or consolidated into forms appropriate
for mining.
The slice operation performs a selection on one dimension of the cube, resulting in
a subcube.
There are four key characteristics which separate the data warehouse from
other major operational systems: it is subject-oriented, integrated, time-variant, and
non-volatile.
Handling of relational and complex types of data: because relational databases and
data warehouses are widely used, the development of efficient and effective data mining
systems for the analysis of such data is important.
Local- and wide-area computer networks (such as the Internet) connect many sources of
data, forming huge, distributed, and heterogeneous databases.
A fact table contains the names of the facts (or measures) as well as keys to each of the related
dimension tables. A dimension table is used for describing a dimension; e.g., a dimension
table for item may contain the attributes item_name, brand, and type.
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional tables.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
67. How is a data warehouse different from a database? How are they similar?
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of
tuples (records or rows); each tuple represents an object identified by a unique key and
described by a set of attribute values. Both are used to store and manipulate data.
Descriptive data mining describes data in a concise and summarative manner and
presents interesting general properties of the data.
Predictive data mining analyzes data in order to construct one or a set of models
and attempts to predict the behavior of new data sets.
Data Mining: Data mining is the process of analyzing unknown patterns of data.
Data Warehouse: A data warehouse is a database system which is designed for analytical
instead of transactional work.

Data Mining: Data mining is considered a process of extracting data from large data sets.
Data Warehouse: On the other hand, data warehousing is the process of pooling all relevant
data together.

Data Mining: One of the most important benefits of data mining techniques is the detection and
identification of errors in the system.
Data Warehouse: One of the pros of a data warehouse is its ability to update consistently.
That is why it is ideal for the business owner who wants the best and latest features.

Data Mining: Data mining helps to create suggestive patterns of important factors, like the
buying habits of customers, products, and sales, so that companies can make the necessary
adjustments in operation and production.
Data Warehouse: A data warehouse adds extra value to operational business systems like CRM
systems when the warehouse is integrated.

Data Mining: Data mining techniques are never 100% accurate and may cause serious
consequences in certain conditions.
Data Warehouse: In the data warehouse, there is a great chance that the data required for
analysis by the organization may not be integrated into the warehouse. This can easily lead to
loss of information.

Data Mining: The information gathered through data mining by organizations can be misused
against a group of people.
Data Warehouse: Data warehouses are created as huge IT projects and therefore involve
high-maintenance systems which can impact the revenue of medium to small-scale organizations.

Data Mining: After successful initial queries, users may ask more complicated queries, which
would increase the workload.
Data Warehouse: A data warehouse is complicated to implement and maintain.

Data Mining: Organisations can benefit from this analytical tool by equipping themselves with
pertinent and usable knowledge-based information.
Data Warehouse: A data warehouse stores a large amount of historical data which helps users
analyze different time periods and trends to make future predictions.

Data Mining: Organisations need to spend a lot of resources on training and implementation.
Moreover, data mining tools work in different manners due to the different algorithms employed
in their design.
Data Warehouse: In a data warehouse, data is pooled from multiple sources and needs to be
cleaned and transformed. This can be a challenge.

Data Mining: Data mining methods are cost-effective and efficient compared to other statistical
data applications.
Data Warehouse: A data warehouse's responsibility is to simplify every type of business data.
Most of the work on the user's part is inputting the raw data.

Data Mining: Another critical benefit of data mining techniques is the identification of errors
which can lead to losses. The generated data can be used to detect a drop in sales.
Data Warehouse: A data warehouse allows users to access critical data from a number of
sources in a single place, saving the user's time in retrieving data from multiple sources.

Data Mining: Data mining helps to generate actionable strategies built on data insights.
Data Warehouse: Once information is entered into the data warehouse system, it is unlikely to
be lost track of again; a quick search helps to find the right statistical information.
Web usage mining is the process of extracting useful information from server logs, i.e., users'
browsing history. It is the process of finding out what users are looking for on the
Internet. Some users might be looking only at textual data, whereas others might be
interested in multimedia data.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set
{computer, antivirus software} is a 2-itemset. The occurrence frequency of an itemset is the
number of transactions that contain the itemset. This is also known, simply, as the frequency,
support count, or count of the itemset.
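The occurrence frequency can be sketched directly: count the transactions that contain every item of the itemset. The small transaction database below is invented, reusing the 2-itemset example from above.

```python
# Each transaction is the set of items bought together.
transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain the whole itemset (<= is subset)."""
    return sum(1 for t in transactions if itemset <= t)

k_itemset = {"computer", "antivirus software"}  # a 2-itemset
print(support_count(k_itemset, transactions))   # 2
```

The 2-itemset appears in two of the three transactions, so its support count is 2.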
Web content mining, also known as text mining, is generally the second step in Web data mining.
Content mining is the scanning and mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query. This scanning is completed after the
clustering of web pages through structure mining and provides the results based upon the level of
relevance to the suggested query. With the massive amount of information that is available on the
World Wide Web, content mining provides the results lists to search engines in order of highest
relevance to the keywords in the query.
Incomplete, noisy, and inconsistent data are commonplace properties of large real world
databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of
interest may not always be available, such as customer information for sales transaction data.
Other data may not be included simply because it was not considered important at the time of
entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment
malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need to be
inferred.
1. Data cleaning
2. Data Integration
3. Data Transformation
4. Data reduction
Data cleaning means removing inconsistent data or noise and collecting the necessary
information from a collection of interrelated data.
6. Define Data mining. (Nov/Dec 2008)
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as
gold mining rather than rock or sand mining. Thus, data mining should have been more
appropriately named "knowledge mining from data".
The design of an effective data mining query language requires a deep understanding of the
power, limitation, and underlying mechanisms of the various kinds of data mining tasks. A data
mining query language can be used to specify data mining tasks. In particular, we examine how
to define data warehouses and data marts in our SQL-based data mining query language, DMQL.
79. List the five primitives for specifying a data mining task.
1. The set of task-relevant data to be mined
2. The kind of knowledge to be mined
3. The background knowledge to be used in the discovery process
4. The interestingness measures and thresholds for pattern evaluation
5. The expected representation for visualizing the discovered patterns
It is a process that abstracts a large set of task-relevant data in a database from relatively low
conceptual levels to higher conceptual levels. There are two approaches to generalization:
1) Data cube approach
2) Attribute-oriented induction approach
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.
For missing values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
For noisy data:
1. Binning: binning methods smooth a sorted data value by consulting the values around it.
2. Regression: data can be smoothed by fitting the data to a function, such as with regression.
3. Clustering: outliers may be detected by clustering, where similar values are organized into
groups, or "clusters".
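Binning, the first smoothing method, can be sketched as smoothing by bin means: sort the values, split them into equal-depth bins, and replace each value by its bin's mean. The price list is a textbook-style invented example.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning: replace each value with the mean of its bin."""
    data = sorted(values)           # binning works on sorted data
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]    # one bin of up to `depth` values
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Each bin of three values collapses to its mean, smoothing out the local noise while keeping the overall trend.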
● Transaction reduction
● Partitioning
● Sampling
The Naive Bayes algorithm is used to generate mining models. These models help to identify
relationships between input columns and the predictable columns. This algorithm can be used in
the initial stage of exploration. The algorithm calculates the probability of every state of each
input column given the predictable column's possible states. After the model is built, the results
can be used for exploration and making predictions.
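The counting described above can be sketched as a minimal categorical Naive Bayes: estimate P(class) and P(feature=value | class) from frequencies, then score each class for a new row. The toy weather data is invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feat_counts[(i, label)][value] += 1
    return class_counts, feat_counts

def predict(row, class_counts, feat_counts):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, c in class_counts.items():
        p = c / total
        for i, value in enumerate(row):
            p *= feat_counts[(i, label)][value] / c
        if p > best_p:
            best, best_p = label, p
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(predict(("rain", "mild"), *model))  # "yes"
```

A production model would add Laplace smoothing so that unseen feature values do not zero out a class entirely.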
A clustering algorithm is used to group sets of data with similar characteristics, also called
clusters. These clusters help in making faster decisions and exploring data. The algorithm first
identifies relationships in a dataset, following which it generates a series of clusters based on
those relationships. The process of creating clusters is iterative: the algorithm redefines the
groupings to create clusters that better represent the data.
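The iterative refinement described above can be sketched as a tiny one-dimensional k-means: assign each point to its nearest centre, recompute the centres, and repeat until the groupings settle. The data and initial centres are invented for illustration.

```python
def kmeans_1d(points, centres, iterations=10):
    """Tiny 1-D k-means: iteratively reassign points and recompute centres."""
    for _ in range(iterations):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(p - c))  # assign step
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its members.
        centres = [sum(m) / len(m) for m in clusters.values() if m]
    return sorted(centres)

print(kmeans_1d([1, 2, 3, 10, 11, 12], centres=[1, 12]))  # [2.0, 11.0]
```

After one pass the two groupings {1, 2, 3} and {10, 11, 12} are found, and further iterations leave the centres unchanged.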
○ Statistics
○ Machine learning
○ Decision Tree
○ Artificial Intelligence
○ Genetic Algorithm
○ Meta learning
○ Data cleaning
○ Data Mining
○ Pattern Evaluation
○ Knowledge Presentation
○ Data Integration
○ Data Selection
○ Data Transformation
88. What are the different fields where data mining is used?
Data Mining is mainly used by big consumer-based companies that focus on retail, financial,
communication, and marketing fields. It is used to get the consumer's transactional data pattern
to determine price, customer preferences, and product positioning, which later impact sales,
customer satisfaction, and corporate profits.
Following is the list of most important areas where data mining is widely used:
Data mining has a significant impact in the field of healthcare. It uses data and analytics to
identify the best practices that can improve care and reduce costs. Scientists use several Data
Mining approaches like multi-dimensional databases, machine learning, soft computing, data
visualization, statistics, etc., to make things easy for patients. Using Data Mining, we can predict
the volume of patients in every category and make sure that the patients get the appropriate care
at the right place and at the right time.
Market Basket Analysis
This modeling technique follows the theory that if you buy a specific group of items, you are
more likely to buy another group of items. Using this technique, the retailer can understand the
purchase behavior of a buyer and change the store's layout according to the buyer's needs.
Educational data mining is used to identify and predict students' future learning behavior. If
a student is studying a particular course, the institute can use data mining to know which related
course the student may apply for later. It also helps in focusing on what to teach and how to
teach. The institutes can capture the learning patterns of the students and use them to develop
teaching techniques.
Manufacturing Engineering
By using Data mining tools, we can discover patterns in complex manufacturing processes. We
can use this to predict the product development span time, cost, and dependencies, among other
tasks.
Fraud Detection
Data Mining can be used as a perfect fraud detection system to protect the information of all
users. By Data Mining, we can classify fraudulent or non-fraudulent data and make an algorithm
to identify whether the record is fraudulent or not.
● Intrusion Detection
● Lie Detection
● Customer Segmentation
● Financial Banking
● Corporate Surveillance
● Research Analysis
● Criminal Investigation
● Bio Informatics
89. What are the different techniques used for Data Mining?
Prediction: This technique specifies the relationship between independent and dependent
instances. For example, while considering sales data, if we want to predict future profit, the
sale acts as the independent instance, whereas the profit is the dependent instance. Accordingly,
based on historical sales and profit data, the associated profit is the predicted value.
Decision trees: It specifies a tree structure where the decision tree's root acts as a
condition/question having multiple answers. Each answer sets to specific data that helps in
determining the final decision based on the data.
Clustering analysis: This technique specifies that a cluster of objects having similar
characteristics is formed automatically. The clustering method defines classes and then places
suitable objects in each class.
Sequential Patterns: This technique is used to specify the pattern analysis used for discovering
identical patterns in transaction data or regular events. For example, customers' historical data
helps a brand identify the patterns in the transactions that happened in the past year.
Classification Analysis: This is a Machine Learning based method in which each item in a
particular set is classified into predefined groups. It uses advanced techniques like linear
programming, neural networks, decision trees, etc.
Association rule learning: This technique is used to create a pattern based on the items'
relationship in a single transaction.
There are mainly three storage models available in OLAP. They are: MOLAP, ROLAP, and
HOLAP.
91. What are the advantages and disadvantages of using the MOLAP storage model?
The term MOLAP stands for "Multidimensional Online Analytical Processing." As the name
shows, it is a multidimensional storage model. This storage model type stores the data in
multidimensional cubes and not in the standard relational databases.
● The most significant disadvantage of using MOLAP is that it can store only a limited
amount of data. In this storage model, the calculations are triggered at cube-processing time.
● It is not free; you have to pay the license cost associated with it.
92. What are the advantages and disadvantages of using the ROLAP storage model?
The term ROLAP stands for "Relational Online Analytical Processing." In this storage model,
the data is stored in the form of a relational database.
● In this storage model, the data is stored in relational databases, so it is easy to handle a
large amount of data.
● The most significant disadvantage of this storage model is that it is comparatively slow.
● All the other disadvantages we face in SQL apply to this storage model as well.
In Data Mining, discrete data is a type of data defined as finite data. This type of information is
never changed.
Example: Mobile numbers, gender, etc. are the example of discrete data.
On the other hand, continuous data is a type of data that changes continuously and in an ordered
fashion.
A continuous measurement on a linear scale is called an interval-scaled variable; for example,
height, weight, and weather temperature. Distances between such measurements can be
calculated using the Euclidean distance or the Minkowski distance.
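The two distance measures can be sketched directly: Euclidean distance is the Minkowski distance with p = 2, and p = 1 gives the Manhattan distance. The point values are invented for illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)

print(euclidean((0, 0), (3, 4)))     # 5.0  (the classic 3-4-5 triangle)
print(minkowski((0, 0), (3, 4), 1))  # 7.0  (Manhattan distance)
```

Interval-scaled attributes such as height and weight should be standardized before computing distances, so one attribute's larger scale does not dominate the result.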
A decision tree is a tree in which every node is either a leaf node or a decision node. The tree
takes an object as input and outputs some decision. All paths from the root node to a leaf node
are reached using AND, OR, or both. The tree is constructed using the regularities of
the data. The decision tree is not affected by Automatic Data Preparation.
The support of a rule R is the ratio of the number of transactions containing R's items to the
total number of transactions.
The confidence of a rule X -> Y is the ratio of the number of transactions containing both X and
Y to the number of transactions containing X.
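Both ratios can be sketched over a small transaction set (the basket data below is invented), with the rule {bread} -> {milk} as the example.

```python
# Each transaction is a set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions containing x, the fraction also containing y."""
    return support(x | y) / support(x)

print(support({"bread", "milk"}))        # 0.5   (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))   # 2/3   (2 of the 3 bread baskets)
```

Here support({bread, milk}) = 2/4 and confidence(bread -> milk) = support({bread, milk}) / support({bread}) = 0.5 / 0.75.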
98. Why is association rule learning necessary?
In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases.
Data discrimination is the comparison of the general features of the target class objects against
one or more contrasting objects.
Text mining is the procedure of synthesizing information by analyzing relations, patterns, and
rules among textual data. These procedures include text summarization, text categorization, and
text clustering.