Data Mining Questions
UNIT-1
1) What is data mining?
Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately
understandable patterns in data.
Or
Data mining is the process of sorting through large data sets to identify patterns and relationships to
solve problems through data analysis.
KDD stands for Knowledge Discovery in Databases. It is the process of identifying a valid, potentially
useful and ultimately understandable structure in data.
6) What is clustering?
Clustering analysis is a data mining technique used to identify data that are similar to each other. This
process helps to understand the differences and similarities between the data.
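As a minimal illustration (not from the notes), the sketch below groups a few 2-D points with k-means; the data points and the use of scikit-learn are assumptions for demonstration only.

```python
# Minimal clustering sketch (assumes scikit-learn is installed; data is made up).
from sklearn.cluster import KMeans

# Toy 2-D points forming two visually separate groups.
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # group near (8, 8)

# k-means partitions the points into k clusters of similar items.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # e.g. [0 0 0 1 1 1] - similar points share a label
print(model.cluster_centers_)  # the two cluster centroids
```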
17. State the difference between traditional data base and data warehouse.
----------------
18. Define slicing and dicing operations of OLAP
-----------------
19. What is operational data?
It is the data collected and available in the transaction processing systems. It resides in the source
databases.
20. What is reconciled data?
The detailed transaction level data that have been cleaned and confirmed for consistency is
called reconciled data. It serves as the base data for all warehouse activity.
UNIT-1 5 MARKS
1. Explain the stages of KDD.
KDD-Knowledge Discovery in Database
The KDD process tends to be highly iterative and interactive. Data mining is only one of the
steps involved in knowledge discovery in databases. The various steps in the knowledge discovery
process include data selection, data cleaning and pre-processing, data transformation and reduction,
data mining algorithm selection and finally the post-processing and the interpretation of the
discovered knowledge.
The stages of KDD, starting with the raw data and finishing with the extracted knowledge, are
given below:
❖ Selection: This stage is concerned with selecting or segmenting the data that are relevant to
some criteria.
For example, for credit card customer profiling, we extract the type of transactions for each
type of customer, and we may not be interested in details of the shop where the transaction
takes place.
❖ Pre-processing: It is the data cleaning stage where unnecessary information is removed. This
stage reconfigures the data to ensure a consistent format, as there is a possibility of
inconsistent formats.
❖ Transformation: The data is not merely transferred, but transformed in order to be suitable
for the task of data mining. In this stage, the data is made usable and navigable.
❖ Data mining: This stage is concerned with the extraction of patterns from the data. It refers to
finding relevant and useful information from databases.
❖ Interpretation and Evaluation: The patterns obtained in the data mining stage are converted into
knowledge, which in turn is used to support decision making.
❖ Data visualization: Data mining allows the analyst to focus on certain patterns and trends and
explore them in-depth using visualization. It helps users to examine large volumes of data
and detect the patterns visually. Visual displays of data such as maps, charts and other
representation allow data to be represented compactly to the users.
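The stages above can be pictured as a small pipeline. The sketch below is only an illustrative outline, assuming pandas and scikit-learn; the file name, column names and the choice of clustering as the mining step are hypothetical, not from the notes.

```python
# Illustrative KDD pipeline sketch (file and column names are hypothetical).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Selection: keep only the attributes relevant to the analysis.
raw = pd.read_csv("transactions.csv")                      # hypothetical source data
data = raw[["customer_id", "amount", "num_items"]].copy()

# Pre-processing (cleaning): drop records with missing values.
data = data.dropna()

# Transformation: scale numeric attributes into a consistent range.
features = StandardScaler().fit_transform(data[["amount", "num_items"]])

# Data mining: here, clustering customers into behavioural groups.
data["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Interpretation and evaluation: summarise each discovered segment.
print(data.groupby("segment")[["amount", "num_items"]].mean())
```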
Database querying vs. data mining:
• In traditional querying, we know exactly what information we are looking for; in data mining, we are not clear about the possible correlations or patterns.
• Query tools are loosely coupled with the database; data mining systems may be loosely coupled or tightly coupled.
1)Healthcare: Data mining holds great potential to improve health systems. It uses data and analytics to identify best
practices that improve care and reduce costs. Researchers use data mining approaches like multi-
dimensional databases, machine learning, soft computing, data visualization and statistics. Mining
can be used to predict the volume of patients in every category. Processes are developed that make
sure that the patients receive appropriate care at the right place and at the right time. Data mining
can also help healthcare insurers to detect fraud and abuse.
2)Market Basket Analysis: Market basket analysis is a modelling technique based upon the theory that if you buy a certain group
of items you are more likely to buy another group of items. This technique may allow the retailer to
understand the purchase behavior of a buyer. This information may help the retailer to know the
buyer’s needs and change the store’s layout accordingly. Using differential analysis comparison of
results between different stores, between customers in different demographic groups can be done.
3)Education:
There is a new emerging field, called Educational Data Mining, concerned with developing methods
that discover knowledge from data originating from educational Environments. The goals of EDM are
identified as predicting students’ future learning behaviour, studying the effects of educational
support, and advancing scientific knowledge about learning. Data mining can be used by an
institution to make accurate decisions and also to predict students' results. With the results,
the institution can focus on what to teach and how to teach. The learning patterns of the students can be
captured and used to develop techniques to teach them.
4)Manufacturing Engineering:
Knowledge is the best asset a manufacturing enterprise can possess. Data mining tools can be
very useful to discover patterns in complex manufacturing processes. Data mining can be used in
system-level designing to extract the relationships between product architecture, product portfolio,
and customer needs data. It can also be used to predict the product development span time, cost,
and dependencies among other tasks.
5)CRM:
Customer Relationship Management is all about acquiring and retaining customers, also improving
customers’ loyalty and implementing customer focused strategies. To maintain a proper relationship
with a customer, a business needs to collect data and analyse the information. This is where data
mining plays its part. With data mining technologies the collected data can be used for analysis.
Instead of being confused about where to focus in order to retain customers, businesses that seek a solution get
filtered results.
6)Fraud Detection: Billions of dollars have been lost to fraudulent actions. Traditional methods of
fraud detection are time consuming and complex. Data mining aids in providing meaningful patterns
and turning data into information. Any information that is valid and useful is knowledge. A perfect
fraud detection system should protect information of all the users. A supervised method includes
collection of sample records. These records are classified as fraudulent or non-fraudulent. A model is
built using this data and the algorithm is made to identify whether the record is fraudulent or not.
7)Intrusion Detection:
Any action that will compromise the integrity and confidentiality of a resource is an intrusion. The
defensive measures to avoid an intrusion include user authentication, avoiding programming errors,
and information protection. Data mining can help improve intrusion detection by adding a level of
focus to anomaly detection. It helps an analyst to distinguish an activity from common everyday
network activity. Data mining also helps extract data which is more relevant to the problem.
8)Customer Segmentation:
Traditional market research may help us to segment customers but data mining goes in deep and
increases market effectiveness. Data mining aids in grouping customers into distinct segments
and tailoring offerings to their needs. Marketing is always about retaining customers.
Data mining allows finding a segment of customers based on vulnerability, and the
business can target them with special offers and enhance satisfaction.
9)Financial Banking:
With computerized banking everywhere, a huge amount of data is generated with every new
transaction. Data mining can contribute to solving business problems in banking and finance by
finding patterns, causalities, and correlations in business information and market prices that are not
immediately apparent to managers, because the volume of data is too large or is generated too quickly
for experts to screen. Managers may use this information for better segmenting, targeting,
acquiring, retaining and maintaining profitable customers.
10)Corporate Surveillance:
Corporate surveillance is the monitoring of a person or group’s behaviour by a corporation. The data
collected is most often used for marketing purposes or sold to other corporations, but is also
regularly shared with government agencies. It can be used by businesses to tailor their products
to what their customers find desirable. The data can be used for direct marketing purposes, such as the
targeted advertisements on Google and Yahoo, where ads are targeted to the user of the search
engine by analyzing their search history and emails.
11)Research Analysis:
History shows that we have witnessed revolutionary changes in research. Data mining is helpful in
data cleaning, data pre-processing and integration of databases. The researchers can find any similar
data from the database that might bring any change in the research. Identification of any
co-occurring sequences and the correlations between activities can be identified. Data visualisation
and visual data mining provide us with a clear view of the data.
Criminal Investigation: Criminology is a process that aims to identify crime characteristics. Crime analysis includes exploring and
detecting crimes and their relationships with criminals. The high volume of crime datasets and
the complexity of relationships between these kinds of data have made criminology an appropriate
field for applying data mining techniques. Text-based crime reports can be converted into word
processing files. This information can be used to perform the crime matching process.
12)Bio Informatics:
Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich. Mining biological
data helps to extract useful knowledge from massive datasets gathered in biology, and in other
related life sciences areas such as medicine and neuroscience. Applications of data mining to
bioinformatics include gene finding, protein function inference, disease diagnosis, disease prognosis,
disease treatment optimization, protein and gene interaction network reconstruction, data
cleansing, and protein sub-cellular location prediction.
Limited information:
A database is often designed for purposes other than data mining and, sometimes,
some attributes which are essential for knowledge discovery in the application domain are
not present in the data. Thus, it may be very difficult to discover significant knowledge
about a given domain.
4. Uncertainty:
This refers to the severity of error and the degree of noise in the data. Data precision is an important
consideration in a discovery system.
2. UNSUPERVISED LEARNING:
FACT CONSTELLATION:
A Fact Constellation is a kind of schema where we have more than one Fact Table sharing some
Dimension Tables among them. It is also called a Galaxy Schema.
For example, let us assume that Deccan Electronics would like to have another Fact Table for supply
and delivery.
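As a rough sketch of that example (all table and column names below are invented for illustration), two fact tables can share the same dimension tables:

```python
# Toy fact constellation / galaxy schema: two fact tables sharing dimension tables.
import pandas as pd

# Shared dimension tables.
dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["TV", "Fridge"]})
dim_time = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"]})

# Fact table 1: sales, keyed by the shared dimensions.
fact_sales = pd.DataFrame({"item_key": [1, 2], "time_key": [10, 11], "units_sold": [120, 80]})

# Fact table 2: supply and delivery, keyed by the same dimensions.
fact_supply = pd.DataFrame({"item_key": [1, 2], "time_key": [10, 11], "units_shipped": [130, 90]})

# Both fact tables join to the same dimensions - the defining trait of a galaxy schema.
print(fact_sales.merge(dim_item, on="item_key").merge(dim_time, on="time_key"))
print(fact_supply.merge(dim_item, on="item_key").merge(dim_time, on="time_key"))
```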
8. What is SVM?
-----------------
9. Write two advantages of decision trees
------------------
10. What are the two different methods of determining the goodness of a split?
11. ----------------------
12. Define rough set.
Rough Set: A rough set is the pair (A_*(X), A^*(X)), where A_*(X) is the lower approximation of X and
A^*(X) is the upper approximation of X. If the boundary region A^*(X) - A_*(X) is the empty set, then X is
called a crisp set; otherwise X is called a rough set.
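A short sketch of how the lower and upper approximations forming this pair are computed is given below; the universe, equivalence classes and target set X are invented for illustration.

```python
# Rough set sketch: lower/upper approximation of a set X (example data is made up).
# The universe is partitioned into equivalence classes by an indiscernibility relation.
classes = [{1, 2}, {3, 4}, {5, 6, 7}]
X = {1, 2, 3}  # the target set to approximate

lower = set().union(*(c for c in classes if c <= X))   # classes fully contained in X
upper = set().union(*(c for c in classes if c & X))    # classes that intersect X
boundary = upper - lower

print("lower approximation:", lower)   # {1, 2}
print("upper approximation:", upper)   # {1, 2, 3, 4}
print("boundary region:", boundary)    # {3, 4} -> non-empty, so X is rough, not crisp
```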
5 MARKS
1. Explain Apriori algorithm with an example.
Apriori Algorithm
The Apriori algorithm finds frequent itemsets in a dataset for Boolean association rules. The algorithm
is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative,
level-wise search where frequent k-itemsets are used to find (k+1)-itemsets. To improve
the efficiency of the level-wise generation of frequent itemsets, an important property called the
Apriori property is used, which helps by reducing the search space.
Efficient Frequent Itemset Mining Methods:
Finding Frequent Itemsets Using Candidate Generation:
The Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an
iterative approach known as a level-wise search, where k-itemsets are used
to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to
find L3, and so on, until no more frequent k-itemsets can be found. Finding
each Lk requires one full scan of the database. A two-step process is followed in Apriori, consisting
of a join and a prune action.
Apriori Property
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is
the anti-monotonicity of the support measure: if an itemset is infrequent, all of its supersets must also be
infrequent.
Consider the following transaction dataset. We will find frequent itemsets and generate association rules for
them.
Minimum support count is 2.
Minimum confidence is 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1
(candidate set).
(II) Compare each candidate item's support count with the minimum support count (here
min_support = 2); if the support_count of a candidate item is less than min_support, remove that
item. This gives us the itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining
Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of each itemset are frequent or not, and if not frequent, remove
that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}; they are frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count (here
min_support = 2); if the support_count of a candidate item is less than min_support, remove that
item. This gives us the itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and
Lk-1 is that they should have (K-2) elements in common. So here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent or not, and if not, remove
that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are
frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count (here
min_support = 2); if the support_count of a candidate item is less than min_support, then
remove that item. This gives us the itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and
Lk-1 (K = 4) is that they should have (K-2) elements in common. So here, for L3,
the first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent or not. (Here the itemset
formed by joining L3 is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So there is no
itemset in C4.
We stop here because no further frequent itemsets are found. Thus, we have discovered all the
frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need
to calculate the confidence of each rule.
Confidence-
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought
butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
So here, by taking an example of any frequent itemset, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, then the first 3 rules can be considered strong association rules.
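A compact Python sketch of the whole procedure is given below. Note that the transaction table referred to above is not reproduced in these notes, so the transaction list here is an assumption: the standard nine-transaction example whose support counts match the rule confidences computed above.

```python
# Apriori sketch: level-wise frequent-itemset mining followed by rule generation.
from itertools import combinations

# Assumed transactions (the classic nine-transaction example consistent with the counts above).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support, min_confidence = 2, 0.5   # 50% threshold, as used in the rule discussion above

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# Level-wise search: L1, L2, ... until no more frequent k-itemsets are found.
items = sorted({i for t in transactions for i in t})
frequent, candidates = {}, [frozenset([i]) for i in items]
while candidates:
    # Prune step: keep candidates meeting the minimum support count.
    level = {c: support(c) for c in candidates if support(c) >= min_support}
    frequent.update(level)
    # Join step: combine frequent k-itemsets into (k+1)-candidates.
    prev = list(level)
    joined = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    # Apriori property: every k-subset of a candidate must itself be frequent.
    candidates = [c for c in joined
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))]

print({tuple(sorted(s)): n for s, n in frequent.items() if len(s) > 1})

# Rule generation: keep rules whose confidence meets the threshold.
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = frequent[itemset] / frequent[lhs]
            if conf >= min_confidence:
                print(set(lhs), "->", set(itemset - lhs), f"confidence = {conf:.0%}")
```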
Biological neuron
In the case of a perceptron, the input signals are combined with different weights and fed into the
artificial neuron or perceptron along with a bias element. Representationally, within the perceptron, the
net sum is calculated as the weighted sum of the input signals plus a bias element; then the net sum is
fed into a non-linear activation function. Based on the activation function, the output signal is sent
out. The diagram below represents a perceptron. Notice the bias element b and sum of weights and
input signals represented using x and w. The threshold function represents the non-linear activation
function.
The perceptrons laid out in the form of single-layer or multi-layer neural networks can
be used to perform both regression and classification tasks. The diagram below represents the
single-layer neural network (perceptron) that represents linear regression (left) and softmax
regression (each output o1, o2, and o3 represents logits). Recall that the Softmax regression is a
form of logistic regression that normalizes an input value into a vector of values that follows a
probability distribution whose total sums up to 1. The output values are in the range [0,1]. This
allows the neural network to predict as many classes or dimensions as required. This is why softmax
regression is sometimes referred to as a multinomial logistic regression.
Here is a picture of a perceptron represented as the summation of inputs with weights
(w1, w2, w3, ..., wm) and a bias element, which is passed through the activation function to obtain the final
output. This can be used for both regression and binary classification problems.
Fig. Perceptron
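A small NumPy sketch of this computation (weighted sum of inputs plus bias, passed through an activation, with softmax for the multi-class case) is shown below; the input values, weights and bias are illustrative assumptions.

```python
# Perceptron sketch: net = w . x + b, then a non-linear activation (values are illustrative).
import numpy as np

def step(z):                       # threshold activation for binary classification
    return 1 if z >= 0 else 0

def sigmoid(z):                    # smooth non-linear activation
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                    # normalises logits into a probability distribution
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])     # input signals x1..xm
w = np.array([0.4, 0.3, 0.8])      # weights w1..wm
b = -0.5                           # bias element

net = np.dot(w, x) + b             # net sum of weighted inputs and the bias
print("net sum:", net)                   # 1.0
print("step output:", step(net))         # 1 -> perceptron-style class label
print("sigmoid output:", sigmoid(net))   # ~0.73, logistic-regression style

logits = np.array([2.0, 1.0, 0.1])       # outputs o1, o2, o3 of a single-layer network
print("softmax:", softmax(logits))       # values in [0, 1] that sum to 1
```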
The construction of a decision tree involves the following three main phases.
• Construction phase: The initial decision tree is constructed in this phase, based on the
entire training set. It requires recursively partitioning the training set into two or more
sub-partitions using a splitting criterion until a stopping criterion is met.
• Pruning phase: The tree constructed in the previous phase may not result in the best
possible set of rules due to over-fitting. The pruning phase removes some of the lower
branches and nodes to improve the performance.
• Processing the pruned tree: This is done to increase understandability.
Best Split
To build an optimal decision tree, it is necessary to select an attribute corresponding to the
best possible split. The main operations during tree building are:
1. Evaluation of splits for each attribute and selection of the best split; determination of the splitting attribute.
2. Determination of the splitting condition on the selected splitting attribute.
3. Partitioning the data using the best split.
The complexity lies in determining the best split for each attribute. The splitting also depends
on whether the domain of the attribute is numerical or categorical. The generic algorithm assumes that
the splitting attribute and splitting criteria are known. The desirable feature of splitting is that it
should do the best job of splitting at a given stage. The first task is to decide which of the
independent attributes is the best splitter. If the attribute takes multiple values, we sort it and then
use some evaluation function to measure its goodness. We compare the effectiveness of the split
provided by the best splitter from each attribute. The winner is chosen as the splitter for the root
node.
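A toy sketch of these operations on a single numeric attribute is shown below, using the Gini index as the evaluation function; the attribute values and class labels are made up, and Gini is only one possible goodness measure.

```python
# Best-split sketch: score candidate split points of one numeric attribute with the Gini index.
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Hypothetical training data: (attribute value, class label) pairs.
data = [(23, "no"), (25, "no"), (30, "yes"), (35, "yes"), (40, "yes"), (45, "no")]
data.sort()                      # sort the numeric attribute before scanning split points

best = None
values = [v for v, _ in data]
for i in range(1, len(data)):
    threshold = (values[i - 1] + values[i]) / 2          # candidate splitting condition
    left = [c for v, c in data if v <= threshold]
    right = [c for v, c in data if v > threshold]
    # Weighted impurity of the two partitions: lower means a better split.
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
    if best is None or score < best[0]:
        best = (score, threshold)

print("best split: value <=", best[1], "with weighted Gini", round(best[0], 3))
```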