DMA QB Solved
1-mark questions:
2. What is Data?
3. What is Information?
Ans. Information is processed data that conveys some meaningful sense to our
mind.
Ans. Patterns are rules that describe recurring regularities or structures within the data.
Ans. Antecedent.
Ans. Consequent.
Ans. False.
Ans. True.
Ans. (1) easily understood by humans, (2) valid on new or test data with some
degree of certainty, (3) potentially useful, and (4) novel.
Ans. Metadata.
16. The learning which is used to find hidden patterns from labelled
data is called ………
Ans. Supervised
Ans. KNN
Ans. 5.
Ans. D.
Ans. D.
Ans. D
Ans. Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through data
analysis.
2. Define data warehouse. What is its purpose?
3. What are the key elements of a data warehouse? Explain each of them.
4. Describe the key steps in the data mining process. Why is it important to
follow these processes?
The data mining process is divided into two parts, i.e., Data Preprocessing and
Data Mining. Data Preprocessing involves data cleaning, data integration, data
reduction, and data transformation. The data mining part performs data
mining, pattern evaluation and knowledge representation of data.
It is important because:
Data Cleaning: fills in missing values and removes noisy data.
Data Integration: improves the accuracy and speed of the data mining process.
Data Reduction: This technique is applied to obtain relevant data for analysis
from the collection of data.
Data Mining: intelligent methods are applied to extract data patterns.
Ans. It is important because dirty data, if used directly in mining, can confuse
the procedures and produce inaccurate results.
Basically, this step involves the removal of noisy or incomplete data from the
collection. Many methods that clean data automatically are available, but they
are not robust.
(ii) Remove the Noisy Data: Noisy data is data containing random error or variance.
Binning: Binning methods smooth sorted data values by distributing them into
buckets or bins; smoothing is performed by consulting the neighboring values.
In smoothing by bin means, each value in a bin is replaced by the mean of the
bin. In smoothing by bin medians, each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in the bin are
the bin boundaries, and each bin value is replaced by the closest boundary
value.
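A small Python sketch of equal-frequency binning with smoothing by bin means (the sample values are assumed):

    # Equal-frequency binning, then smoothing by bin means
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    n_bins = 3
    size = len(data) // n_bins                    # 3 values per bin
    bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
    smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
    print(smoothed)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]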
7. Define support, confidence and lift in Association rule mining. What are
the demerits of Apriori Algorithm?
Support refers to how often a given itemset or rule appears in the database
being mined.
Confidence refers to how often a given rule turns out to be true in practice.
Lift is the ratio of the rule's confidence to the expected confidence:
lift(X → Y) = support(X ∪ Y) / (support(X) × support(Y)). A lift greater than 1
means X and Y occur together more often than would be expected if they were
independent.
Demerits of Apriori: it generates a very large number of candidate itemsets and
requires repeated scans of the entire database, which makes it slow and
memory-intensive on large datasets.
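As a quick illustration of these three measures (the transactions below are made-up):

    # Support, confidence and lift for the rule X -> Y (toy transactions)
    transactions = [{'bread', 'milk'}, {'bread', 'butter'}, {'milk', 'butter'},
                    {'bread', 'milk', 'butter'}, {'milk'}]
    X, Y = {'bread'}, {'milk'}
    n = len(transactions)
    supp_xy = sum(1 for t in transactions if X | Y <= t) / n   # support(X U Y)
    supp_x = sum(1 for t in transactions if X <= t) / n
    supp_y = sum(1 for t in transactions if Y <= t) / n
    confidence = supp_xy / supp_x
    lift = confidence / supp_y
    print(supp_xy, confidence, lift)   # 0.4 0.667 0.833 (approx.)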
Association rules are useful for analyzing and predicting customer behavior.
They play an important part in customer analytics, market basket analysis,
product clustering, catalog design and store layout. Programmers use
association rules to build programs capable of machine learning.
9. Find the cosine similarity and the dissimilarity between the 2 vectors- ‘X’ &
‘Y’ . X= {3, 2, 0, 5} and Y = {1, 0, 0, 0}
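Ans. Cosine similarity is the dot product of the vectors divided by the product of their magnitudes:

    X · Y = (3)(1) + (2)(0) + (0)(0) + (5)(0) = 3
    ||X|| = sqrt(3² + 2² + 0² + 5²) = sqrt(38) ≈ 6.164
    ||Y|| = sqrt(1² + 0² + 0² + 0²) = 1
    cos(X, Y) = 3 / (6.164 × 1) ≈ 0.487
    Cosine dissimilarity = 1 − cos(X, Y) ≈ 0.513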
10. For the following given Transaction Data set, generate rules using Apriori
Algorithm. Consider the values of support = 22% & Confidence = 70%.
11. Explain each step of KDD process in detail.
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale. You can store your data as-is,
without having to first structure the data, and run different types of analytics.
17. Discuss the steps of the Apriori Algorithm for mining frequent itemsets.
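Ans. The main steps are: (1) scan the database once to count each item and keep those meeting the minimum support (the frequent 1-itemsets); (2) join the frequent (k−1)-itemsets with themselves to generate candidate k-itemsets; (3) prune any candidate that has an infrequent (k−1)-subset; (4) scan the database to count the support of the remaining candidates and keep the frequent ones; (5) repeat from step 2 until no new frequent itemsets are found. A minimal Python sketch of this loop (names and structure are illustrative, not a definitive implementation):

    # Apriori sketch: transactions are sets of items, min_support is a fraction
    from itertools import combinations

    def apriori(transactions, min_support):
        n = len(transactions)

        def support(s):
            return sum(s <= t for t in transactions) / n

        freq = {}
        # frequent 1-itemsets
        k_sets = {s for s in {frozenset([i]) for t in transactions for i in t}
                  if support(s) >= min_support}
        while k_sets:
            freq.update({s: support(s) for s in k_sets})
            k = len(next(iter(k_sets))) + 1
            # join step: combine frequent (k-1)-itemsets into k-item candidates
            candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == k}
            # prune step: every (k-1)-subset of a candidate must be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in k_sets
                                 for s in combinations(c, k - 1))}
            k_sets = {c for c in candidates if support(c) >= min_support}
        return freq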
18. Generate FP-Tree for the following Transaction dataset. [Min. Support
Count= 3]. Show the Conditional Pattern Base, Conditional FP-Tree and
Frequent Item set.
19. Define with suitable examples of each of the following data mining
functionalities: data characterization, data association and data
discrimination. Explain the architecture of a typical data mining system.
Data Characterization: summarization of the general characteristics or features
of a target class of data; e.g., summarizing the characteristics of customers
who spend more than Rs. 5,000 a year at a store.
Data Association: discovery of association rules showing attribute-value
conditions that frequently occur together; e.g., buys(bread) → buys(butter).
Data Discrimination: comparison of the general features of a target class
against those of one or more contrasting classes; e.g., comparing the features
of frequent buyers with those of infrequent buyers.
ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse. The process of ETL can be broken down into the following three
stages:
Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging
area.
Transform: In this stage, the extracted data is transformed into a format
that is suitable for loading into the data warehouse. This may involve
cleaning and validating the data, converting data types, combining data
from multiple sources, and creating new data fields.
Load: After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the data
into the warehouse.
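A minimal sketch of the three stages in Python (the file, table, and column names here are assumptions for illustration):

    # Tiny ETL sketch: extract from a CSV file, transform, load into SQLite
    import csv, sqlite3

    # Extract: read raw rows from the source file into a staging list
    with open('sales_source.csv', newline='') as f:
        staged = list(csv.DictReader(f))

    # Transform: validate rows, convert types, and standardize a field
    rows = [(r['order_id'], r['region'].strip().upper(), float(r['amount']))
            for r in staged if r['amount']]       # skip rows with empty amounts

    # Load: create the warehouse table and insert the transformed rows
    con = sqlite3.connect('warehouse.db')
    con.execute('CREATE TABLE IF NOT EXISTS sales'
                '(order_id TEXT, region TEXT, amount REAL)')
    con.executemany('INSERT INTO sales VALUES (?, ?, ?)', rows)
    con.commit()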
3. Explain Jaccard similarity index. Find the Jaccard similarity index and
Jaccard distance for the following data:
A = {0, 1, 2, 5, 6} B = {0, 2, 3, 4, 5, 7, 9}
The Jaccard Similarity Index is a measure of the similarity between two sets of
data.
If two datasets share the exact same members, their Jaccard Similarity Index
will be 1. Conversely, if they have no members in common then their similarity
will be 0.
A ∩ B = {0, 2, 5}, so the number of common observations is 3.
A ∪ B = {0, 1, 2, 3, 4, 5, 6, 7, 9}, so the total number of distinct observations is 9.
Jaccard Similarity Index J(A, B) = 3 / 9 ≈ 0.33.
Jaccard distance = 1 − J(A, B) = 6 / 9 ≈ 0.67.
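The same computation in Python:

    # Jaccard similarity and distance via set operations
    A, B = {0, 1, 2, 5, 6}, {0, 2, 3, 4, 5, 7, 9}
    J = len(A & B) / len(A | B)
    print(J, 1 - J)   # 0.333... 0.666...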
6. Generate all Frequent Itemsets from the following transaction data given
minimum support = 0.3.
Find the Association Rules from the above frequent sets at minimum 50%
confidence.
Module 3 & Module 4:
2. What are the advantages and disadvantages of the decision tree approach
over other approaches for data mining?
Advantages:
Compared to other algorithms, decision trees require less effort for data
preparation during pre-processing.
A decision tree does not require normalization of data.
A decision tree does not require scaling of data as well.
Missing values in the data also do NOT affect the process of building a
decision tree to any considerable extent.
A Decision tree model is very intuitive and easy to explain to technical
teams as well as stakeholders.
Disadvantages:
A small change in the data can cause a large change in the structure of the
decision tree causing instability.
For a decision tree, the calculations can sometimes become far more complex
than for other algorithms.
Decision trees often take more time to train the model.
Decision tree training is relatively expensive, as the complexity and time
taken are greater.
The Decision Tree algorithm is inadequate for applying regression and
predicting continuous values.
Clustering is a method of data mining that groups similar data points together.
The goal of cluster analysis is to divide a dataset into groups (or clusters) such
that the data points within each group are more similar to each other than to
data points in other groups.
Types of Clustering:
Centroid-based Clustering.
Density-based Clustering.
Distribution-based Clustering.
Hierarchical Clustering
Entropy is the uncertainty/randomness in the data; the more the randomness,
the higher the entropy. Information gain uses entropy to make decisions: the
lower the entropy, the more the information.
Information gain is used in decision trees and random forests to decide the
best split. The higher the information gain, the better the split, and this
also means the lower the entropy.
Entropy is the measure of uncertainty in the data. The effort is to reduce the
entropy and maximize the information gain. The feature having the most
information is considered important by the algorithm and is used for training
the model.
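A small Python sketch of these two quantities (the class counts below are made-up):

    # Entropy and information gain for a candidate split (toy class counts)
    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    parent = [9, 5]                # e.g. 9 'yes' and 5 'no' examples
    children = [[6, 2], [3, 3]]    # class counts in each branch after the split
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    print(round(entropy(parent), 3), round(gain, 3))   # 0.94 0.048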
Input:- Dataset D containing n objects, and k, the desired number of clusters.
Output:- k clusters.
Step-1: Select any k objects from dataset D as the initial medoids.
Step-2: Assign each remaining object to the cluster with the nearest
representative object.
Step-3: Randomly select a non-representative object O_random from the dataset.
Step-4: Compute the total cost (TC) of swapping a representative object O_j
with O_random.
Step-5: If TC < 0, swap O_j with O_random to form the new set of
representative objects.
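A compact sketch of the Step-4/Step-5 swap test (the dist function and the data representation are assumptions):

    # PAM-style swap test: evaluate replacing medoid o_j with non-medoid o_random
    def total_cost(data, medoids, dist):
        # cost = sum of distances from every object to its nearest medoid
        return sum(min(dist(p, m) for m in medoids) for p in data)

    def try_swap(data, medoids, o_j, o_random, dist):
        candidate = [o_random if m == o_j else m for m in medoids]
        tc = total_cost(data, candidate, dist) - total_cost(data, medoids, dist)
        return candidate if tc < 0 else medoids   # swap only if TC < 0

Repeating Steps 3-5 until no swap reduces the total cost gives the final medoids.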
It's a measure of similarity for the two sets of data, with a range from 0% to
100%. The higher the percentage, the more similar the two populations.
9. Apply the K-means clustering for the following dataset for two clusters.
Consider data point S1 and S2 are the initial centroid of the respective
clusters. Continue the procedure for three iterations.
Same as problem 7.
An attribute selection measure is a heuristic for choosing the splitting test that
“best” separates a given data partition, D, of class-labeled training tuples into
individual classes.
In the case of classification, there are predefined labels assigned to each input
instance according to its properties, whereas in clustering those labels are
missing.
Euclidean distance is the length of the shortest path (a straight line) between
the source and the destination.
Manhattan distance is the sum of the absolute differences between the
coordinates of the source (s) and destination (d), i.e., the distance measured
along axis-parallel straight-line segments.
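For example, in Python (the two points are assumed):

    # Euclidean vs Manhattan distance between two points
    p, q = (1, 2), (4, 6)
    euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5   # 5.0
    manhattan = sum(abs(a - b) for a, b in zip(p, q))            # 7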
22. Use single and complete linkage agglomerative clustering to group the
data described by the following distance matrix. Show the dendrograms.
23. How does agglomerative hierarchical clustering work?
It is a bottom-up approach: each data point starts in its own cluster, and at
each step the two closest clusters are merged until one cluster (or the desired
number of clusters) remains.
In other words, if the linear model fits our observations well enough, then we
can estimate that the more emails we send, the more responses we will get.
Multiple regression indicates that there is more than one input variable that
may affect the outcome, or target variable. For our email campaign example,
you may include an additional variable with the number of emails sent in the
last month.
For these models, it is important to understand exactly what effect each input
has and how they combine to produce the final target variable results.
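A minimal least-squares fit for the one-input case (the email-campaign numbers below are made-up):

    # Simple linear regression by least squares (toy email-campaign data)
    xs = [100, 200, 300, 400, 500]   # emails sent
    ys = [12, 25, 31, 47, 55]        # responses received
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    print(f'y = {a:.2f} + {b:.4f}x')   # predict responses from emails sent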
29. Use the data given in Dataset as shown below, create a regression model
to predict the Test2 from Test1 score. Then predict the score for the one who
got a 46 in Test1.
30. Marks obtained by 12 students in the college test (x) and the university
test (y) are as follows:
Construct the regression line that approximates the data set. What is your
estimate of the marks a student could have in the university test if he
obtained 60 marks in the college test but was ill at the time of the university
test?
Decision trees are less appropriate for estimation tasks where the goal is to
predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train.
Decision trees are prone to overfitting the training data, particularly when
the tree is very deep or complex.
Small variations in the training data can result in different decision trees
being generated.
Many decision tree algorithms do not handle missing data well, and require
imputation or deletion of records with missing values.
The initial splitting criteria used in decision tree algorithms can lead to
biased trees, particularly when dealing with unbalanced datasets or rare
classes.
Decision trees are limited in their ability to represent complex relationships
between variables.
33. Create a decision tree for the following data given below. The objective is
to predict the class category (Play Tennis or not?).
Because it does no training at all when you supply the training data. At training
time, all it is doing is storing the complete data set but it does not do any
calculations at this point. Neither does it try to derive a more compact model
from the data which it could use for scoring. Therefore, we call this algorithm
lazy.
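A bare-bones illustration of this behaviour: "training" just stores the data, and all distance computation happens at query time (the points and labels are toy data):

    # Lazy KNN sketch: no work at training time, all work at prediction time
    from collections import Counter

    train = [((1, 1), 'A'), ((2, 1), 'A'), ((5, 4), 'B'), ((6, 5), 'B')]

    def knn_predict(query, k=3):
        # distances are computed only now, when a prediction is requested
        by_dist = sorted(train, key=lambda t: sum((a - b) ** 2
                                                  for a, b in zip(t[0], query)))
        return Counter(label for _, label in by_dist[:k]).most_common(1)[0][0]

    print(knn_predict((2, 2)))   # 'A'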
Advantages:
Simple to implement and intuitive.
No training phase; new data can be added at any time.
Works naturally for multi-class problems.
Disadvantages:
Prediction is slow on large datasets, since distances to all stored points
must be computed.
Sensitive to irrelevant features and to the scale of the data.
The value of k must be chosen carefully.
39. Apply the data set of question 33 for the Naïve Bayes Classification also.
Secular Trend: In a secular trend, the changes that occur are the result of the
general tendency of the data to increase or decrease. The sales record of any
product may increase or decrease due to the general tendency of common
people, which is known as the secular trend.
Time series relating to economics, business, and commerce may show an
upward or increasing tendency, whereas time series relating to death rates,
birth rates, share prices, etc. may show a downward or decreasing tendency.
Planning for the future is an important aspect of any working organization,
and it can be done by analyzing time series data.
The long-run success of any organization depends on how well the business
manager can predict or forecast the future trend, and the future trend can be
predicted from time series data.
3. Mention the merits and demerits of Moving Average Method & Semi
Average Method.
The method assumes a straight-line relationship between the plotted points,
without considering whether that relationship actually exists.
If we add more data to the original data, then we have to repeat the whole
process for the new data to get the trend values, and the trend line also
changes.
4. Distinguish between ‘seasonal’ and ‘cyclical’ fluctuations in time series
data.
If the fluctuations are not of a fixed frequency then they are cyclic; if the
frequency is unchanging and associated with some aspect of the calendar, then
the pattern is seasonal.
5. Find the trend for the following series using a three-year moving average.
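The series itself is not reproduced here; as a generic sketch, a three-year moving average can be computed as follows (the yearly values are assumed):

    # Three-year moving average trend (toy yearly values)
    values = [24, 27, 26, 30, 29, 33, 35]
    window = 3
    trend = [round(sum(values[i:i + window]) / window, 2)
             for i in range(len(values) - window + 1)]
    print(trend)   # each average is centred on the middle year of its window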
7. Fit a straight-line trend equation by the method of least squares from the
following data and then estimate the trend value for the year 2025.
E.g., the same sum with a different dataset.
8. Assuming a four-yearly cycle, calculate the trend by the method of moving
averages from the following data relating to the production of tea in India:
Same as sum no 7, just put x=1961 & calculate the answer at the last step.
In the real world, we are surrounded by humans who can learn everything
from their experiences with their learning capability, and we have computers
or machines which work on our instructions. But can a machine also learn from
experiences or past data like a human does? So here comes the role of
Machine Learning. Machine Learning is a subset of artificial intelligence.
Applications:
11. What is cloud computing? What are the benefits of cloud computing?
What are the different layers in cloud computing?
The term cloud refers to a network or the internet. It is a technology that uses
remote servers on the internet to store, manage, and access data online rather
than local drives. The data can be anything such as files, images, documents,
audio, video, and more.
Benefits:
Agility, high availability & reliability, high scalability, multi-sharing, device and
location independence, maintenance, low cost, service in the pay per use
mode.
The cloud computing layers that are available: infrastructure as a service (IaaS),
platform as a service (PaaS), and software as a service (SaaS).