Mmds
SHORT QUESTIONS
1. What is data mining? List a few techniques of data mining.
a) Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It can help them to
develop more effective marketing strategies, increase sales, and decrease costs. Data mining
relies on effective data collection, warehousing, and computer processing.
TECHNIQUES OF DATA MINING:
1) Association rules
2) Clustering
3) Prediction
4) k-nearest neighbors (KNN)
5) Decision trees
6) Neural networks
7) Classification
Natural Logarithms: Data transformation using natural logarithms can help normalize data,
making it suitable for various statistical and machine learning techniques. For example, when
working with skewed data distributions, taking the logarithm of the data can make it more
amenable to linear modeling and analysis.
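The effect of a log transform on skewed data can be checked numerically. The following is a minimal sketch, assuming a small invented sample and using population skewness as the measure; neither the data nor the `skewness` helper comes from the notes.

```python
import math
from statistics import mean, pstdev

# Hypothetical right-skewed sample, invented for illustration only.
values = [1, 2, 2, 3, 4, 5, 8, 20, 100, 1000]

def skewness(xs):
    """Population skewness: third central moment divided by the cubed std. deviation."""
    m = mean(xs)
    s = pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

# The natural logarithm compresses large values far more than small ones.
# It is only defined for strictly positive inputs.
log_values = [math.log(v) for v in values]

skew_raw = skewness(values)
skew_log = skewness(log_values)
print(f"skewness before: {skew_raw:.2f}, after log transform: {skew_log:.2f}")
```

For this sample the skewness drops after the transform, although the transformed data is still not perfectly symmetric.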
Supervised learning | Unsupervised learning
Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be used for those cases where we know the input as well as corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
ESSAY QUESTIONS
1. Explain Statistical Modelling
Predicting the number of people who will travel on a specific rail route is an example of
statistical modeling. To develop a statistical model, we would collect data on the number of
passengers who utilize the train route over time, as well as data on variables that might affect
passenger counts, such as time of day, day of the week, and weather.
Then, using statistical approaches such as regression analysis, we can determine the
correlations between these factors and the number of passengers utilizing the railway route.
For example, we might discover that the number of passengers is higher during rush hour and
on weekdays, and lower when it is raining.
We can apply this data to build a statistical model that forecasts the number of people who
would use the railway route depending on the time of day, day of the week, and weather
conditions. This model can then be used to anticipate future passenger numbers and make
resource allocation choices, such as adding trains during rush hour or offering
specials during severe weather.
It is essential in statistical modeling to pick an appropriate statistical model that fits the data
and to evaluate the model to ensure accuracy and reliability. This might include running the
model on a new set of data or employing statistical tests to assess the model’s performance.
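The fit-then-evaluate workflow described above can be sketched with a one-variable least-squares regression. This is only an illustration: the rainfall and passenger figures are invented, and `fit_line` and `predict` are hypothetical helper names, not part of any library.

```python
# Hypothetical observations: daily rainfall (mm) and passengers on the route.
train_rain = [0, 1, 2, 3, 4]
train_passengers = [500, 482, 459, 441, 418]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train_rain, train_passengers)

def predict(rain_mm):
    """Forecast passenger count from rainfall using the fitted line."""
    return intercept + slope * rain_mm

# Evaluate on a new (also hypothetical) set of data the model has not seen.
test_rain = [5, 6]
test_passengers = [400, 377]
errors = [abs(predict(x) - y) for x, y in zip(test_rain, test_passengers)]
mean_abs_error = sum(errors) / len(errors)
print(f"slope={slope}, intercept={intercept}, MAE={mean_abs_error}")
```

A real model would use more predictors (time of day, day of week) and far more data; the single predictor here keeps the arithmetic easy to follow.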
Types of Statistical Models
Unsupervised Learning:
● Clustering: Clustering algorithms group similar data points into clusters or
segments. K-means clustering, hierarchical clustering, and DBSCAN are
popular methods for unsupervised clustering.
● Association Rules: Association rule mining, often used in market basket
analysis, identifies relationships between items in a dataset. Apriori and FP-
growth are well-known algorithms for this purpose.
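As an illustration of the clustering bullet above, here is a minimal K-means sketch in plain Python. The points are invented, and the initialisation (taking the first k points as centroids) is a simplifying assumption; practical implementations usually pick starting centroids more carefully.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive deterministic initialisation
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical 2-D points forming two visually separate groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```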
Dimensionality Reduction:
● Principal Component Analysis (PCA): PCA is used to reduce the
dimensionality of data while preserving as much variance as possible. It is
valuable for visualizing high-dimensional data and eliminating
multicollinearity.
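The idea behind PCA can be sketched for the two-dimensional case, where the eigenvalues of the covariance matrix have a closed form. The data below is invented, and the closed-form shortcut is an assumption that only works for 2-D data; general PCA relies on a numerical eigendecomposition or SVD.

```python
import math

# Hypothetical 2-D data lying close to a straight line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(points)

# Centre the data on its mean.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Sample covariance matrix [[a, b], [b, c]].
a = sum(dx * dx for dx, _ in centered) / (n - 1)
c = sum(dy * dy for _, dy in centered) / (n - 1)
b = sum(dx * dy for dx, dy in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# Unit eigenvector for the largest eigenvalue (valid because b != 0 here).
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Reduce 2-D points to 1-D by projecting onto the first principal component.
projected = [dx * pc1[0] + dy * pc1[1] for dx, dy in centered]
explained = lam1 / (lam1 + lam2)
print(f"share of variance kept by the first component: {explained:.3f}")
```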
Ensemble Learning:
● Ensemble methods combine multiple models to improve predictive accuracy
and robustness. Examples include Random Forests, Gradient Boosting, and
AdaBoost.
Deep Learning:
● Deep neural networks, including convolutional neural networks (CNNs) for
image data and recurrent neural networks (RNNs) for sequential data, are used
for tasks like image recognition, natural language processing, and time series
analysis.
Reinforcement Learning:
● Reinforcement learning is applied when an agent must learn how to make
sequential decisions by interacting with an environment. It is commonly
used in robotics, game playing, and autonomous systems.
Graph Mining:
● Graph mining and analysis are used for tasks involving networks and
relationships. Algorithms like PageRank, community detection, and graph
neural networks are applied to social networks, recommendation systems, and
network analysis.
Interpretable Models:
● In some cases, interpretable models like decision trees and linear regression
are preferred to gain insights and explainability, especially in regulated
industries.
4. Illustrate the classic problems in machine learning that are highly related to data
mining
a)
1) Classification
2) Clustering
3) Regression
4) Anomaly detection
5) Feature reduction
UNIT-II
Short Questions
1. Define Confidence and Support.
a)
Support: Support is often used with a threshold to identify itemsets that occur
frequently enough to be of interest.
Confidence: Confidence is often used with a threshold to identify rules that are strong
enough to be of interest.
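For reference, the standard textbook definitions can be written as follows. The symbols are introduced here and are not in the notes: N is the total number of transactions and \sigma(X) is the number of transactions that contain every item of itemset X.

```latex
\[
\operatorname{support}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{N},
\qquad
\operatorname{confidence}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
\]
```

Support measures how often the items of a rule appear together in the dataset, while confidence measures how often Y appears among the transactions that contain X.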
2. Define Frequent itemset, Maximal Frequent Itemset and closed frequent itemset
a) Frequent itemset: Frequent itemsets are a fundamental concept in association
rule mining, a data mining technique used to discover relationships between
items in a dataset. Frequent itemsets are not themselves association rules;
association rules are generated from them. The goal of association rule mining
is to identify items that frequently occur together.
A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of
those transactions, the support count for {milk, bread} is 20.
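The support count described above can be computed directly. The sketch below also enumerates all frequent itemsets level by level in the style of Apriori; the transactions are invented for illustration, and `support_count` and `frequent_itemsets` are hypothetical helper names.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_count):
    """Level-wise search: size-k candidates come only from frequent (k-1)-itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_count]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support_count(s, transactions)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, k - 1))
                 and support_count(c, transactions) >= min_count]
    return result

result = frequent_itemsets(transactions, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support count of 2, no 3-itemset survives the prune step in this toy dataset, so the search stops after the pairs.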
Maximal frequent itemset:
A maximal frequent itemset is a frequent itemset for which none of its
immediate supersets is frequent. In the itemset lattice, itemsets fall into
two groups, frequent and infrequent; the maximal frequent itemsets lie on
the border between the two groups.
Closed Frequent Itemset:
A closed frequent itemset is a frequent itemset for which there is no other frequent
itemset that has the same support and is a proper superset of it. In other words, it is a
frequent itemset that cannot be extended with any additional item without its support dropping.
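The difference between closed and maximal frequent itemsets can be made concrete with a small sketch. The table of frequent itemsets and support counts below is hypothetical, and `is_closed` and `is_maximal` are illustrative helper names.

```python
# Hypothetical frequent itemsets with their support counts (min count = 2).
frequent = {
    frozenset({"milk"}): 4,
    frozenset({"bread"}): 4,
    frozenset({"butter"}): 2,
    frozenset({"eggs"}): 2,
    frozenset({"milk", "bread"}): 3,
    frozenset({"milk", "eggs"}): 2,
    frozenset({"bread", "butter"}): 2,
}

def is_closed(itemset):
    """Closed: no proper superset has the same support count."""
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

def is_maximal(itemset):
    """Maximal: no proper superset is frequent at all."""
    return not any(itemset < other for other in frequent)

closed = {s for s in frequent if is_closed(s)}
maximal = {s for s in frequent if is_maximal(s)}

print("closed: ", sorted(sorted(s) for s in closed))
print("maximal:", sorted(sorted(s) for s in maximal))
```

Here {butter} is frequent but not closed, because {bread, butter} has the same support count. Every maximal frequent itemset is also closed, but not the other way round.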
3. What are the different ways of improving the efficiency of Apriori algorithm
a) Some methods for improving the efficiency of the Apriori algorithm are:
1) Hash-based technique
2) Transaction reduction
3) Partitioning
4) Sampling
5) Dynamic itemset counting
4. What are the three major components of the Apriori algorithm in data mining
a) The three major components of the Apriori algorithm in data mining
are as follows.
1. Support
2. Confidence
3. Lift
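The three measures can be computed for a single rule as a minimal sketch. The transactions and the rule {milk} → {bread} are invented for illustration, and `rule_metrics` is a hypothetical helper name.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    both = count(antecedent | consequent)
    support = both / n
    confidence = both / count(antecedent)
    # Lift compares the confidence with the baseline frequency of the consequent.
    lift = confidence / (count(consequent) / n)
    return support, confidence, lift

support, confidence, lift = rule_metrics({"milk"}, {"bread"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```

A lift below 1, as in this toy data, means the consequent is slightly less likely when the antecedent is present than it is overall; a lift above 1 indicates a positive association.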
A frequent-pattern tree (FP-tree) is built from the initial itemsets of the database. The
purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree
represents an item of an itemset.
The root node represents null, while the lower nodes represent the items.
FP Growth Algorithm?
The FP-Growth Algorithm is an alternative way to find frequent itemsets without
using candidate generation, thus improving performance. To do so, it uses a
divide-and-conquer strategy. The core of this method is the use of a special data
structure named the frequent-pattern tree (FP-tree), which retains the itemset
association information.
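The construction of an FP-tree can be sketched as follows. This covers only the tree-building phase, not the recursive mining of conditional trees, and it omits the header table with node links that a full implementation keeps. The transactions, the `FPNode` class, and `build_fp_tree` are all hypothetical names introduced for illustration.

```python
from collections import Counter

class FPNode:
    """One node of the tree: an item, a count, and child nodes keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item          # None for the root (the "null" node)
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First pass: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Second pass: insert each transaction with items in descending frequency
    # (ties broken alphabetically) so shared prefixes share a path.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

def show(node, depth=0):
    """Print the tree with indentation, one node per line."""
    for child in sorted(node.children.values(), key=lambda c: c.item):
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]
root = build_fp_tree(transactions, min_count=2)
show(root)
```

Because transactions that share a frequent prefix share a path, the tree is usually much smaller than the database, and only two passes over the data are needed to build it.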
Association rule learning is an important concept in machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc.
Essay Questions
1. Find the frequent itemsets using the Apriori algorithm and generate association rules.
Assume a minimum support threshold (s = 33.33%) and a minimum confidence
threshold (c = 60%) …