MMDS

Unit-I

SHORT QUESTIONS
1. What is data mining? List a few techniques of data mining.
a) Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It can help them
develop more effective marketing strategies, increase sales, and decrease costs. Data mining
relies on effective data collection, warehousing, and computer processing.
Techniques of data mining:
1) Association rules
2) Clustering
3) Prediction
4) k-nearest neighbors (k-NN)
5) Decision trees
6) Neural networks
7) Classification

2. Define Hash Functions and Natural Logarithms


a) Hash Functions: In data mining, a hash function is a mathematical function that takes an
input (or "key") and produces a fixed-size output, typically a numerical value or a
hexadecimal string. This output is commonly referred to as the hash code or hash value.
Hash functions are used for various purposes in data mining, including data indexing, data
deduplication, data summarization, and more.
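
As a minimal sketch, here is one way to hash record keys into index buckets using Python's standard hashlib module (the key and bucket count are made up for illustration):

import hashlib

def bucket_for(key: str, num_buckets: int = 16) -> int:
    # Map a record key to one of num_buckets index buckets.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()  # fixed-size hex hash value
    return int(digest, 16) % num_buckets                   # fold the hash into a bucket index

print(bucket_for("customer-42"))  # the same key always maps to the same bucket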

Natural Logarithms: Data transformation using natural logarithms can help normalize data,
making it suitable for various statistical and machine learning techniques. For example, when
working with skewed data distributions, taking the logarithm of the data can make it more
amenable to linear modeling and analysis.
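
For instance, a short sketch with assumed values, showing how the natural logarithm compresses a right-skewed range:

import numpy as np

# Right-skewed values (e.g., purchase amounts); the numbers are made up
amounts = np.array([3.0, 5.0, 8.0, 12.0, 250.0, 900.0])

log_amounts = np.log(amounts)   # the natural log pulls in the large values
print(log_amounts.round(2))     # [1.1  1.61 2.08 2.48 5.52 6.8 ]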

3. What is machine learning?


A) Machine learning is a set of methods, tools, and computer algorithms used to train
machines to analyze, understand, and find hidden patterns in data and make predictions. The
eventual goal of machine learning is to utilize data for self-learning, eliminating the need to
program machines explicitly. Once trained on datasets, machines can apply the learned
patterns to new data and thus make better predictions.

4. Differentiate supervised and unsupervised learning.


A)

- Training data: Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
- Feedback: A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
- Objective: A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
- Inputs: In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided.
- Goal: The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights in an unknown dataset.
- Supervision: Supervised learning needs supervision to train the model; unsupervised learning does not.
- Problem types: Supervised learning is categorized into classification and regression problems; unsupervised learning is classified into clustering and association problems.
- Applicability: Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
- Accuracy: A supervised learning model generally produces accurate results; an unsupervised learning model may give less accurate results by comparison.
- Relation to AI: Supervised learning is considered less close to true artificial intelligence, since we first train the model on labeled data before it can predict correct outputs; unsupervised learning is considered closer to true artificial intelligence, as it learns from experience much as a child learns daily routines.
- Algorithms: Supervised learning includes algorithms such as linear regression, logistic regression, support vector machines, multi-class classification, decision trees, and Bayesian logic; unsupervised learning includes algorithms such as clustering, KNN, and the Apriori algorithm.

5. Define feature extraction and list any two feature extraction techniques.


a) Feature extraction is a crucial aspect of data mining, particularly when dealing with large
and complex datasets. In data mining, feature extraction refers to the process of selecting,
transforming, or creating new features from the raw data to prepare it for analysis.
Before data mining can be effectively performed, raw data needs to be preprocessed. This
often involves handling missing values, dealing with outliers, and cleaning the data. Once the
data is prepared, feature extraction comes into play.
Two feature extraction techniques are:
1) Principal Component Analysis (PCA)
2) Linear Discriminant Analysis (LDA)

ESSAY QUESTIONS
1. Explain statistical modeling.

A) Statistical modeling is the process of describing the connections between variables in a
dataset using mathematical equations and statistical approaches. In statistical modeling, we
use a collection of statistical methods to investigate the connections between variables and
uncover patterns in data.

Predicting the number of people who will travel on a specific rail route is an example of
statistical modeling. To develop a statistical model, we would collect data on the number of
passengers who utilize the train route over time, as well as data on variables that might affect
passenger counts, such as time of day, day of the week, and weather.

Then, using statistical approaches such as regression analysis, we can determine the
correlations between these factors and the number of passengers using the railway route.
For example, we might discover that the number of passengers is higher during rush hour
and on weekdays, and lower when it is raining.

We can apply this data to build a statistical model that forecasts the number of people who
will use the railway route depending on the time of day, day of the week, and weather
conditions. This model can then be used to anticipate future passenger numbers and to make
resource-allocation choices, such as adding extra trains during rush hour or offering
promotions during severe weather.

It is essential in statistical modeling to pick an appropriate statistical model that fits the data
and to evaluate the model to ensure accuracy and reliability. This might include running the
model on a new set of data or employing statistical tests to assess the model’s performance.
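
As a minimal sketch of such a model, here is an ordinary least-squares fit in Python on hypothetical ridership data (all numbers are made up):

import numpy as np

# Hypothetical observations: [hour_of_day, is_weekday, is_raining] -> passengers
X = np.array([[8, 1, 0], [9, 1, 0], [14, 1, 0],
              [8, 0, 0], [18, 1, 1], [18, 1, 0]], dtype=float)
y = np.array([520, 480, 210, 150, 390, 450], dtype=float)

# Fit passengers ~ b0 + b1*hour + b2*weekday + b3*rain by least squares
A = np.hstack([np.ones((len(X), 1)), X])    # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_trip = np.array([1.0, 8.0, 1.0, 1.0])   # intercept, 8am, weekday, raining
print(coef, new_trip @ coef)                # fitted coefficients and a forecast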

2. What are the various computational approaches to modeling?


A) In data mining, computational approaches to modeling involve the use of various
algorithms and techniques to analyze and extract valuable patterns, knowledge, and insights
from large and complex datasets. These approaches help in making predictions, identifying
trends, and uncovering hidden relationships within the data. Here are some key computational
approaches to modeling in data mining:
Supervised Learning:
● Classification: Classification models are used to categorize data into
predefined classes or labels. Algorithms like decision trees, support vector
machines, and neural networks are commonly employed for this task.

● Regression: Regression models predict a continuous numerical value or output
based on input features. Linear regression, polynomial regression, and
regression trees are examples of regression techniques.
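
A brief sketch of both supervised tasks with scikit-learn (the training data is toy data, invented for illustration):

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a discrete label from two features
X_cls = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_cls = ["spam", "ham", "spam", "ham"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[1, 1]]))    # -> ['ham']

# Regression: predict a continuous value from one feature
X_reg = [[1], [2], [3], [4]]
y_reg = [2.1, 3.9, 6.2, 8.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))       # roughly 10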

Unsupervised Learning:
● Clustering: Clustering algorithms group similar data points into clusters or
segments. K-means clustering, hierarchical clustering, and DBSCAN are
popular methods for unsupervised clustering.
● Association Rules: Association rule mining, often used in market basket
analysis, identifies relationships between items in a dataset. Apriori and
FP-growth are well-known algorithms for this purpose.
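
For example, a minimal k-means sketch with scikit-learn on toy 2-D points:

from sklearn.cluster import KMeans

# Six points forming two loose groups (invented for illustration)
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the two learned centroids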

Dimensionality Reduction:
● Principal Component Analysis (PCA): PCA is used to reduce the
dimensionality of data while preserving as much variance as possible. It is
valuable for visualizing high-dimensional data and eliminating
multicollinearity.

● t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear
dimensionality reduction technique that is especially useful for visualizing and
exploring complex data patterns.
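
A small PCA sketch with scikit-learn; the 5-feature dataset here is random, purely to show the shapes involved:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))        # assumed 100-row, 5-feature dataset

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)       # project onto the top 2 components
print(reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured per component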

Feature Selection and Extraction:


● Feature selection methods like mutual information, chi-squared tests, and
recursive feature elimination help identify the most relevant features for
modeling.

● Feature extraction techniques, such as Principal Component Analysis (PCA)
and Linear Discriminant Analysis (LDA), create new features or transform
existing ones to enhance modeling.
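
As an illustration of filter-based selection, a hedged scikit-learn sketch using the chi-squared test on the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 non-negative features
selector = SelectKBest(chi2, k=2).fit(X, y)  # keep the 2 most informative features
print(selector.get_support())                # boolean mask over the original features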

Ensemble Learning:
● Ensemble methods combine multiple models to improve predictive accuracy
and robustness. Examples include Random Forests, Gradient Boosting, and
AdaBoost.
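
A minimal random-forest sketch (the dataset is synthetic, generated just for the example):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.score(X, y))   # training accuracy of the 100-tree ensemble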

Deep Learning:
● Deep neural networks, including convolutional neural networks (CNNs) for
image data and recurrent neural networks (RNNs) for sequential data, are used
for tasks like image recognition, natural language processing, and time series
analysis.

Time Series Analysis:


● Time series forecasting methods, such as autoregressive integrated moving
average (ARIMA) and seasonal decomposition of time series (STL), are
employed for modeling and predicting time-dependent data.

Text Mining and Natural Language Processing (NLP):


● Techniques in NLP are used for sentiment analysis, text classification, topic
modeling, and information retrieval from unstructured text data. Algorithms
like Word2Vec and BERT have shown substantial success in this domain.
Anomaly Detection:
● Anomaly detection models identify unusual or rare instances in the data,
which is valuable for fraud detection, network security, and quality control.
Methods include Isolation Forests and One-Class SVM.
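
A short Isolation Forest sketch on made-up points with two planted outliers:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))         # the bulk of the data
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])   # two obvious anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(random_state=0).fit(X)
print(iso.predict(outliers))   # -1 marks a predicted anomaly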

Reinforcement Learning:
● Reinforcement learning is applied when modeling agents must learn how to
make sequential decisions by interacting with an environment. It is commonly
used in robotics, game playing, and autonomous systems.

Graph Mining:
● Graph mining and analysis are used for tasks involving networks and
relationships. Algorithms like PageRank, community detection, and graph
neural networks are applied to social networks, recommendation systems, and
network analysis.

Big Data and Distributed Computing:


● When dealing with massive datasets, distributed computing frameworks like
Hadoop and Spark are employed for parallel processing and distributed
machine learning.

Interpretable Models:
● In some cases, interpretable models like decision trees and linear regression
are preferred to gain insights and explainability, especially in regulated
industries.

3. Explain feature extraction and its techniques.


A) Feature extraction is a crucial aspect of data mining, particularly when dealing with large
and complex datasets. In data mining, feature extraction refers to the process of selecting,
transforming, or creating new features from the raw data to prepare it for analysis.
Before data mining can be effectively performed, raw data needs to be preprocessed. This
often involves handling missing values, dealing with outliers, and cleaning the data. Once the
data is prepared, feature extraction comes into play.
One of the primary objectives of feature extraction is dimensionality reduction. Large
datasets with numerous features can be computationally intensive and can lead to overfitting
in data mining models. Feature extraction methods aim to reduce the number of features
while preserving the most critical information.
Feature extraction can be divided into two broad categories: linear and non-linear.
Feature Extraction Techniques:
○ Principal Component Analysis (PCA): PCA is a widely used technique for
linear dimensionality reduction. It is an unsupervised learning algorithm that
identifies orthogonal axes (principal components) in the data that capture the
most significant variance and projects the data onto a lower-dimensional
space.

○ Linear Discriminant Analysis (LDA): LDA is used when dealing with
classification tasks. It aims to find directions that maximize class separability.

○ Feature Selection: Feature selection methods involve choosing a subset of the
original features that are most relevant to the mining task. These methods can
be filter-based, wrapper-based, or embedded in the modeling process.

○ Manifold Learning: Non-linear techniques like t-distributed stochastic
neighbor embedding (t-SNE) and Isomap are useful for capturing complex
data patterns that linear techniques like PCA might miss.
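
To illustrate the supervised variant, a hedged LDA sketch with scikit-learn on the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # directions chosen to separate the 3 classes
print(X_lda.shape)                # (150, 2)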

4. Illustrate the classic problems in machine learning that are highly related to data
mining.
a) Classification
Clustering
Regression
Anomaly detection
Feature reduction

UNIT-II
Short Questions
1. Define Confidence and Support.
a)
- Definition: Support is a measure of the number of times an itemset appears in a dataset; confidence is a measure of the likelihood that an itemset will appear if another itemset appears.
- Calculation: Support is calculated by dividing the number of transactions containing an itemset by the total number of transactions; confidence is calculated by dividing the number of transactions containing both itemsets by the number of transactions containing the first itemset.
- Purpose: Support is used to identify itemsets that occur frequently in the dataset; confidence is used to evaluate the strength of a rule.
- Thresholds: Support is often used with a threshold to identify itemsets that occur frequently enough to be of interest; confidence is often used with a threshold to identify rules that are strong enough to be of interest.
- Interpretation: Support is interpreted as the percentage of transactions in which an itemset appears; confidence is interpreted as the percentage of transactions in which the second itemset appears given that the first itemset appears.
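
A tiny worked example in Python (the five transactions are made up):

# Five assumed market-basket transactions
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "bread"},
]
n = len(transactions)

sup_milk = sum("milk" in t for t in transactions) / n             # 4/5 = 0.8
sup_both = sum({"milk", "bread"} <= t for t in transactions) / n  # 3/5 = 0.6
conf = sup_both / sup_milk    # confidence(milk -> bread) = 0.6 / 0.8

print(sup_both, conf)         # support = 0.6, confidence = 0.75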

2. Define Frequent itemset, Maximal Frequent Itemset and closed frequent itemset
a) Frequent itemset: Frequent itemsets are a fundamental concept in association
rule mining, a technique used in data mining to discover relationships between
items in a dataset. The goal of association rule mining is to identify
relationships between items in a dataset that occur frequently together.
A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a
dataset contains 100 transactions and the item set {milk, bread} appears in 20 of
those transactions, the support count for {milk, bread} is 20.
Maximal frequent itemset:
A maximal frequent itemset is a frequent itemset for which none of its
immediate supersets is frequent. The itemsets in the lattice are thereby divided
into two groups: those that are frequent and those that are infrequent.
Closed Frequent Itemset:
A closed frequent itemset is a frequent itemset for which there is no other frequent
itemset that has the same support and is a proper superset of it. In other words, it's a
frequent itemset that cannot be "closed" further without losing its support level.
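
As a small worked example with assumed support counts: suppose {A}, {B}, and {A, B} are frequent with supports 6, 5, and 4, and no superset of {A, B} is frequent. Then {A, B} is maximal (and also closed). {A} is not maximal, since its superset {A, B} is frequent, but {A} is closed, because no frequent superset of {A} shares its support of 6.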

3. What are the different ways of improving the efficiency of Apriori algorithm

a) Here are some methods to improve the efficiency of the Apriori algorithm:

1. Hash-Based Technique: This method uses a hash-based structure called a
hash table for generating the k-itemsets and their corresponding counts. It
uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of transactions
scanned in iterations. The transactions which do not contain frequent
items are marked or removed.
3. Partitioning: This method requires only two database scans to mine the
frequent itemsets. It says that for any itemset to be potentially frequent in
the database, it should be frequent in at least one of the partitions of the
database.
4. Sampling: This method picks a random sample S from Database D and
then searches for frequent itemset in S. It may be possible to lose a global
frequent itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate
itemsets at any marked start point of the database during the scanning of
the database.

4. What are the three major components of the Apriori algorithm in data mining

a) There are three major components of the Apriori algorithm in data mining
which are as follows.

1. Support
2. Confidence
3. Lift
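
Lift ties the first two together: lift(A -> B) = confidence(A -> B) / support(B). Continuing the worked support/confidence example above (assumed numbers), confidence(milk -> bread) = 0.75 and support(bread) = 0.8, so lift = 0.75 / 0.8 ≈ 0.94; a lift below 1 suggests the two items co-occur slightly less often than they would if they were independent.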

5. What is an FP-tree and the FP-Growth algorithm?


a) The frequent-pattern tree (FP-tree) is a compact data structure that stores
quantitative information about frequent patterns in a database. Each transaction is
read and then mapped onto a path in the FP-tree. This is done until all transactions
have been read. Different transactions with common subsets allow the tree to remain
compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The
purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the item set.

The root node represents null, while the lower nodes represent the item sets.

FP-Growth Algorithm:
The FP-Growth algorithm is an alternative way to find frequent itemsets without
using candidate generation, thus improving performance. To do so, it uses a
divide-and-conquer strategy. The core of this method is the use of a special data
structure named the frequent-pattern tree (FP-tree), which retains the itemset
association information.

This algorithm works as follows:

o First, it compresses the input database, creating an FP-tree instance to represent
frequent items.
o After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.
o Finally, each such database is mined separately.
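
A hedged sketch using the third-party mlxtend library (assuming it is installed; the transactions are toy data):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["milk", "bread"], ["milk", "bread", "butter"],
                ["bread", "butter"], ["milk"], ["milk", "bread"]]

# One-hot encode into the boolean DataFrame mlxtend expects
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine itemsets with support >= 0.6 via the FP-tree, with no candidate generation
print(fpgrowth(onehot, min_support=0.6, use_colnames=True))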

6. Define association rules for data mining.

a) Association rule learning is a type of unsupervised learning technique that checks
for the dependency of one data item on another data item and maps them accordingly
so that the relationship can be exploited profitably. It tries to find interesting relations
or associations among the variables of a dataset, relying on different rules to discover
them.

Association rule learning is one of the most important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc.
Essay Questions
1. Find the frequent itemsets using the Apriori algorithm and generate association rules.
Assume a minimum support threshold (s = 33.33%) and a minimum confidence
threshold (c = 60%) …

2. Explain the working of an FP algorithm with an example.


3. Explain the working of PCY algorithm with an example.
4. Describe the Apriori algorithm with steps to implement it. What are its key
principles and advantages?
5. Compare the Apriori algorithm with the FP-Growth algorithm. What are the key
differences and trade-offs between the two algorithms?
UNIT-III
Short Questions
1. What is clustering? Why do businesses need to do clustering?
2. What are the major clustering methods?
3. When should we use DBSCAN over k-means in clustering analysis?
4. Define Clustering Features in BIRCH algorithm
5. What is CF tree in BIRCH algorithm
Essay Questions
1. Describe the BIRCH clustering technique.
2. With an example describe the k-means algorithm.
3. Describe the DBSCAN clustering technique.
4. Explain CURE algorithm with a suitable example.
UNIT-IV
SHORT
1. Define the characteristics of data streams
ESSAY
1. What are data streams? Discuss the problems associated with data streams. (5 marks)
2. What is the need for Bloom filters? Explain how they work.
OR
Explain the role of a Bloom filter in a data stream.
3. Discuss about Data Stream Management.
4. Illustrate examples of data stream queries.
5. Discuss models of data stream processing.
6. Explain the algorithm used to count distinct elements in a stream.
Unit V
Short
1. How do you mine social network news feeds?
2. What is web mining?
ESSAY
1. What is MapReduce? Illustrate with a simple example how MapReduce works.
2. When we search on the internet, we want to see the most relevant pages. Discuss the
algorithms used to determine which pages are more authoritative on the internet,
based on their popularity, to ensure users see pages that are most likely to be of
use to them……
3. Is the Web a directed graph or an undirected graph? Discuss the two challenges of
web search with examples.
