
ASSIGNMENT-2ND
OF
DATA MINING AND DATA WAREHOUSING

SUBMITTED BY: RAKESH KUMAR

ROLL NO: 2100083

COURSE: B. TECH 7TH SEM/CSE

SUBMITTED TO: ER. HARKAMAL MAM


INDEX

SR. NO.   TITLE
Q1.       What are the different types of data mining techniques? Explain any one in detail.
Q2.       How does hierarchical clustering help in data mining? Discuss key issues.
Q3.       Explain association rule mining. What are the various algorithms for generating association rules? Discuss with examples.
Q1. What are the different types of data mining techniques? Explain any one in
detail.

Ans. Data mining involves the use of sophisticated data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as neural
networks or decision trees. Thus, data mining encompasses both analysis and prediction.

Drawing on methods and technologies from the intersection of machine learning, database
management, and statistics, professionals in data mining have devoted their careers to
understanding how to process huge amounts of data and draw conclusions from them. But
what methods do they use to make that happen?

Over the years, several major data mining techniques have been developed and used,
including association, classification, clustering, prediction, sequential patterns, and
regression.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata,
and it helps to classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining
functionalities, for example, discrimination, classification, clustering, characterization,
etc. Some frameworks are extensive and offer several data mining functionalities
together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, or a data warehouse-oriented
or database-oriented approach.
The classification can also take into account the level of user interaction involved in the
data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.
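
As a small illustration of the classification technique itself (as opposed to the taxonomy of
frameworks above), the sketch below trains a decision tree to assign class labels. The library
(scikit-learn), the features, and the labels are all assumptions chosen for the example, not
taken from the text.

    # Minimal classification sketch using a decision tree (scikit-learn assumed installed).
    from sklearn.tree import DecisionTreeClassifier

    # Toy training data: [age, income] -> class label ("yes" = buys, "no" = does not buy).
    X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000]]
    y_train = ["no", "yes", "yes", "no"]

    clf = DecisionTreeClassifier(max_depth=2)  # a shallow tree keeps the model readable
    clf.fit(X_train, y_train)

    print(clf.predict([[30, 70000]]))  # predicted class for an unseen customer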

2. Clustering:

Clustering is a division of information into groups of connected objects. Describing the data by
a few clusters inevitably loses certain fine details, but achieves simplification: the data are
modeled by their clusters. From a historical point of view, data modeling by clusters is rooted
in statistics, mathematics, and numerical analysis. From a machine learning point of view,
clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the
resulting framework represents a data concept. From a practical point of view, clustering
plays an important role in data mining applications, for example, scientific data exploration,
text mining, information retrieval, spatial database applications, CRM, Web analysis,
computational biology, medical diagnostics, and much more.

In other words, we can say that clustering analysis is a data mining technique used to identify
similar data. This technique helps to recognize the differences and similarities between data
points. Clustering is very similar to classification, but instead of assigning data to predefined
classes, it groups chunks of data together based on their similarities.
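
As a minimal sketch of one widely used clustering algorithm, k-means, the example below
groups toy 2-D points into two clusters. The algorithm choice and the scikit-learn library are
assumptions; the text does not prescribe either.

    # Minimal k-means clustering sketch (scikit-learn assumed installed).
    from sklearn.cluster import KMeans

    # Toy 2-D points forming two loose groups.
    points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(points)  # cluster index assigned to each point

    print(labels)                    # e.g. [0 0 0 1 1 1]
    print(kmeans.cluster_centers_)   # the two cluster centroids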

3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship
between variables in the presence of other factors. It is used to estimate the value of a
specific variable. Regression is primarily a form of planning and modeling. For example, we
might use it to project certain costs, depending on other factors such as availability, consumer
demand, and competition. Primarily, it captures the relationship between two or more
variables in the given data set.
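
To make the cost-projection example concrete, here is a minimal linear regression sketch
using NumPy's least-squares fit. The demand and cost figures are invented for illustration.

    # Minimal linear regression sketch using NumPy's least-squares polynomial fit.
    import numpy as np

    demand = np.array([10, 20, 30, 40, 50])      # hypothetical consumer demand
    cost = np.array([120, 210, 290, 405, 500])   # hypothetical observed costs

    slope, intercept = np.polyfit(demand, cost, deg=1)  # fit cost ~ slope*demand + intercept
    print(f"cost ~ {slope:.2f} * demand + {intercept:.2f}")
    print("projected cost at demand=60:", slope * 60 + intercept)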
4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of interactions
between data items within large data sets in different types of databases. Association rule
mining has several applications and is commonly used to find sales correlations in transaction
data or in medical data sets.

The way such an algorithm works is that you start with transaction data, for example, a list of
grocery items that you have been buying for the last six months. The algorithm then calculates
how often items are purchased together.

These are the three major measurement techniques:

o Support:
This measurement technique measures how often items A and B are purchased
together, relative to the entire dataset.
Support(A => B) = (transactions containing both A and B) / (total transactions)
o Confidence:
This measurement technique measures how often item B is purchased when item A is
purchased as well.
Confidence(A => B) = (transactions containing both A and B) / (transactions containing A)
o Lift:
This measurement technique compares the confidence of the rule with how often item
B is purchased overall, i.e. how much more likely B is when A is present.
Lift(A => B) = Confidence(A => B) / Support(B)
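
A minimal sketch computing the three measures for a rule A => B over a toy list of
transactions (the grocery data is invented for illustration):

    # Compute support, confidence, and lift for the rule A => B (toy data).
    transactions = [
        {"milk", "bread", "eggs"},
        {"milk", "bread"},
        {"milk", "eggs"},
        {"bread", "butter"},
    ]
    A, B = {"milk"}, {"bread"}

    n = len(transactions)
    n_a = sum(1 for t in transactions if A <= t)         # transactions containing A
    n_b = sum(1 for t in transactions if B <= t)         # transactions containing B
    n_ab = sum(1 for t in transactions if (A | B) <= t)  # transactions containing A and B

    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")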

5. Outlier detection:

This type of data mining technique relates to the observation of data items in the data set
which do not match an expected pattern or expected behavior. This technique can be used in
various domains like intrusion detection, fraud detection, etc. It is also known as Outlier
Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of
the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a
significant role in the data mining field and is valuable in numerous areas like network
intrusion identification, credit or debit card fraud detection, detecting outliers in wireless
sensor network data, etc.
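
One simple, commonly used approach (an assumption here; the text does not name a
specific method) is to flag points that lie more than a few standard deviations from the mean:

    # Minimal outlier-detection sketch using a z-score threshold (toy data).
    import statistics

    data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]  # one obvious outlier

    mean = statistics.mean(data)
    stdev = statistics.stdev(data)

    outliers = [x for x in data if abs(x - mean) / stdev > 2]  # > 2 standard deviations
    print(outliers)  # [25.0]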

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It consists of finding interesting subsequences in a set of
sequences, where the interestingness of a sequence can be measured in terms of different
criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns
in transaction data over some period of time.
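
A minimal sketch of the core operation, counting how many sequences contain a candidate
pattern's items in order (the clickstream data is invented for illustration):

    # Count the support of a candidate sequential pattern: how many sequences
    # contain the pattern's items in the same relative order (toy data).
    def contains_subsequence(sequence, pattern):
        it = iter(sequence)
        return all(item in it for item in pattern)  # consumes `it`, preserving order

    sequences = [
        ["login", "browse", "add_to_cart", "checkout"],
        ["login", "browse", "logout"],
        ["browse", "add_to_cart", "checkout"],
    ]
    pattern = ["browse", "checkout"]

    support = sum(contains_subsequence(s, pattern) for s in sequences)
    print(f"{support}/{len(sequences)} sequences contain {pattern}")  # 2/3 here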

7. Prediction:

Prediction uses a combination of other data mining techniques, such as trends, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.

Q2: How does hierarchical clustering help in data mining? Discuss key issues.

Ans. Hierarchical clustering refers to an unsupervised learning procedure that determines
successive clusters based on previously defined clusters. It works by grouping data into a tree
of clusters. Hierarchical clustering starts by treating each data point as an individual cluster.
The endpoint is a set of clusters, where each cluster is distinct from the other clusters, and
the objects within each cluster are broadly similar to one another.

There are two types of hierarchical clustering:

o Agglomerative Hierarchical Clustering
o Divisive Hierarchical Clustering

Agglomerative hierarchical clustering

Agglomerative clustering is one of the most common types of hierarchical clustering, used to
group similar objects into clusters. Agglomerative clustering is also known as AGNES
(Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual
cluster, and at each step data objects are grouped in a bottom-up fashion. Initially, each data
object is in its own cluster. At each iteration, the closest clusters are merged, until only one
cluster remains.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the
proximity matrix).
3. Combine the most similar (closest) clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat step 3 and step 4 until you get a single cluster.
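
A compact sketch of these steps, using single-linkage distance on toy 1-D points (real
implementations rely on optimized library routines; this only makes the loop explicit):

    # Naive agglomerative clustering sketch (single linkage, toy 1-D points).
    points = [1.0, 1.5, 5.0, 5.2, 9.0]
    clusters = [[p] for p in points]  # step 1: each point is its own cluster

    def single_link(c1, c2):
        # cluster proximity = smallest pairwise distance between their points
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > 1:
        # steps 2-3: find the closest pair of clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        print(clusters)  # steps 4-5: repeat with the updated set of clusters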

Let’s understand this concept with the help of a graphical representation using a dendrogram.

The demonstration below shows how the actual algorithm works. No calculations have been
done here; all the proximities among the clusters are assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.


Step 1:

Consider each letter (P, Q, R, S, T, V) as an individual cluster and find the distance between
each individual cluster and all other clusters.

Step 2:

Now, merge the comparable clusters into single clusters. Let’s say cluster Q and cluster R are
similar to each other, and so are cluster S and cluster T, so we merge them in the second step.
We get the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximity as per the algorithm and combine the two closest clusters
[(ST)] and [(V)] together to form a new cluster, giving [(P), (QR), (STV)].

Step 4:

Repeat the same process. The clusters QR and STV are comparable and are combined together
to form a new cluster. Now we have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)].
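
For real data, a library computes the proximities and the merge order automatically. Below is
a sketch using SciPy (an assumed dependency), with invented 2-D coordinates standing in for
the points P, Q, R, S, T, V:

    # Dendrogram sketch for six labeled points using SciPy's hierarchical clustering.
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Invented 2-D coordinates standing in for P, Q, R, S, T, V.
    coords = [[0.0, 0.0], [2.0, 2.0], [2.1, 2.2], [5.0, 5.0], [5.1, 5.2], [6.5, 6.0]]
    labels = ["P", "Q", "R", "S", "T", "V"]

    Z = linkage(coords, method="single")  # bottom-up merges, single-linkage proximity
    dendrogram(Z, labels=labels)          # the tree of successive merges
    plt.show()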
Divisive Hierarchical Clustering

Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.

In divisive hierarchical clustering, all the data points start out in a single cluster, and in every
iteration, the data points that are not similar are split off from their cluster. The separated
data points are treated as new clusters. Finally, if the process is run to completion, we are left
with N clusters, one per data point.

Advantages of Hierarchical clustering

o It is simple to implement and gives the best output in some cases.
o It produces a hierarchy, a structure that contains more information than a flat set of
clusters.
o It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering

o It tends to break large clusters.
o It is difficult to handle clusters of different sizes and non-globular shapes.
o It is sensitive to noise and outliers.
o Once a merge or a split has been performed, it can never be undone.
Key Issues with Hierarchical Clustering:

1. Computational Complexity: Hierarchical clustering can be computationally intensive,


especially when dealing with large datasets. The time complexity can be high, making it
less suitable for big data scenarios.
2. Sensitivity to Noise and Outliers: Like many clustering techniques, hierarchical clustering
can be sensitive to noise and outliers in the data. Outliers might lead to the creation of
small, undesired clusters or influence the structure of the dendrogram.
3. Memory Usage: As the number of data points increases, the memory required to store
the distance matrix and the dendrogram also increases. This can be a limiting factor for
large datasets.
4. Difficulty in Identifying Optimal Clusters: Determining the optimal number of clusters
from the dendrogram can be subjective and challenging. Cutting the dendrogram at
different levels can lead to different interpretations and clusterings.
5. Scalability: Hierarchical clustering might not scale well to very large datasets due to its
quadratic or cubic time complexity, which can result in slower processing times.
6. Merge Order: Agglomerative hierarchical clustering methods make decisions about
which clusters to merge at each step. The order in which these merges occur can impact
the final clustering result, and there's no universally optimal sequence.
7. Distance Metric Selection: The choice of distance metric (Euclidean, Manhattan, etc.)
and linkage criterion (single, complete, average, etc.) can significantly influence the
clustering outcome.
Q3: Explain association rule mining. What are the various algorithms for
generating association rules? Discuss with examples.

Ans. Association rule mining is a data mining technique used to discover interesting
relationships or associations between items in a transactional database or dataset. These
associations are in the form of rules, often referred to as "if-then" rules, that express the
likelihood of certain items co-occurring in transactions. This technique is widely used in market
basket analysis, where the goal is to identify patterns of items that are frequently purchased
together.

An association rule has two parts: an antecedent (the "if" part) and a consequent (the "then"
part). The rule signifies that if the antecedent is present in a transaction, then the consequent
is likely to be present as well.

For example, in a retail setting:

• Antecedent: {Milk, Bread}
• Consequent: {Eggs}
• Association Rule: {Milk, Bread} => {Eggs}

Association rule mining algorithms use measures like support, confidence, and lift to quantify
the interestingness of rules. These measures help identify rules that are both frequent (occur
together often) and statistically significant.

Here are a few algorithms for generating association rules:

1. Apriori Algorithm: The Apriori algorithm is one of the most well-known algorithms for
association rule mining. It uses a "bottom-up" approach, starting with individual items
and iteratively extending the size of itemsets. It prunes itemsets that don't meet
minimum support thresholds.

Let's consider an example:

Transaction 1: {Milk, Bread, Eggs}
Transaction 2: {Milk, Bread}
Transaction 3: {Milk, Eggs}

Minimum Support: 2 out of 3 transactions (about 67%). Minimum Confidence: 100%.

Using these thresholds, we can generate the rule: {Bread} => {Milk} (Support = 2/3, Confidence
= 2/2 = 100%). Note that the reverse rule {Milk} => {Bread} would fail the confidence
threshold, since its confidence is only 2/3.
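
A minimal Apriori-style sketch that reproduces these itemset counts (toy, unoptimized code;
real implementations add candidate-generation and pruning optimizations):

    # Minimal Apriori-style sketch: count itemsets level by level, pruning
    # those below the minimum support (toy, unoptimized).
    transactions = [{"Milk", "Bread", "Eggs"}, {"Milk", "Bread"}, {"Milk", "Eggs"}]
    min_support = 2  # minimum number of transactions

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}

    # Bottom-up step: extend frequent itemsets by one item at a time.
    level = set(frequent)
    while level:
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in frequent if len(a | b) == size}
        level = {c for c in candidates if count(c) >= min_support}
        frequent |= level

    for itemset in sorted(frequent, key=len):
        print(set(itemset), count(itemset))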
2. FP-Growth (Frequent Pattern Growth) Algorithm: The FP-Growth algorithm is an
improvement over Apriori in terms of efficiency. It builds a data structure called an
FP-tree to store frequent itemsets and uses a recursive approach to generate association
rules.

Continuing from the previous example, the FP-Growth algorithm could discover the same rule:
{Bread} => {Milk} (Support = 2/3, Confidence = 2/2 = 100%)
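
Here is a sketch of the FP-tree construction step, the heart of FP-Growth: items in each
transaction are sorted by global frequency and inserted along a shared prefix path with
counts. The recursive mining step is omitted for brevity.

    # FP-tree construction sketch for the three example transactions.
    from collections import Counter

    transactions = [{"Milk", "Bread", "Eggs"}, {"Milk", "Bread"}, {"Milk", "Eggs"}]
    freq = Counter(i for t in transactions for i in t)  # Milk:3, Bread:2, Eggs:2

    class Node:
        def __init__(self, item):
            self.item, self.count, self.children = item, 0, {}

    root = Node(None)
    for t in transactions:
        node = root
        # sort by descending frequency so shared prefixes overlap maximally
        for item in sorted(t, key=lambda i: (-freq[i], i)):
            node = node.children.setdefault(item, Node(item))
            node.count += 1

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    show(root)  # Milk:3 with branches Bread:2 -> Eggs:1 and Eggs:1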

3. Eclat Algorithm: Eclat (Equivalence Class Transformation) is another algorithm for
association rule mining. It focuses on a vertical data format and uses intersection-based
techniques to find frequent itemsets efficiently.

In the example:

Transaction 1: {Milk, Bread, Eggs}

Transaction 2: {Milk, Bread}

Minimum Support: 50% (1 of 2 transactions)

Eclat might discover the rule:

{Milk, Bread} => {Eggs} (Support = 1/2, Confidence = 1/2)
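
A minimal Eclat-style sketch: each item is represented by its "tidset" (the set of transaction
IDs containing it), and itemset supports come from intersecting tidsets:

    # Eclat-style sketch: vertical data format with tidset intersections.
    transactions = {1: {"Milk", "Bread", "Eggs"}, 2: {"Milk", "Bread"}}

    # Vertical format: item -> set of transaction IDs (tidset).
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # Support of {Milk, Bread, Eggs} = size of the intersection of the tidsets.
    tids = tidsets["Milk"] & tidsets["Bread"] & tidsets["Eggs"]
    print(tids, len(tids) / len(transactions))  # {1} 0.5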

Association rule mining has applications beyond retail, including website usage analysis,
healthcare data analysis, and more. The choice of algorithm depends on the dataset size,
computational resources, and specific requirements of the analysis. It's important to adjust
support and confidence thresholds appropriately to discover meaningful and relevant rules.
