Assignment 2nd DMDW
ASSIGNMENT-2
OF
DATA MINING AND DATA
WAREHOUSING
Ans. Data mining uses sophisticated data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as neural
networks or decision trees. Thus, data mining involves both analysis and prediction.
Data mining professionals draw on methods and technologies from the intersection of machine
learning, database management, and statistics to process huge amounts of data and draw
conclusions from them. What methods do they use to make this happen?
In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data into different classes.
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining
functionalities, for example, discrimination, classification, clustering, characterization,
etc. Some frameworks tend to be extensive frameworks offering several data mining
functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented
or database-oriented, etc.
The classification can also take into account the level of user interaction involved in the
data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.
2. Clustering:
Clustering is the division of data into groups of similar objects. Describing the data by
a few clusters inevitably loses certain fine details but achieves simplification. It models
data by its clusters. From a historical point of view, data modeling roots clustering in
statistics, mathematics, and numerical analysis. From a machine learning point of view,
clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the
resulting framework represents a data concept. From a practical point of view, clustering
plays an outstanding role in data mining applications such as scientific data exploration,
text mining, information retrieval, spatial database applications, CRM, Web analysis,
computational biology, medical diagnostics, and many more.
In other words, clustering analysis is a data mining technique for identifying similar
data. It helps to recognize the differences and similarities between data items.
Clustering is similar to classification in that both group data, but clustering groups chunks
of data based on their similarities rather than assigning them to predefined classes.
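As a rough illustration of grouping data by similarity, the sketch below uses k-means, one common clustering algorithm that the text above does not name. The sample points, the starting centroids, and the function name `kmeans` are all assumptions made for this example.

```python
import math

def kmeans(points, centroids, max_iterations=20):
    """Minimal k-means: alternate assignment and centroid update until stable."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for c, members in zip(centroids, clusters)
        ]
        if updated == centroids:  # nothing moved, so the grouping is stable
            break
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No class labels are supplied; the two groups emerge only from the distances between points, which is what makes the search unsupervised.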
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship
between variables, that is, how one variable changes in the presence of other factors. It is
used to estimate the likely value of a specific variable. Regression is primarily a form of
planning and modeling. For example, we might use it to project certain costs, depending on
other factors such as availability, consumer demand, and competition. It estimates the
relationship between two or more variables in the given data set.
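A minimal sketch of the idea, assuming the simplest case of one input variable and an ordinary least-squares straight line. The demand and cost figures and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: cost rises with consumer demand.
demand = [10, 20, 30, 40, 50]
cost = [25, 45, 65, 85, 105]

slope, intercept = fit_line(demand, cost)
projected_cost = slope * 60 + intercept  # project the cost at a demand of 60
print(slope, intercept, projected_cost)
```

Real data rarely lies exactly on a line, so the fitted relationship is an estimate rather than an exact rule.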
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds
hidden patterns in the data set.
Association rules are if-then statements that help to show the probability of interactions
between data items within large data sets in different types of databases. Association rule
mining has several applications and is commonly used to find sales correlations in
transactional data or in medical data sets.
The algorithm works on a collection of data, for example, a list of grocery items
that you have been buying for the last six months. It calculates the percentage of items being
purchased together.
o Lift:
This measurement technique compares the confidence of a rule with how often
item B is purchased overall.
Lift = (Confidence) / ((Transactions containing item B) / (Entire dataset))
o Support:
This measurement technique measures how often multiple items are purchased together,
compared to the overall dataset.
Support = (Transactions containing both item A and item B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is purchased when item A is
purchased as well.
Confidence = (Transactions containing both item A and item B) / (Transactions containing item A)
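The three measures can be computed directly from a list of transactions. The sketch below assumes the usual definition of lift (confidence divided by the share of transactions containing item B). The grocery baskets and function names are invented for illustration.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """How often B is bought when A is bought."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of A => B relative to how often B is bought overall."""
    return confidence(a, b, transactions) / support(b, transactions)

# Hypothetical grocery baskets.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
l = lift({"bread"}, {"milk"}, transactions)
print(s, c, l)
```

A lift above 1 suggests the items are bought together more often than chance would predict. A lift below 1, as in this small sample, suggests the opposite.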
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set
that do not match an expected pattern or expected behavior. This technique may be used in
various domains like intrusion detection, fraud detection, etc. It is also known as outlier
analysis or outlier mining. An outlier is a data point that diverges too much from the rest of
the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a
significant role in the data mining field and is valuable in numerous fields like
network intrusion identification, credit or debit card fraud detection, detecting outlying
readings in wireless sensor network data, etc.
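One simple way to flag points that diverge from the rest is a z-score test, sketched below. The threshold of 2 standard deviations, the sensor readings, and the function name `zscore_outliers` are assumptions for this example. Other methods (interquartile range, density-based approaches) are equally valid choices.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing diverges
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = zscore_outliers(readings)
print(outliers)
```

Because an extreme value inflates both the mean and the standard deviation, the z-score test works best when outliers are rare.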
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data
to discover sequential patterns. It consists of finding interesting subsequences in a set of
sequences, where the interestingness of a sequence can be measured in terms of different
criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over a period of time.
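As a small sketch of the occurrence-frequency criterion, the code below counts in how many sequences one item is followed (not necessarily immediately) by another. It is restricted to patterns of length two for brevity. The purchase histories, the `min_support` value, and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Find pairs (a, b) where a occurs before b in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves order, so (a, b) means a appears before b.
        # The set ensures each pair is counted once per sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, each ordered by time.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["phone", "case"],
]

patterns = frequent_ordered_pairs(sequences, min_support=3)
print(patterns)
```

Here only "phone, later charger" occurs in at least three of the four histories.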
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.
Q2: How does hierarchical clustering help in data mining? Discuss key issues.
Agglomerative clustering is one of the most common types of hierarchical clustering used to
group similar objects into clusters. It is also known as AGNES
(Agglomerative Nesting). In agglomerative clustering, each data point starts as an individual
cluster, and at each step data objects are grouped in a bottom-up manner. At each iteration,
the closest clusters are merged, until only one cluster remains.
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (find the proximity
matrix).
3. Combine the most similar clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of a graphical representation using a dendrogram.
The following demonstration shows how the algorithm works. No calculation has been done
here; all the proximities among the clusters are assumed.
Step 1:
Consider each letter (P, Q, R, S, T, V) as an individual cluster and find the distance between
each individual cluster and all other clusters.
Step 2:
Now, merge the comparable clusters into a single cluster. Let’s say clusters Q and R are
similar to each other, and so are clusters S and T, so we can merge each pair in the second
step. We get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters,
(ST) and (V), to form the new clusters [(P), (QR), (STV)].
Step 4:
Repeat the same process. The clusters (STV) and (QR) are comparable and are combined to
form a new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
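The bottom-up merging described above can be sketched in a few lines. The source assumes the proximities rather than calculating them, so the 1-D coordinates below are invented purely to produce a comparable merge order. Single linkage (distance between the closest members) is also an assumption, as the text does not say which linkage is used.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on labeled 1-D points.

    Returns the list of clusters after every merge, starting with singletons.
    """
    clusters = [(label,) for label in points]
    history = [list(clusters)]

    def distance(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(abs(points[a] - points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters in the current proximity matrix.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.insert(i, merged)
        history.append(list(clusters))
    return history

# Hypothetical coordinates chosen only for illustration.
points = {"P": 0.0, "Q": 10.0, "R": 11.0, "S": 20.0, "T": 21.5, "V": 25.0}
history = agglomerative(points)
for state in history:
    print(state)
```

Unlike the walkthrough, this sketch merges exactly one pair per iteration, so Q with R and S with T are merged in two consecutive steps rather than one.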
Divisive Hierarchical Clustering
Ans. Association rule mining is a data mining technique used to discover interesting
relationships or associations between items in a transactional database or dataset. These
associations are in the form of rules, often referred to as "if-then" rules, that express the
likelihood of certain items co-occurring in transactions. This technique is widely used in market
basket analysis, where the goal is to identify patterns of items that are frequently purchased
together.
An association rule has two parts: an antecedent (the "if" part) and a consequent (the "then"
part). The rule signifies that if the antecedent is present in a transaction, then the consequent
is likely to be present as well.
Association rule mining algorithms use measures like support, confidence, and lift to quantify
the interestingness of rules. These measures help identify rules that are both frequent (the
items occur together often) and reliable (the consequent usually appears when the antecedent
does).
1. Apriori Algorithm: The Apriori algorithm is one of the most well-known algorithms for
association rule mining. It uses a "bottom-up" approach, starting with individual items
and iteratively extending the size of itemsets. It prunes itemsets that don't meet
minimum support thresholds.
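Let's consider an example:
Transaction 1: {Milk, Bread, Eggs}
Transaction 2: {Milk, Bread}
Transaction 3: {Milk, Eggs}
From these transactions, we can generate the rule: {Bread} => {Milk} (Support = 2/3, Confidence
= 2/2 = 100%), because Bread appears in two transactions and Milk appears in both of them. The
reverse rule {Milk} => {Bread} has the same support but a confidence of only 2/3, since Milk
appears in all three transactions.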
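A compact sketch of the level-wise Apriori procedure on the three transactions above. The minimum support of 0.6 and the function name `apriori` are assumptions chosen for illustration; the source does not state a threshold.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count support and keep only candidates that meet the threshold.
        level = {}
        for candidate in current:
            share = sum(candidate <= t for t in transactions) / n
            if share >= min_support:
                level[candidate] = share
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

transactions = [{"Milk", "Bread", "Eggs"}, {"Milk", "Bread"}, {"Milk", "Eggs"}]
frequent = apriori(transactions, min_support=0.6)

both = frequent[frozenset({"Milk", "Bread"})]
conf_bread_to_milk = both / frequent[frozenset({"Bread"})]
conf_milk_to_bread = both / frequent[frozenset({"Milk"})]
print(conf_bread_to_milk, conf_milk_to_bread)
```

The itemset {Bread, Eggs} appears in only one transaction and is dropped. As a result {Milk, Bread, Eggs} is pruned without ever being counted, which is the saving Apriori is designed to provide.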
2. FP-Growth (Frequent Pattern Growth) Algorithm: The FP-Growth algorithm is an
improvement over Apriori in terms of efficiency. It builds a compact data structure called an
FP-tree to store the frequent items of each transaction and mines the tree recursively to
find frequent itemsets without generating candidate itemsets. Association rules are then
derived from those itemsets.
Continuing from the previous example, the FP-Growth algorithm could discover the rule: {Bread}
=> {Milk} (Support = 2/3, Confidence = 2/2 = 100%).
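The sketch below shows only the first half of FP-Growth, building the FP-tree, for the three example transactions. The recursive mining step is omitted. The class name `FPNode`, the function name `build_fp_tree`, and the minimum count of 2 are assumptions made for this illustration.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction with items sorted by descending
    # frequency, so that common prefixes share the same branch.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

def show(node, depth=0):
    """Print the tree with indentation reflecting depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

transactions = [{"Milk", "Bread", "Eggs"}, {"Milk", "Bread"}, {"Milk", "Eggs"}]
root = build_fp_tree(transactions, min_count=2)
show(root)
```

All three transactions share the single Milk node at the top of the tree, so the data is stored once per shared prefix instead of once per transaction.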
Association rule mining has applications beyond retail, including website usage analysis,
healthcare data analysis, and more. The choice of algorithm depends on the dataset size,
computational resources, and specific requirements of the analysis. It's important to adjust
support and confidence thresholds appropriately to discover meaningful and relevant rules.