The document outlines the functionalities of data mining, categorizing tasks into descriptive and predictive types. It details various techniques such as classification, clustering, and association analysis, emphasizing the importance of identifying patterns and relationships within data. Additionally, it discusses the concepts of supervised and unsupervised learning, along with measures of interestingness for discovered patterns.
2 Data Mining Functionalities 14-12-2024
Data mining functionalities
March 6, 2025 SWE2009 - Data Mining Techniques 1
Introduction
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining tasks are classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Functionalities/Techniques
Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations and Correlations
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
Characterization and Discrimination
Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
It is useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived via
(1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or
(2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination.
Data Characterization: A data mining system should be able to produce a description summarizing the characteristics of a target class, such as a group of customers.
Example: Summarize the characteristics of customers who spend more than $1000 a year at AllElectronics. The result can be a general profile of these customers, covering attributes such as age, employment status, and credit rating.
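The characterization example above can be sketched in a few lines of Python. This is only an illustration: the customer records, the field names, and the `characterize` helper are all hypothetical and do not come from the AllElectronics data.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; every field and value is invented for illustration.
customers = [
    {"age": 45, "employment": "employed", "credit": "excellent", "annual_spend": 1800},
    {"age": 38, "employment": "employed", "credit": "good", "annual_spend": 1250},
    {"age": 52, "employment": "self-employed", "credit": "excellent", "annual_spend": 2400},
    {"age": 23, "employment": "student", "credit": "fair", "annual_spend": 300},
    {"age": 31, "employment": "employed", "credit": "good", "annual_spend": 700},
]

def characterize(records, threshold=1000):
    """Summarize the target class: customers spending more than `threshold` a year."""
    target = [r for r in records if r["annual_spend"] > threshold]
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "employment": Counter(r["employment"] for r in target),
        "credit": Counter(r["credit"] for r in target),
    }

profile = characterize(customers)
print(profile)
```

The returned dictionary plays the role of the general profile: one summary value per attribute of the target class. The sketch assumes at least one customer exceeds the threshold.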
Data Discrimination: A comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.
Example: The user may want to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% during the same period.
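A minimal sketch of the discrimination example follows. The product records, the two features compared (price and free-trial availability), and the exact thresholds used to select each class are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical product records: (name, sales_change_percent, price, has_free_trial).
products = [
    ("A", 12, 40, True),
    ("B", 15, 30, True),
    ("C", -32, 90, False),
    ("D", -30, 110, False),
    ("E", 3, 60, True),
]

def summarize(rows):
    """General features of one class: mean price and share of products with a free trial."""
    return {
        "mean_price": mean(r[2] for r in rows),
        "free_trial_share": sum(r[3] for r in rows) / len(rows),
    }

# Target class: sales up by at least 10%. Contrasting class: sales down by at least 30%.
target = [p for p in products if p[1] >= 10]
contrasting = [p for p in products if p[1] <= -30]

comparison = {"target": summarize(target), "contrasting": summarize(contrasting)}
print(comparison)
```

Placing the two summaries side by side is what distinguishes discrimination from characterization: the same features are reported for both classes so that the differences stand out.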
The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called characteristic rules).
Associations and correlations
Frequent Patterns: As the name suggests, patterns that occur frequently in data.
Frequent Itemset: A set of items that frequently appear together in a transactional data set, such as milk and bread.
Frequent Sequential Pattern: A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
Substructure: Refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
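To make the idea of a frequent itemset concrete, here is a small brute-force sketch. It simply counts every itemset of a given size and keeps those that appear in a large enough fraction of transactions. The transactions, the `frequent_itemsets` helper, and the 40% threshold are hypothetical, and this is plain enumeration rather than an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support is at least min_support."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
print(pairs)
```

Here milk and bread appear together in three of the five transactions, so the pair is reported as frequent.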
Association Analysis: From a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example: A rule mined from the AllElectronics transactional database:
buys(X, “Computers”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X represents a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that he/she will buy software as well. A support of 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting.
Suggested approach: human-centered, query-based, focused mining.
Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Support reflects the usefulness of a rule, and confidence reflects its certainty.
The support for a rule X ⇒ Y is the fraction of all transactions under analysis that contain both X and Y.
The confidence of a rule X ⇒ Y is the fraction of the transactions containing X that also contain Y.
In multidimensional databases, where each attribute is referred to as a dimension, a rule that involves more than one attribute can be referred to as a multidimensional association rule.
Support and Confidence
Support count: The support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions. Then, for a rule X ⇒ Y:
\[ \mathrm{support} = \frac{(X \cup Y).\mathrm{count}}{n} \]
\[ \mathrm{confidence} = \frac{(X \cup Y).\mathrm{count}}{X.\mathrm{count}} \]
Transactions (one per line):
Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Book
Uniform, Crayons, Bag
Bag, Pencil, Book
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag ⇒ Uniform = 5/8 = 0.625
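The two figures in this example can be checked with a short Python sketch. It assumes the example lists ten transactions of three items each (one reading of the flattened table), and the helper names `count`, `support`, and `confidence` are chosen here for illustration.

```python
# The ten baskets of the example, read as three-item transactions
# (an assumption about how the flattened table splits into rows).
transactions = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Book"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Book"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]

def count(itemset, transactions):
    """Support count: number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """(X ∪ Y).count / n"""
    return count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """(X ∪ Y).count / X.count"""
    return count(x | y, transactions) / count(x, transactions)

print(support({"Bag"}, {"Uniform"}, transactions))     # 0.5
print(confidence({"Bag"}, {"Uniform"}, transactions))  # 0.625
```

Bag appears in eight transactions and Bag together with Uniform in five, which gives the 5/10 and 5/8 ratios stated in the example.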
Motivation: Finding inherent regularities in data
What products were often purchased together? Bag and Uniform?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Associations and correlations
Another example:
age(X, 20…29) ∧ income(X, 20K-29K) ⇒ buys(X, “CD Player”) [support = 2%, confidence = 60%]
For customers between 20 and 29 years of age with an income of $20,000 to $29,000, there is a 60% chance that they will purchase a CD player, and 2% of all the transactions under analysis showed that customers in this age group with that range of income bought a CD player.
Classification and Prediction
Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
Construct models (functions) that describe and distinguish classes or concepts for future prediction.
Training data: used to build the model.
Test data: used to evaluate the model.
A classification model can be represented in various forms, such as:
IF-THEN rules
A decision tree
A neural network
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules.
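The conversion from a decision tree to classification rules can be sketched as a walk over every root-to-leaf path, emitting one IF-THEN rule per leaf. The tree below, its attribute names, and its class labels are hypothetical, and the nested-dictionary representation is just one convenient choice.

```python
# A hypothetical decision tree as nested dicts: internal nodes test an attribute,
# branches are attribute values, and leaves are class labels (plain strings).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": "does_not_buy",
    },
}

def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one IF-THEN rule per leaf."""
    if isinstance(node, str):  # leaf: a class label
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["attribute"], value),)))
    return rules

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each test along a path becomes one condition in the rule's antecedent, and the leaf supplies the predicted class.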
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
[Figure: Classification Model]
Cluster Analysis
Clustering analyses data objects without consulting a known class label.
It groups data elements into clusters based on the similarity between elements within a single group.
Objects are grouped by maximizing the intraclass similarity and minimizing the interclass similarity.
Example: Result analysis
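As an illustration of grouping without class labels, the sketch below runs a plain k-means procedure on one-dimensional data. It assumes that "result analysis" refers to grouping exam marks. The marks, the starting centers, and the `kmeans_1d` helper are all hypothetical, and k-means is only one of many clustering methods.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center, then recompute centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical exam marks; no class labels are supplied.
marks = [35, 38, 42, 61, 65, 68, 88, 91, 95]
centers, clusters = kmeans_1d(marks, centers=[30, 60, 90])
print(clusters)
```

Marks inside a cluster lie close to one another (high intraclass similarity) while the clusters themselves are far apart (low interclass similarity).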
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
"Outliers" are values that "lie outside" the other values.
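One simple way to flag values that lie outside the others is a robust score based on the median. The sketch below uses the modified z-score built on the median absolute deviation, with the conventional constant 0.6745 and cutoff 3.5. The charges and the `flag_outliers` helper are hypothetical, and this is only one of several possible outlier tests.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute deviation) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical charges on a single account.
charges = [42, 55, 38, 61, 47, 50, 2900, 44]
print(flag_outliers(charges))  # [2900]
```

The median is used instead of the mean because a single extreme charge would otherwise inflate the very statistics it is being compared against.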
Example: Detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase or the purchase frequency.
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
Example: Time-series data. Suppose the stock market data (time series) of the last several years is available from the New York Stock Exchange and one would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to one’s decision making regarding stock investments.
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Test Partition (in SL)
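In supervised learning, labeled data is commonly divided into a training set for building the model and a test partition for evaluating it. A minimal sketch of such a holdout split follows. The records, the 30% test fraction, the fixed seed, and the `holdout_split` helper are assumptions made for illustration.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and split them into a training set and a test partition."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled records: (features, class label).
labeled = [((mark,), "pass" if mark >= 50 else "fail") for mark in range(0, 100, 10)]
train, test = holdout_split(labeled)
print(len(train), len(test))  # 7 3
```

The test partition is kept out of model building so that the evaluation reflects how the model behaves on data it has not seen.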