DataMiningFormula

The document discusses various concepts in data mining, including frequent itemsets, association rules, and clustering methods. It covers techniques like linear rescaling, distance measures, and the importance of support and confidence in association rules. Additionally, it highlights challenges such as overfitting, underfitting, and the need for appropriate model selection and validation in predictive modeling.


1. Linear rescaling (removes the bias caused by different scales or distributions of attributes): x' = (x - min)/(max - min) maps each attribute to [0, 1].

3. Data Similarity
Distance
Lp (Minkowski) distance: d_p(x, y) = (sum_k |x_k - y_k|^p)^(1/p)
- p=1: Manhattan distance - taxi driving distance
- p=2: Euclidean distance - as the crow flies distance
- p=∞: Chebyshev distance - largest single coordinate distance
Contrast measure: distance to the origin O = (0, 0, ...)
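A minimal sketch of the three Lp distances listed above (the helper name and toy vectors are illustrative, not from the sheet):

```python
# Sketch: Lp (Minkowski) distances for p = 1, 2 and infinity.
import math

def lp_distance(x, y, p=2.0):
    """Minkowski distance; p=1 Manhattan, p=2 Euclidean, p=float('inf') Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                     # largest single-coordinate difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(lp_distance(x, y, 1))             # Manhattan: 5.0
print(lp_distance(x, y, 2))             # Euclidean: ~3.606
print(lp_distance(x, y, float("inf")))  # Chebyshev: 3.0
```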
Mahalanobis Distance (for data whose dimensions are on different scales or are correlated; the shape of the data distribution is taken into account):
d_M(x, y) = sqrt((x - y)^T S^(-1) (x - y)), where S is the covariance matrix of the data.
Equivalent to transforming via PCA, and then Z-score normalization.
- Measures the distance between a point and a distribution, considers the correlation, and is independent of scale.
Local Distribution: the local distance can be an Lp norm (L1, L2, L∞) or Mahalanobis, as appropriate.
Identifying local context helps in understanding data variations, patterns, and relationships that are not apparent globally but become clear when analyzed locally.
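A short sketch of the Mahalanobis distance of a point to the distribution of a sample matrix, assuming rows are samples (the array names and toy data are illustrative):

```python
# Sketch: Mahalanobis distance of a point to the distribution of a data matrix X.
import numpy as np

def mahalanobis(point, X):
    mu = X.mean(axis=0)                       # distribution mean
    cov = np.cov(X, rowvar=False)             # covariance matrix of the data
    diff = point - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
print(mahalanobis(np.array([5.0, 1.0]), X))   # distance of an off-centre point
```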
Similarity (for categorical data)
Overlap Measure: the per-attribute similarity is 1 if the two values match and 0 otherwise; the overall similarity is the sum (or fraction) of matching attributes.
Inverse occurrence frequency: let P_k(x_i) be the fraction of records in attribute k with value x_i; a match on a rare value counts more than a match on a common one (typically weighted by 1/P_k(x_i)^2), a mismatch counts 0.
Goodall measure: a match on value x_i contributes 1 - P_k(x_i)^2 (again, rare matches contribute more), a mismatch contributes 0.
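A hedged sketch of the three categorical per-attribute similarities under the reconstructed definitions above (the frequency table and function names are illustrative):

```python
# Sketch: per-attribute categorical similarities (overlap, inverse occurrence
# frequency, Goodall); p_k(v) is the fraction of records with value v.

def overlap(x, y):
    return 1.0 if x == y else 0.0

def inverse_occurrence_frequency(x, y, p_k):
    return 1.0 / (p_k(x) ** 2) if x == y else 0.0   # rare matches weigh more

def goodall(x, y, p_k):
    return 1.0 - p_k(x) ** 2 if x == y else 0.0     # rare matches weigh more

# Toy attribute: 'red' occurs in 10% of records, 'blue' in 90%.
freq = {"red": 0.1, "blue": 0.9}
p_k = lambda v: freq[v]
print(goodall("red", "red", p_k), goodall("blue", "blue", p_k))  # 0.99 vs 0.19
```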
Text similarity measure:
Cosine measure: build a lexicon -> count term occurrences per document -> compute the cosine of the angle between the count vectors:
cos(x, y) = (x · y) / (||x|| ||y||)
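A minimal sketch of the lexicon -> count -> compute pipeline for cosine similarity (the tokenisation and example sentences are illustrative):

```python
# Sketch: cosine similarity of two term-count vectors.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = Counter("data mining finds frequent itemsets".split())
doc2 = Counter("frequent itemsets support data mining".split())
print(round(cosine(doc1, doc2), 3))   # 0.8
```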
6. Support concept
- Sup(I) can be relative (fraction of total transactions) or absolute (number of transactions).
- Absolute support counts the transactions that include all items of I; relative support = absolute support / total number of transactions.
- minsup => threshold for a set to be included in the list of interesting patterns.
- sup(I) ≥ minsup => I is said to be a frequent itemset.

7. Support Monotonicity Property
The support of a subset J is always greater than or equal to the support of the itemset I: sup(J) ≥ sup(I) for every J ⊆ I.

8. Downward Closure Property
Every subset of a frequent itemset is also frequent!
- If sup(I) ≥ minsup and J is a subset of I, then sup(J) ≥ sup(I) ≥ minsup => J is also frequent.
The number of frequent itemsets with k items decreases with increasing k.

9. Maximal Frequent Itemset
A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.
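A small sketch of absolute and relative support over a toy transaction list (the items and transactions are illustrative):

```python
# Sketch: absolute and relative support of an itemset.
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"}, {"milk"}]

def support(itemset, transactions, relative=True):
    count = sum(1 for t in transactions if itemset <= t)   # itemset contained in t
    return count / len(transactions) if relative else count

print(support({"bread", "milk"}, transactions))          # relative: 0.5
print(support({"bread", "milk"}, transactions, False))   # absolute: 2
```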
10. Association rule
- Describes the relation between different itemsets.
- Written as X => Y with some confidence measure.
The rule X => Y is said to be an association rule at minimum support minsup and minimum confidence minconf if it satisfies both of the following criteria:
- sup(X ∪ Y) is at least minsup
- conf(X => Y) is at least minconf

11. Confidence
The confidence of the rule X => Y, denoted conf(X => Y), is the conditional probability of X ∪ Y occurring in a transaction, given that the transaction contains X (X ∪ Y means X and Y occurring together):
conf(X => Y) = sup(X ∪ Y) / sup(X), with conf in [0, 1].

12. Brute Force
Check all 2^n - 1 possible itemsets.

12. Apriori
Calculate the support for all 1-itemsets, all 2-itemsets, ..., all n-itemsets, keeping only candidates whose subsets are frequent; check sup ≥ minsup -> frequent itemsets [worst-case runtime O(2^n)].

13. Frequent Pattern (FP) Growth
Build the lexicon, create the single-item table, order items by support, reorder the transactions, build the FP-tree [runtime O(n)].
- With minsup: find frequent itemsets (size 1, 2, ..., n) by sup(itemset) ≥ minsup.
- With confidence: derive rules using formula 11.
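A compact, hedged sketch of level-wise (Apriori-style) mining with the downward-closure pruning described above; the transactions and function names are illustrative only:

```python
# Sketch: level-wise (Apriori-style) frequent itemset mining on toy data.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, level = {}, [frozenset([i]) for i in items]
    while level:
        # count support of the current candidates in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(survivors)
        # build (k+1)-candidates whose k-subsets are all frequent (downward closure)
        keys = list(survivors)
        level = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        level = [c for c in level
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))]
    return frequent

tx = [frozenset(t) for t in
      [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]]
print(apriori(tx, minsup=0.5))
```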

1. Classification (choose the validation scheme first, average the performance with cross-validation, tune, pick the best parameters, retrain)
[ Acc = (TP+TN)/total | Precision = TP/(TP+FP) | Recall = TP/(TP+FN) | F1 = 2*Precision*Recall/(Precision+Recall) ]
Goal: predict test samples as accurately as possible # not to explain the training data accurately
Cross-validation: N-fold (divide the labelled data into N blocks, take one block for validation, train on the others, measure the average performance), leave-one-out (special case with block size 1), stratified (ensures that each class is represented proportionally in all the subsets).
Issues - solutions are the same as for Regression:
- Data size: if the number of training samples for each category is small, the learned model does not reflect well how data points are distributed → poor prediction.
- Overfitting: model optimized only on the training set and not on the unseen samples => lacks generalization ability → poor prediction.
- Underfitting: model too simple to describe the data statistics, or model not suitable for the data => low prediction accuracy.
Rare classes & small data: some classes rarely occur in the training data and are often misclassified; training data is limited and no new data can be obtained (easily).
Solution: some algorithms can incorporate weights (down-weight common classes, up-weight rare classes, include weights in decision boundaries and/or evaluation metrics) / biased sampling (over-sample rare classes, under-sample common classes) / SMOTE (use existing data to generate synthetic training data).
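A minimal sketch of the four classification metrics above from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: accuracy, precision, recall and F1 from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))  # (0.85, 0.889, 0.8, 0.842)
```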
KNN: (K=1, K>1)
- pros: simple, no need to train a model, can be used with few training examples
- cons: slow to classify, curse of dimensionality, sensitive to noise/outliers, does not exploit the structure of the data set, ties non-resolvable with n>2 categories

Decision Tree
- pros: simple, interpretable, effective, efficient, analytical power, easy to extend to new scenarios
- cons: calculations get complex if outcomes are linked, high cost, prone to overfitting, lack of a confidence measure, ...
Splitting
- purity: fraction of samples having the dominant class label | err_rate = 1 - purity
- per-node error: e_x = (samples in node x outside the dominant class) / N_x, with N_x = number of samples in node x and N = total number of samples (the per-node scores are weighted by N_x/N)
- GINI: G = 1 - Σ_c p_c^2 over all classes c that can occur (p_c = class count / node size) | Entropy: E = -Σ_c p_c log2 p_c, used the same way
Stop splitting: node size (≤ threshold) and purity (≥ threshold)
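A small sketch of GINI impurity and entropy computed from the class counts at a node (the counts are illustrative):

```python
# Sketch: GINI impurity and entropy of a node from its class counts.
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

print(gini([8, 2]), entropy([8, 2]))   # 0.32, ~0.722
print(gini([5, 5]), entropy([5, 5]))   # maximum impurity for two classes: 0.5, 1.0
```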
2. Regression
y_i: real (known) value, f(x_i): predicted value; smaller error is better.
MSE (mean squared error) = (1/N) Σ_i (y_i - f(x_i))^2
TSS (total sum of squares) = Σ_i (y_i - ȳ)^2, with ȳ the mean of the known values
R^2 measures the correlation between the predicted value f(X) and the known value y: R^2 = 1 - Σ_i (y_i - f(x_i))^2 / TSS, in [0, 1], larger is better.
R^2 can be outside of [0, 1] if the model is non-linear.
Linear Regression: f(x) = w^T x + b, fitted by minimizing the squared error.
Non-linear Regression: f(x) is a non-linear function of x (e.g. polynomial or kernel-based).
KNN-Regression: instead of voting with a class, each neighbour votes with its value.
- pros: simple, works on small datasets, flexible, can handle non-linear data
- cons: slower as the dataset grows, sensitive to noise/outliers and to scale, needs normalization/standardization
Regression Tree:
- pros: easy to interpret, handles categorical & continuous variables, robust to outliers
- cons: tends to overfit, reduced accuracy, high variance, biased results with features having many levels
Issues:
a) Sampling density: sparse areas of the phase space => less model constraint; low-density sampling => opportunity for under-fitting.
b) Overfitting: model optimized only on the training set and not on the unseen samples: lacks generalization ability => poor prediction.
c) Underfitting: model too simple to describe the data statistics, or model not suitable for the data: low prediction accuracy.
Solutions:
- Obtain more data samples: collect or inject samples (resampling, introducing small noise, data augmentation)
- Select the right models; this requires understanding the data well
- Regularization: more penalty for more complex models
- Use validation/cross-validation to select robust training models
- Select/combine suitable regression methods
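A minimal sketch of MSE and R^2 under the definitions above (the toy values are illustrative):

```python
# Sketch: MSE and R^2 from known values y and predictions f.
def mse(y, f):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

def r_squared(y, f):
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    tss = sum((yi - y_bar) ** 2 for yi in y)          # total sum of squares
    return 1.0 - sse / tss

y = [3.0, 5.0, 7.0, 9.0]
f = [2.8, 5.1, 7.3, 8.9]
print(mse(y, f), r_squared(y, f))   # 0.0375, 0.9925
```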
3. Clustering
Different from Association Patterns:
- Association pattern mining finds attributes/features that co-occur; the focus is on the items.
- Clustering groups transactions that are similar; the focus is on the transactions.
K-means clustering:
- pros: simple, intuitive, good results when clusters are well separated, relatively efficient
- cons: finds a local solution only, the number of clusters must be specified, not robust against highly overlapping clusters, doesn't work with categorical data, may not be optimal
K-Mode clustering:
- definition: simple adaptation of k-Means for categorical data
- mode: dominant category in a cluster
- idea: instead of taking the average to find the centroid, use the mode to find the dominant category in each cluster
- multi-dimensional categorical data: mode finding for each attribute separately
Agglomerative clustering (O(n^2 log n)):
- Best (single) linkage: smallest distance # Worst (complete) linkage: largest distance
- Group-average linkage: average distance
- Closest centroid: distance between the centroids of the clusters
- pros: does not assume a number of clusters, may generate meaningful taxonomies
- cons: slow for large data sets, O(n^2) for computing the distance matrix
Dendrograms (tree plots): the largest distance between merged clusters indicates the optimal number of clusters.
Select the right clustering method: visualize (inspection), compare with other approaches (cluster validation), examine the interpretability, use cluster ensembles.
Applications: data summarization (how data are distributed) / customer segmentation & collaborative filtering (group customers according to profiles, demographics, ..., then filter per group found via the clustering step) / text applications (hierarchical clustering is potentially useful for organizing web pages) / multimedia applications (clustering index - effective retrieval of multimedia content).
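A compact, hedged k-Means sketch (pure Python; the point data, k and iteration cap are illustrative only):

```python
# Sketch: naive k-Means (Lloyd's algorithm) on 2-D points.
import math, random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):                  # update step
            if cl:
                centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(pts, k=2)[0])   # two centroids, one per blob
```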
4. Outlier
Definition: an outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. # An inlier is the opposite of an outlier.
Application: (-) outliers are treated as noise and are removed or modified; outliers in activity can be a sign of fraudulent behavior / (+) outliers are a sign of rare or previously unknown phenomena.
Formula: TPR = TP/(TP + FN) (true positive rate) / FPR = FP/(FP + TN) (false positive rate)
Model types (detection output: binary label or score):
1. Extreme values: outliers are improbable data points / extreme values are improbable data points at the tail of a distribution (improbable: occur only once).
2. Probabilistic models: outliers are points with a low probability of being generated by the model M.
3. Distance-based methods: the distance-based outlier score of an object O is its distance to its k-th nearest neighbor / Histogram/Grid: the outlier score for X is the number of other points in the same bin.
4. Kernel density: the outlier score for X is the sum of all the other density functions evaluated at the location of X.
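A small sketch of the distance-based outlier score (distance to the k-th nearest neighbour); the point set and k are illustrative:

```python
# Sketch: distance-based outlier score = distance to the k-th nearest neighbour.
import math

def knn_outlier_score(o, points, k=2):
    dists = sorted(math.dist(o, p) for p in points if p != o)
    return dists[k - 1]          # larger score => more outlying

pts = [(0, 0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5, 5)]
for p in pts:
    print(p, round(knn_outlier_score(p, pts), 2))   # (5, 5) gets the largest score
```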
5. Recommender
Application: recommend similar items (also read, also bought, bought together) -> increases sales for Amazon and potentially finds better items for customers / recommend shows that the user might like based on (popular, trending, newest, ...) -> retains customers for Netflix -> less browsing time for customers.
User-based: find similar users -> identify unseen items that similar users liked -> predict the user's rating for the unseen items -> recommend the best-liked items.
Item-based: find similar items -> identify unseen items that are similar to items the user likes -> predict the user's rating for the unseen items -> recommend the highest-rated items.
Formula: user1: X = (x1, ..., xn) / user2: Y = (y1, ..., yn) -> correlation coefficient:
corr(X, Y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt(Σ_i (x_i - x̄)^2 * Σ_i (y_i - ȳ)^2), where x̄ is the average of X (and ȳ of Y).
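A minimal sketch of the correlation coefficient between two users' rating vectors, as in the formula above (the ratings are illustrative):

```python
# Sketch: Pearson correlation between two users' rating vectors.
import math

def pearson(x, y):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xb) ** 2 for a in x) * sum((b - yb) ** 2 for b in y))
    return num / den if den else 0.0

user1 = [5, 3, 4, 4]
user2 = [4, 2, 4, 3]
print(round(pearson(user1, user2), 3))   # close to 1 => similar taste
```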
Theory:
1. DB: flat file systems (data redundancy, data inconsistency, data isolation, concurrent edits) / DBMS (store data once, then link to it from multiple places, fewer formats, atomic transactions) = database (tables, indexes, schema) + management system (DDL, DML, SQL).
2. Queries: CREATE creates new database objects and defines the structure of the data (rows, columns) / INSERT inserts data (rows) into an existing table / DROP completely removes an entire table, database or index object; all its data is permanently deleted / DELETE removes specific rows from a table based on a condition / ALTER modifies the structure/schema of a table (changes the table definition) / UPDATE modifies the data within a table (changes the actual values in rows). (See the sketch after this list.)
3. Data preprocessing: NaN handling, imputation (KNN or mean), normalization to [0, 1], standardization (mean 0, std 1) -> gives each feature the same weight.
4. If more features are added -> curse of dimensionality -> overfitting due to noisy & irrelevant features, data becomes sparse, distances between points increase [use statistical tests for feature selection] # keeping only the highest-correlated feature -> risk of overfitting and of ignoring nonlinear relationships [apply PCA or decision trees to capture nonlinearity].
5. Brute force (checks all possible itemsets to determine their support) - 1 scan / Apriori (downward closure property, generates candidates in multiple passes, scanning the database each time) - 4 scans / FP-Growth (builds a prefix tree (FP-tree) in 1 scan).
6. Issues with 10-fold cross-validation on imbalanced data: each fold may not retain the original class ratio; in some folds the minority class might not be represented at all -> biased performance measures. / The model may learn to prioritize the majority class due to its higher representation, potentially ignoring the minority class / Imbalanced classes can result in misleading metrics. / Solution: stratified cross-validation (ensures that each fold maintains the original class distribution) or SMOTE (generate synthetic samples for the minority class within each fold to balance the class distribution).
7. Underfit: occurs when a model is too simplistic to capture the underlying patterns in the data; it has high bias and lacks the capacity to differentiate between classes (e.g. in a facial recognition system, an underfit model might misclassify faces because its overly simplistic structure cannot learn the nuances of facial features). / Overfit: the model may learn to favor the majority class and may simply predict the majority class for all instances -> biased performance evaluation.
8. DM applications: e-commerce recommendations (using association rule mining) / personalized email marketing (classification and regression): classification is used to group customers based on purchasing behavior and preferences, while regression predicts the likelihood of purchasing certain items based on historical behavior.
9. Revisit the data mining process (to enhance model performance, address overfitting/underfitting, gain new insights).
10. 3 blocks: data processing, model building (classification, clustering, or regression), evaluation (measuring the performance using various metrics).
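A minimal sqlite3 sketch illustrating the query types listed in item 2 above; the table name, columns and values are illustrative only:

```python
# Sketch: the query types from item 2 on an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")   # CREATE: define structure
cur.execute("INSERT INTO customers (name) VALUES ('alice')")                # INSERT: add a row
cur.execute("ALTER TABLE customers ADD COLUMN city TEXT")                   # ALTER: change the schema
cur.execute("UPDATE customers SET city = 'Hanoi' WHERE name = 'alice'")     # UPDATE: change values in rows
cur.execute("DELETE FROM customers WHERE city IS NULL")                     # DELETE: remove rows on a condition
print(cur.execute("SELECT * FROM customers").fetchall())
cur.execute("DROP TABLE customers")                                         # DROP: remove the table and its data
con.close()
```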
