DataMiningFormula

The document discusses various concepts in data mining, including frequent itemsets, association rules, and clustering methods. It covers techniques like linear rescaling, distance measures, and the importance of support and confidence in association rules. Additionally, it highlights challenges such as overfitting, underfitting, and the need for appropriate model selection and validation in predictive modeling.


1. Linear rescaling (removes the bias caused by different scales or distributions of attributes): x' = (x - min)/(max - min) maps each attribute to [0, 1].

3. Data Similarity
Distance
Lp (Minkowski) distance: d_p(x, y) = (sum_k |x_k - y_k|^p)^(1/p)
- p=1: Manhattan distance - taxi driving distance
- p=2: Euclidean distance - as the crow flies distance
- p=∞: Chebyshev distance - largest single coordinate distance
Contrast measure: distance to the origin O = (0, 0, ...)
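A minimal sketch of the three Lp distances listed above (the helper name and toy vectors are illustrative, not from the sheet):

```python
# Sketch: Lp (Minkowski) distances for p = 1, 2 and infinity.
import math

def lp_distance(x, y, p=2.0):
    """Minkowski distance; p=1 Manhattan, p=2 Euclidean, p=float('inf') Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                     # largest single-coordinate difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(lp_distance(x, y, 1))             # Manhattan: 5.0
print(lp_distance(x, y, 2))             # Euclidean: ~3.606
print(lp_distance(x, y, float("inf")))  # Chebyshev: 3.0
```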
Mahalanobis Distance (for data whose dimensions are on different scales or are correlated; the shape of the data distribution is taken into account):
d_M(x, y) = sqrt((x - y)^T S^(-1) (x - y)), where S is the covariance matrix of the data.
Equivalent to transforming via PCA, and then Z-score normalization.
- Measures the distance between a point and a distribution, considers the correlation, and is independent of scale.
Local Distribution: the local distance can be an Lp norm (L1, L2, L∞) or Mahalanobis, as appropriate.
Identifying local context helps in understanding data variations, patterns, and relationships that are not apparent globally but become clear when analyzed locally.
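A short sketch of the Mahalanobis distance of a point to the distribution of a sample matrix, assuming rows are samples (the array names and toy data are illustrative):

```python
# Sketch: Mahalanobis distance of a point to the distribution of a data matrix X.
import numpy as np

def mahalanobis(point, X):
    mu = X.mean(axis=0)                       # distribution mean
    cov = np.cov(X, rowvar=False)             # covariance matrix of the data
    diff = point - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
print(mahalanobis(np.array([5.0, 1.0]), X))   # distance of an off-centre point
```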
Similarity (for categorical data)
Overlap Measure: the per-attribute similarity is 1 if the two values match and 0 otherwise; the overall similarity is the sum (or fraction) of matching attributes.
Inverse occurrence frequency: let P_k(x_i) be the fraction of records in attribute k with value x_i; a match on a rare value counts more than a match on a common one (typically weighted by 1/P_k(x_i)^2), a mismatch counts 0.
Goodall measure: a match on value x_i contributes 1 - P_k(x_i)^2 (again, rare matches contribute more), a mismatch contributes 0.
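A hedged sketch of the three categorical per-attribute similarities under the reconstructed definitions above (the frequency table and function names are illustrative):

```python
# Sketch: per-attribute categorical similarities (overlap, inverse occurrence
# frequency, Goodall); p_k(v) is the fraction of records with value v.

def overlap(x, y):
    return 1.0 if x == y else 0.0

def inverse_occurrence_frequency(x, y, p_k):
    return 1.0 / (p_k(x) ** 2) if x == y else 0.0   # rare matches weigh more

def goodall(x, y, p_k):
    return 1.0 - p_k(x) ** 2 if x == y else 0.0     # rare matches weigh more

# Toy attribute: 'red' occurs in 10% of records, 'blue' in 90%.
freq = {"red": 0.1, "blue": 0.9}
p_k = lambda v: freq[v]
print(goodall("red", "red", p_k), goodall("blue", "blue", p_k))  # 0.99 vs 0.19
```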
Text similarity measure:
Cosine measure: build a lexicon -> count term occurrences per document -> compute the cosine of the angle between the count vectors:
cos(x, y) = (x · y) / (||x|| ||y||)
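A minimal sketch of the lexicon -> count -> compute pipeline for cosine similarity (the tokenisation and example sentences are illustrative):

```python
# Sketch: cosine similarity of two term-count vectors.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = Counter("data mining finds frequent itemsets".split())
doc2 = Counter("frequent itemsets support data mining".split())
print(round(cosine(doc1, doc2), 3))   # 0.8
```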
6. Support concept
- Sup(I) can be relative (fraction of total transactions) or absolute (number of transactions).
- Absolute support counts the transactions that include all items of I; relative support = absolute support / total number of transactions.
- minsup => threshold for a set to be included in the list of interesting patterns.
- sup(I) ≥ minsup => I is said to be a frequent itemset.

7. Support Monotonicity Property
The support of a subset J is always greater than or equal to the support of the itemset I: sup(J) ≥ sup(I) for every J ⊆ I.

8. Downward Closure Property
Every subset of a frequent itemset is also frequent!
- If sup(I) ≥ minsup and J is a subset of I, then sup(J) ≥ sup(I) ≥ minsup => J is also frequent.
The number of frequent itemsets with k items decreases with increasing k.

9. Maximal Frequent Itemset
A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.
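A small sketch of absolute and relative support over a toy transaction list (the items and transactions are illustrative):

```python
# Sketch: absolute and relative support of an itemset.
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"}, {"milk"}]

def support(itemset, transactions, relative=True):
    count = sum(1 for t in transactions if itemset <= t)   # itemset contained in t
    return count / len(transactions) if relative else count

print(support({"bread", "milk"}, transactions))          # relative: 0.5
print(support({"bread", "milk"}, transactions, False))   # absolute: 2
```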
10. Association rule
- Describes the relation between different itemsets.
- Written as X => Y with some confidence measure.
The rule X => Y is said to be an association rule at minimum support minsup and minimum confidence minconf if it satisfies both of the following criteria:
- sup(X ∪ Y) is at least minsup
- conf(X => Y) is at least minconf

11. Confidence
The confidence of the rule X => Y, denoted conf(X => Y), is the conditional probability of X ∪ Y occurring in a transaction, given that the transaction contains X (X ∪ Y means X and Y occurring together):
conf(X => Y) = sup(X ∪ Y) / sup(X), with conf in [0, 1].

12. Brute Force
Check all 2^n - 1 possible itemsets.

12. Apriori
Calculate the support for all 1-itemsets, all 2-itemsets, ..., all n-itemsets, keeping only candidates whose subsets are frequent; check sup ≥ minsup -> frequent itemsets [worst-case runtime O(2^n)].

13. Frequent Pattern (FP) Growth
Build the lexicon, create the single-item table, order items by support, reorder the transactions, build the FP-tree [runtime O(n)].
- With minsup: find frequent itemsets (size 1, 2, ..., n) by sup(itemset) ≥ minsup.
- With confidence: derive rules using formula 11.
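A compact, hedged sketch of level-wise (Apriori-style) mining with the downward-closure pruning described above; the transactions and function names are illustrative only:

```python
# Sketch: level-wise (Apriori-style) frequent itemset mining on toy data.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, level = {}, [frozenset([i]) for i in items]
    while level:
        # count support of the current candidates in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(survivors)
        # build (k+1)-candidates whose k-subsets are all frequent (downward closure)
        keys = list(survivors)
        level = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        level = [c for c in level
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))]
    return frequent

tx = [frozenset(t) for t in
      [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]]
print(apriori(tx, minsup=0.5))
```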

1. Classification (choose the validation scheme first, average the performance with cross-validation, tune, pick the best parameters, retrain)
[ Acc = (TP+TN)/total | Precision = TP/(TP+FP) | Recall = TP/(TP+FN) | F1 = 2*Precision*Recall/(Precision+Recall) ]
Goal: predict test samples as accurately as possible # not to explain the training data accurately
Cross-validation: N-fold (divide the labelled data into N blocks, take one block for validation, train on the others, measure the average performance), leave-one-out (special case with block size 1), stratified (ensures that each class is represented proportionally in all the subsets).
Issues - solutions are the same as for Regression:
- Data size: if the number of training samples for each category is small, the learned model does not reflect well how data points are distributed → poor prediction.
- Overfitting: model optimized only on the training set and not on the unseen samples => lacks generalization ability → poor prediction.
- Underfitting: model too simple to describe the data statistics, or model not suitable for the data => low prediction accuracy.
Rare classes & small data: some classes rarely occur in the training data and are often misclassified; training data is limited and no new data can be obtained (easily).
Solution: some algorithms can incorporate weights (down-weight common classes, up-weight rare classes, include weights in decision boundaries and/or evaluation metrics) / biased sampling (over-sample rare classes, under-sample common classes) / SMOTE (use existing data to generate synthetic training data).
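A minimal sketch of the four classification metrics above from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: accuracy, precision, recall and F1 from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))  # (0.85, 0.889, 0.8, 0.842)
```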
KNN: (K=1, K>1)
- pros: simple, no need to train a model, can be used with few training examples
- cons: slow to classify, curse of dimensionality, sensitive to noise/outliers, does not exploit the structure of the data set, ties non-resolvable with n>2 categories

Decision Tree
- pros: simple, interpretable, effective, efficient, analytical power, easy to extend to new scenarios
- cons: calculations get complex if outcomes are linked, high cost, prone to overfitting, lack of a confidence measure, ...
Splitting
- purity: fraction of samples having the dominant class label | err_rate = 1 - purity
- per-node error: e_x = (samples in node x outside the dominant class) / N_x, with N_x = number of samples in node x and N = total number of samples (the per-node scores are weighted by N_x/N)
- GINI: G = 1 - Σ_c p_c^2 over all classes c that can occur (p_c = class count / node size) | Entropy: E = -Σ_c p_c log2 p_c, used the same way
Stop splitting: node size (≤ threshold) and purity (≥ threshold)
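A small sketch of GINI impurity and entropy computed from the class counts at a node (the counts are illustrative):

```python
# Sketch: GINI impurity and entropy of a node from its class counts.
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

print(gini([8, 2]), entropy([8, 2]))   # 0.32, ~0.722
print(gini([5, 5]), entropy([5, 5]))   # maximum impurity for two classes: 0.5, 1.0
```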
2. Regression
y_i: real (known) value, f(x_i): predicted value; smaller error is better.
MSE (mean squared error) = (1/N) Σ_i (y_i - f(x_i))^2
TSS (total sum of squares) = Σ_i (y_i - ȳ)^2, with ȳ the mean of the known values
R^2 measures the correlation between the predicted value f(X) and the known value y: R^2 = 1 - Σ_i (y_i - f(x_i))^2 / TSS, in [0, 1], larger is better.
R^2 can be outside of [0, 1] if the model is non-linear.
Linear Regression: f(x) = w^T x + b, fitted by minimizing the squared error.
Non-linear Regression: f(x) is a non-linear function of x (e.g. polynomial or kernel-based).
KNN-Regression: instead of voting with a class, each neighbour votes with its value.
- pros: simple, works on small datasets, flexible, can handle non-linear data
- cons: slower as the dataset grows, sensitive to noise/outliers and to scale, needs normalization/standardization
Regression Tree:
- pros: easy to interpret, handles categorical & continuous variables, robust to outliers
- cons: tends to overfit, reduced accuracy, high variance, biased results with features having many levels
Issues:
a) Sampling density: sparse areas of the phase space => less model constraint; low-density sampling => opportunity for under-fitting.
b) Overfitting: model optimized only on the training set and not on the unseen samples: lacks generalization ability => poor prediction.
c) Underfitting: model too simple to describe the data statistics, or model not suitable for the data: low prediction accuracy.
Solutions:
- Obtain more data samples: collect or inject samples (resampling, introducing small noise, data augmentation)
- Select the right models; this requires understanding the data well
- Regularization: more penalty for more complex models
- Use validation/cross-validation to select robust training models
- Select/combine suitable regression methods
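A minimal sketch of MSE and R^2 under the definitions above (the toy values are illustrative):

```python
# Sketch: MSE and R^2 from known values y and predictions f.
def mse(y, f):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

def r_squared(y, f):
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    tss = sum((yi - y_bar) ** 2 for yi in y)          # total sum of squares
    return 1.0 - sse / tss

y = [3.0, 5.0, 7.0, 9.0]
f = [2.8, 5.1, 7.3, 8.9]
print(mse(y, f), r_squared(y, f))   # 0.0375, 0.9925
```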
3. Clustering
Different from Association Patterns:
- Association pattern mining finds attributes/features that co-occur; the focus is on the items.
- Clustering groups transactions that are similar; the focus is on the transactions.
K-means clustering:
- pros: simple, intuitive, good results when clusters are well separated, relatively efficient
- cons: finds a local solution only, the number of clusters must be specified, not robust against highly overlapping clusters, doesn't work with categorical data, may not be optimal
K-Mode clustering:
- definition: simple adaptation of k-Means for categorical data
- mode: dominant category in a cluster
- idea: instead of taking the average to find the centroid, use the mode to find the dominant category in each cluster
- multi-dimensional categorical data: mode finding for each attribute separately
Agglomerative clustering (O(n^2 log n)):
- Best (single) linkage: smallest distance # Worst (complete) linkage: largest distance
- Group-average linkage: average distance
- Closest centroid: distance between the centroids of the clusters
- pros: does not assume a number of clusters, may generate meaningful taxonomies
- cons: slow for large data sets, O(n^2) for computing the distance matrix
Dendrograms (tree plots): the largest distance between merged clusters indicates the optimal number of clusters.
Select the right clustering method: visualize (inspection), compare with other approaches (cluster validation), examine the interpretability, use cluster ensembles.
Applications: data summarization (how data are distributed) / customer segmentation & collaborative filtering (group customers according to profiles, demographics, ..., then filter per group found via the clustering step) / text applications (hierarchical clustering is potentially useful for organizing web pages) / multimedia applications (clustering index - effective retrieval of multimedia content).
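A compact, hedged k-Means sketch (pure Python; the point data, k and iteration cap are illustrative only):

```python
# Sketch: naive k-Means (Lloyd's algorithm) on 2-D points.
import math, random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):                  # update step
            if cl:
                centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(pts, k=2)[0])   # two centroids, one per blob
```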
4. Outlier
Definition: an outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. # An inlier is the opposite of an outlier.
Application: (-) outliers are treated as noise and are removed or modified; outliers in activity can be a sign of fraudulent behavior / (+) outliers are a sign of rare or previously unknown phenomena.
Formula: TPR = TP/(TP + FN) (true positive rate) / FPR = FP/(FP + TN) (false positive rate)
Model types (detection output: binary label or score):
1. Extreme values: outliers are improbable data points / extreme values are improbable data points at the tail of a distribution (improbable: occur only once).
2. Probabilistic models: outliers are points with a low probability of being generated by the model M.
3. Distance-based methods: the distance-based outlier score of an object O is its distance to its k-th nearest neighbor / Histogram/Grid: the outlier score for X is the number of other points in the same bin.
4. Kernel density: the outlier score for X is the sum of all the other density functions evaluated at the location of X.
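A small sketch of the distance-based outlier score (distance to the k-th nearest neighbour); the point set and k are illustrative:

```python
# Sketch: distance-based outlier score = distance to the k-th nearest neighbour.
import math

def knn_outlier_score(o, points, k=2):
    dists = sorted(math.dist(o, p) for p in points if p != o)
    return dists[k - 1]          # larger score => more outlying

pts = [(0, 0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5, 5)]
for p in pts:
    print(p, round(knn_outlier_score(p, pts), 2))   # (5, 5) gets the largest score
```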
5. Recommender
Application: recommend similar items (also read, also bought, bought together) -> increases sales for Amazon and potentially finds better items for customers / recommend shows that the user might like based on (popular, trending, newest, ...) -> retains customers for Netflix -> less browsing time for customers.
User-based: find similar users -> identify unseen items that similar users liked -> predict the user's rating for the unseen items -> recommend the best-liked items.
Item-based: find similar items -> identify unseen items that are similar to items the user likes -> predict the user's rating for the unseen items -> recommend the highest-rated items.
Formula: user1: X = (x1, ..., xn) / user2: Y = (y1, ..., yn) -> correlation coefficient:
corr(X, Y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt(Σ_i (x_i - x̄)^2 * Σ_i (y_i - ȳ)^2), where x̄ is the average of X (and ȳ of Y).
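A minimal sketch of the correlation coefficient between two users' rating vectors, as in the formula above (the ratings are illustrative):

```python
# Sketch: Pearson correlation between two users' rating vectors.
import math

def pearson(x, y):
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xb) ** 2 for a in x) * sum((b - yb) ** 2 for b in y))
    return num / den if den else 0.0

user1 = [5, 3, 4, 4]
user2 = [4, 2, 4, 3]
print(round(pearson(user1, user2), 3))   # close to 1 => similar taste
```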
Theory:
1. DB: flat file systems (data redundancy, data inconsistency, data isolation, concurrent edits) / DBMS (store data once, then link to it from multiple places, fewer formats, atomic transactions) = database (tables, indexes, schema) + management system (DDL, DML, SQL).
2. Queries: CREATE creates new database objects and defines the structure of the data (rows, columns) / INSERT inserts data (rows) into an existing table / DROP completely removes an entire table, database or index object; all its data is permanently deleted / DELETE removes specific rows from a table based on a condition / ALTER modifies the structure/schema of a table (changes the table definition) / UPDATE modifies the data within a table (changes the actual values in rows). (See the sketch after this list.)
3. Data preprocessing: NaN handling, imputation (KNN or mean), normalization to [0, 1], standardization (mean 0, std 1) -> gives each feature the same weight.
4. If more features are added -> curse of dimensionality -> overfitting due to noisy & irrelevant features, data becomes sparse, distances between points increase [use statistical tests for feature selection] # keeping only the highest-correlated feature -> risk of overfitting and of ignoring nonlinear relationships [apply PCA or decision trees to capture nonlinearity].
5. Brute force (checks all possible itemsets to determine their support) - 1 scan / Apriori (downward closure property, generates candidates in multiple passes, scanning the database each time) - 4 scans / FP-Growth (builds a prefix tree (FP-tree) in 1 scan).
6. Issues with 10-fold cross-validation on imbalanced data: each fold may not retain the original class ratio; in some folds the minority class might not be represented at all -> biased performance measures. / The model may learn to prioritize the majority class due to its higher representation, potentially ignoring the minority class / Imbalanced classes can result in misleading metrics. / Solution: stratified cross-validation (ensures that each fold maintains the original class distribution) or SMOTE (generate synthetic samples for the minority class within each fold to balance the class distribution).
7. Underfit: occurs when a model is too simplistic to capture the underlying patterns in the data; it has high bias and lacks the capacity to differentiate between classes (e.g. in a facial recognition system, an underfit model might misclassify faces because its overly simplistic structure cannot learn the nuances of facial features). / Overfit: the model may learn to favor the majority class and may simply predict the majority class for all instances -> biased performance evaluation.
8. DM applications: e-commerce recommendations (using association rule mining) / personalized email marketing (classification and regression): classification is used to group customers based on purchasing behavior and preferences, while regression predicts the likelihood of purchasing certain items based on historical behavior.
9. Revisit the data mining process (to enhance model performance, address overfitting/underfitting, gain new insights).
10. 3 blocks: data processing, model building (classification, clustering, or regression), evaluation (measuring the performance using various metrics).
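A minimal sqlite3 sketch illustrating the query types listed in item 2 above; the table name, columns and values are illustrative only:

```python
# Sketch: the query types from item 2 on an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")   # CREATE: define structure
cur.execute("INSERT INTO customers (name) VALUES ('alice')")                # INSERT: add a row
cur.execute("ALTER TABLE customers ADD COLUMN city TEXT")                   # ALTER: change the schema
cur.execute("UPDATE customers SET city = 'Hanoi' WHERE name = 'alice'")     # UPDATE: change values in rows
cur.execute("DELETE FROM customers WHERE city IS NULL")                     # DELETE: remove rows on a condition
print(cur.execute("SELECT * FROM customers").fetchall())
cur.execute("DROP TABLE customers")                                         # DROP: remove the table and its data
con.close()
```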
