DataMiningFormula
Linear rescaling (removes bias caused by different scales or distributions of attributes)
3. Data Similarity
- Distance
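The notes list only "Distance" here, so as an assumption, a minimal sketch of one common choice, the Euclidean distance between two numeric records:

import math

def euclidean_distance(a, b):
    # Distance between two numeric records of equal length:
    # square root of the sum of squared attribute differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~3.742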
The number of frequent itemsets with k items decreases with increasing k
9. Maximal Frequent Itemset
A frequent itemset is maximal at a given minimum support level minsup if it is
frequent, and no superset of it is frequent.
10. Association rule
- Describes the relation between different itemsets
- Written as X => Y with some confidence measure
The rule X ⇒ Y is said to be an association rule at minimum support minsup
and minimum confidence minconf, if it satisfies both of the following criteria:
- the support of X∪Y is at least minsup
- the confidence of X=>Y is at least minconf
- Absolute support (number of transactions containing all the items); relative support = absolute support / total transactions
11. Confidence
The confidence of the rule X=>Y, denoted as conf(X=>Y), is the conditional
probability of X∪Y occurring in a transaction, given that the transaction
contains X
6. Support concept
- sup(I) can be relative (fraction of total transactions) or absolute (number of transactions)
- minsup => threshold for a set to be included in the list of interesting patterns
- sup(I) ≥ minsup => I is said to be a frequent itemset
7. Support Monotonicity Property
The support of a subset J of an itemset I is always greater than or equal to the support of I:
J ⊆ I and sup(I) ≥ minsup => sup(J) ≥ sup(I) ≥ minsup => J is also frequent
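A minimal sketch (the toy transactions are assumed, not from the notes) of how the support and confidence definitions above (sections 6, 10, 11) can be computed:

# Toy transactions (hypothetical example data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    # Relative support: fraction of transactions containing every item of the set.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # conf(X => Y) = sup(X ∪ Y) / sup(X): probability of Y given that X is present.
    return support(set(X) | set(Y)) / support(X)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # 0.666...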
2. Regression
yi: real value, f(xi): predicted value; the smaller the error between them, the better
KNN-Regression: instead of voting with a class label, each neighbour votes with its value (the prediction is their average)
pros: simple, works on small datasets, flexible, can handle non-linear data
cons: slower as the dataset grows, sensitive to noise/outliers and to feature scale, needs normalization/standardization
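A minimal KNN-regression sketch on hypothetical data: each of the k nearest neighbours votes with its value and the prediction is their average:

import math

def knn_regress(X_train, y_train, x_query, k=3):
    # Sort training points by Euclidean distance to the query point.
    dists = sorted((math.dist(x, x_query), y) for x, y in zip(X_train, y_train))
    # Each of the k nearest neighbours votes with its value; average the votes.
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / len(nearest)

X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [1.0, 2.0, 3.0, 10.0]
print(knn_regress(X_train, y_train, [2.5], k=3))  # -> 2.0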
Regression Tree:
Stop splitting: node size (≤ threshold) and purity (≥ threshold)
pros: easy to interpret, handles categorical & continuous variables, robust to outliers
cons: tends to overfit, reduced accuracy, high variance, biased results with features that have many levels
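A hedged sketch of a regression tree with stop-splitting controls; scikit-learn and the toy data are assumptions (the notes do not name a library). min_samples_leaf plays the size role and min_impurity_decrease a purity-style role:

# Assumes scikit-learn is available.
from sklearn.tree import DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Stop-splitting controls: minimum node size and minimum impurity improvement.
tree = DecisionTreeRegressor(min_samples_leaf=2, min_impurity_decrease=0.01)
tree.fit(X, y)
print(tree.predict([[2.5]]))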
Issues:
a) Sampling density:
sparse areas of phase space => less model constraint
low-density sampling => opportunity for under-fitting
b) Overfitting: model optimized only on the training set and not on unseen samples: lacks generalization ability => poor prediction
c) Underfitting: model too simple to describe the data statistics, or model not suitable for the data: low prediction accuracy
Solution:
- Obtain more data samples: collect or inject samples (resampling, introducing small noise, data augmentation)
- Select the right models, need to understand data well
- Regularization: more penalty for more complex models
- Use validation/cross-validation to select robust training models (see the sketch after this list)
- Select/combine suitable regression methods
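A sketch of the regularization and cross-validation items above, assuming scikit-learn and synthetic data:

# Assumes scikit-learn; toy data for illustration only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Ridge adds an L2 penalty (alpha) on the weights: more penalty => simpler model.
model = Ridge(alpha=1.0)

# 5-fold cross-validation estimates how well the model generalizes to unseen data.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())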
Model types for outlier detection (detection metrics: binary label or score):
1. Extreme values: outliers are improbable data points / extreme values are improbable data points at the tail of a distribution (improbable: appears only once)
2. Probabilistic models: outliers are points with a low probability of being generated by the model M
3. Distance-based method: the distance-based outlier score of an object O is its distance to its kth nearest neighbor (sketched after this list) / Histogram/Grid: the outlier score for X is the number of other points in the same bin
4. Kernel Density: the outlier score for X is the sum of all the other points' density functions evaluated at the location of X
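A minimal sketch of the distance-based score (method 3): the outlier score of each point is its distance to its kth nearest neighbor (toy 1-D points assumed):

import math

def knn_outlier_score(points, k=2):
    # Score of a point = distance to its kth nearest neighbor (larger => more outlying).
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

points = [[0.0], [0.1], [0.2], [5.0]]
print(knn_outlier_score(points, k=2))  # the isolated point gets the largest score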
5. Recommender:
application: recommend similar items (also read, bought together) -> increases sales for Amazon, potentially finds better items for customers / recommend shows that the user might like, based on (popular, trending, newest ..) -> retains customers for Netflix -> less browsing time for customers
user based: find similar users -> identify unseen items that similar users liked -> predict the user's rating for unseen items -> recommend the best-liked items
item based: find similar items -> identify unseen items that are similar to items the user likes -> predict the user's rating for unseen items -> recommend the highest-rated items
Formula: user1: X = (x1, .., xn) / user2: Y = (y1, .., yn) -> correlation coefficient:
corr(X, Y) = Σ(xi − x̄)(yi − ȳ) / sqrt(Σ(xi − x̄)² · Σ(yi − ȳ)²), where x̄ is the avg of X and ȳ the avg of Y
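A sketch of the correlation coefficient above (Pearson), on two hypothetical rating vectors:

import math

def pearson(X, Y):
    # corr(X, Y) = sum((xi - x_bar)(yi - y_bar)) / sqrt(sum((xi - x_bar)^2) * sum((yi - y_bar)^2))
    x_bar = sum(X) / len(X)
    y_bar = sum(Y) / len(Y)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in X) * sum((y - y_bar) ** 2 for y in Y))
    return num / den

user1 = [5, 3, 4, 4]   # hypothetical ratings
user2 = [4, 2, 4, 3]
print(pearson(user1, user2))  # close to 1 => similar users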
3. Clustering
Different from Association Patterns:
- Association patterns: find attributes/features that co-occur / focus is on the items
- Clustering: group transactions that are similar / focus is on the transactions
K-means clustering:
pros: simple, intuitive, good results when clusters are well separated, relatively efficient
cons: finds a local solution only, the number of clusters must be chosen in advance, not robust against highly overlapping clusters, doesn't work on categorical data, may not be optimal
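A k-means sketch, assuming scikit-learn and toy 2-D points; k must be given in advance and the result is only a local optimum, as noted above:

# Assumes scikit-learn; tiny 2-D toy data.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

# The number of clusters is fixed up front; the result depends on initialization.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per point
print(km.cluster_centers_)  # centroid (mean) of each cluster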
K-Mode clustering:
- definition: simple adaptation of k-Means for categorical data
- mode: dominant category in a cluster
- idea: instead of taking the average to find the centroid, use the mode to find the dominant category in each cluster
- Multi-dimensional categorical data: mode finding for each attribute separately
Agglomerative clustering: (O(n^2 log(n)))
pros: does not assume a number of clusters, may generate meaningful taxonomies
cons: slow for large data sets, O(n^2) for computing the distance matrix
- Best (single) linkage: smallest distance # Worst (complete) linkage: largest distance
- Group-average linkage: average distance
- Closest Centroid: distance between the centroids of the clusters
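An agglomerative-clustering sketch, assuming scikit-learn; the linkage parameter corresponds to the single/complete/average options listed above:

# Assumes scikit-learn; toy 1-D data with two obvious groups.
from sklearn.cluster import AgglomerativeClustering

X = [[0.0], [0.2], [0.4], [5.0], [5.1], [5.3]]

for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, labels)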
Theory:
1. DB: Flat File System (data redundancy, data inconsistency, data isolation, concurrent edits) / DBMS (store data once then link to it from multiple places, fewer formats, atomic transactions) = db (tables, indexes, schema) + management system (DDL, DML, SQL)
2. Queries: (CREATE: create new database objects, it defines the structure of the data (rows, columns) / INSERT: insert data (rows) into an existing table / DROP: completely remove an entire table/db/index object, all its data is permanently deleted / DELETE: remove specific rows from a table based on a condition / ALTER: modify the structure/schema of a table, changes the table definition / UPDATE: modify the data within a table, changes the actual values in rows)
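A hedged sketch of the query types in item 2, run through Python's sqlite3 (the notes do not name an engine; ALTER support varies between engines):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")  # define structure
cur.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")          # add a row
cur.execute("ALTER TABLE customers ADD COLUMN city TEXT")                  # change the schema
cur.execute("UPDATE customers SET city = 'Hanoi' WHERE id = 1")            # change values in rows
cur.execute("DELETE FROM customers WHERE id = 1")                          # remove rows by condition
cur.execute("DROP TABLE customers")                                        # remove the whole table
conn.commit()
conn.close()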
3. Data pre-processing (handle NaN, impute (KNN or mean), normalization to [0,1], standardization (mean 0, std 1) -> each feature gets the same weight)
4. If more features (dimensions) are added -> Curse of Dimensionality -> overfitting due to noisy & irrelevant features, data becomes sparse, distances between points increase [use statistical tests for feature selection] # keeping only the highest-correlation features -> risk of overfitting and of ignoring nonlinear relationships [apply PCA or a Decision Tree to capture nonlinearity]
5. Brute Force (checks all possible itemsets to determine their support) 1 scan / Apriori (downward closure property, generates candidates in multiple passes, scanning the db each time) 4 scans / FP-Growth (builds a prefix tree (FP-tree) in 1 scan)
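A compact Apriori-style sketch on assumed toy transactions, showing the downward-closure pruning from item 5: a size-k candidate is kept only if every (k-1)-subset is frequent:

from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
minsup = 3  # absolute support threshold (assumed)

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= minsup]
k = 2
while frequent:
    print(k - 1, [set(s) for s in frequent])
    prev = set(frequent)
    # Candidate generation: unions of frequent (k-1)-itemsets of size k,
    # pruned to those whose every (k-1)-subset is frequent (downward closure).
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= minsup]
    k += 1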
6. Issues with 10-fold CV on imbalanced data: each fold may not retain the original class ratio; in some folds the minority class might not be represented at all -> biased performance measures / the model may learn to prioritize the majority classes due to their higher representation, potentially ignoring the minority class / imbalanced classes can result in misleading metrics / solution: Stratified Cross-Validation (ensures that each fold maintains the original class distribution) or SMOTE (generate synthetic samples for the minority class within each fold to balance the class distribution)
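A sketch of the Stratified Cross-Validation fix from item 6, assuming scikit-learn and an imbalanced toy label vector:

# Assumes scikit-learn; 18 negatives vs 2 positives.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 18 + [1] * 2)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps roughly the original 9:1 class ratio.
    print(np.bincount(y[test_idx]))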
7. Underfit: occurs when a model is too simplistic to capture the underlying patterns in the data; it has high bias and lacks the capacity to differentiate between classes (e.g. in a facial recognition system, an underfit model might misclassify faces as it cannot learn the nuances of facial features due to its overly simplistic structure) / Overfit: the model may learn to favor the majority class and may simply predict the majority class for all instances -> biased performance evaluation
8. DM app: E-commerce Recommendations (using Association Rule Mining) / Personalized Email Marketing (Classification and Regression): classification is used to group customers based on purchasing behavior and preferences, while regression predicts the likelihood of purchasing certain items based on historical behavior
9. Revisit (to enhance model performance, address overfit/underfit, gain new insights)
10. 3 blocks: data processing, model building (classification, clustering, or regression), evaluation (measuring the performance using various metrics)