Course - Data Science Foundations - Data Mining
Simplify the dataset to focus on the variables or constructs that carry the most meaning and separate
them from noise. Here we are generally talking about reducing the number of variables or fields (as
opposed to observations).
Possible reasons:
An analogy is projecting a shadow: we take data from a high-dimensional space (each variable in the
dataset is a dimension) and project it onto a lower-dimensional space. Think of taking a
three-dimensional object and projecting its shadow onto a two-dimensional surface while still being
able to tell what the object is. One way of doing this is PCA (Principal Component Analysis). Tools
that may be used:
- R
- Python
- Orange
- RapidMiner
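The shadow analogy can be sketched in a few lines of Python. This is a minimal illustration, not a full PCA workflow: it assumes NumPy is available, uses made-up data, and the helper name `pca_project` is hypothetical. It centers the data, finds the principal directions with a singular value decomposition, and projects three-dimensional points onto the top two components.

```python
import numpy as np

def pca_project(data, n_components):
    """Project the rows of `data` onto the top principal components (illustrative helper)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Made-up 3-D points lying close to a plane, so a 2-D "shadow" keeps most of the information.
rng = np.random.default_rng(0)
xy = rng.normal(size=(200, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + rng.normal(scale=0.05, size=200)
points = np.column_stack([xy, z])

projected, explained = pca_project(points, n_components=2)
print(projected.shape)   # one 2-D point per original 3-D point
print(explained.sum())   # share of total variance kept by the two components
```

The share of variance kept is a rough measure of how recognizable the "shadow" still is.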
Clustering:
The idea is to group the entire set of observations or cases so that “like goes with like”. This is a
grouping of convenience rather than some sort of natural/universal grouping: we group the cases to
accomplish a specific purpose. For example, in marketing, similar customers are grouped together for
offers. Clusters are pragmatic groupings to serve a particular purpose.
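As a rough sketch of "like goes with like", the example below runs a very small k-means loop in plain Python. k-means is only one common clustering method and is an assumption here, since the notes do not name a specific algorithm; the customer data and the function name `k_means` are made up for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then recompute the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up customers: (annual spend, store visits per year).
customers = [(200, 2), (220, 3), (250, 2), (900, 10), (950, 12), (1000, 11)]

# Start from two of the customers as initial centroids (an arbitrary choice).
centroids, clusters = k_means(customers, [customers[0], customers[-1]])
print(centroids)
print(clusters)
```

In practice, variables on very different scales are usually standardized first so that one of them does not dominate the distance.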
Classification:
Classification complements clustering. Clustering creates buckets and classification puts new cases into
them. Algorithms used for classification:
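One simple way to picture "putting new cases into buckets" is a nearest-centroid rule, sketched below. This is an illustrative choice and not necessarily one of the algorithms the course lists; the bucket labels, centers, and the function name `classify` are hypothetical.

```python
import math

# Hypothetical bucket centers (annual spend, store visits per year) from an earlier grouping step.
buckets = {
    "occasional": (220.0, 2.0),
    "frequent": (950.0, 11.0),
}

def classify(case, buckets):
    """Return the label of the bucket whose center is closest to the new case."""
    return min(buckets, key=lambda label: math.dist(case, buckets[label]))

print(classify((300.0, 4.0), buckets))
print(classify((870.0, 9.0), buckets))
```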
Anomaly Detection:
Anomalies (outliers) distort statistics, correlations, etc. There are a few ways to deal with them:
- Delete them, making sure this does not invalidate the analysis
- Transform the data (log, squares, etc.) to make the distribution more symmetrical
- Use robust methods that are not strongly influenced by anomalies (e.g., median instead of mean)
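The three options above can be compared on a small made-up sample. This sketch uses only the Python standard library; the income values and the "more than 10 times the median" deletion rule are assumptions chosen for illustration.

```python
import math
import statistics

# Made-up incomes with one extreme value.
incomes = [32_000, 35_000, 38_000, 41_000, 45_000, 2_000_000]

mean_all = statistics.mean(incomes)
median_all = statistics.median(incomes)

# Option 1: delete the anomaly (simple illustrative rule: more than 10x the median).
trimmed = [x for x in incomes if x <= 10 * median_all]

# Option 2: transform; logarithms pull the extreme value much closer to the rest.
logged = [math.log10(x) for x in incomes]

# Option 3: robust statistics; compare how far the mean and the median move
# once the extreme value is removed.
print(mean_all, median_all)
print(statistics.mean(trimmed), statistics.median(trimmed))
print(max(logged) - min(logged))
```

The mean changes by an order of magnitude when the extreme value is dropped, while the median shifts only slightly.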
Association Analysis:
Association analysis finds items that tend to occur together. It may be used on a shopping website,
where associated items are shown to customers.
Packages in R: arules, arulesViz.
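To make the idea concrete, the sketch below computes support, confidence, and lift for one rule by hand in plain Python. It does not use or reproduce the arules package; the baskets and the function name `rule_metrics` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_metrics(antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    support = both / n                            # share of baskets with both items
    confidence = both / item_counts[antecedent]   # share of antecedent baskets that also have the consequent
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift

print(rule_metrics("bread", "butter"))
```

A lift above 1 suggests the two items appear together more often than would be expected if they were independent.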
Regression Analysis:
Use many variables to predict one. An example is least squares regression (the usual assumption here
is that the errors, or residuals, are normally distributed).
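A minimal least squares sketch, assuming NumPy is available: two made-up predictors are used to predict one outcome, and `np.linalg.lstsq` picks the coefficients that minimize the sum of squared residuals. The data-generating rule and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up data: two predictors and one outcome from a known linear rule plus normal noise.
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=100)
x2 = rng.uniform(0, 5, size=100)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=100)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: coefficients that minimize the sum of squared residuals.
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coefficients

print(coefficients)       # estimates of intercept, slope for x1, slope for x2
print(residuals.std())    # spread of the residuals
```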
Correlated predictors: multicollinearity occurs when the predictor variables are associated with each other:
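A quick first check for correlated predictors is to look at pairwise correlations between them. The sketch below computes Pearson correlations in plain Python on made-up data; the variable names and the function name `pearson` are hypothetical, and a pairwise check is only a rough screen, not a complete diagnosis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up predictors: height in cm and height in inches carry almost the same information.
height_cm = [150, 160, 165, 170, 180, 190]
height_in = [59.1, 63.0, 65.0, 66.9, 70.9, 74.8]
age = [34, 22, 45, 29, 51, 38]

print(pearson(height_cm, height_in))  # very close to 1: redundant predictors
print(pearson(height_cm, age))        # much weaker association
```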
Sequence Mining:
Sequence mining is like association analysis, but here the sequence/order of events matters. Examples
are recommendation engines (if a person does a and b, then they are likely to do c…)
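The "a and b, then c" idea can be sketched by counting which event follows an ordered prefix. This is a simplified illustration in plain Python, not a full sequence mining algorithm; the sessions and the function name `next_event_counts` are made up.

```python
from collections import Counter

# Made-up event logs, one ordered list per user.
sessions = [
    ["a", "b", "c"],
    ["a", "b", "c", "d"],
    ["b", "a", "d"],
    ["a", "b", "d"],
]

def next_event_counts(sessions, prefix):
    """Count which event immediately follows the ordered prefix in each session."""
    k = len(prefix)
    counts = Counter()
    for events in sessions:
        for i in range(len(events) - k):
            if events[i:i + k] == prefix:
                counts[events[i + k]] += 1
    return counts

# Order matters: "a then b" and "b then a" are counted as different patterns.
print(next_event_counts(sessions, ["a", "b"]))
print(next_event_counts(sessions, ["b", "a"]))
```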
Text Mining:
Unlike the other types, this deals with unstructured data: instead of rows and columns of numeric
data, here we have a blob of text. E.g.,