Data Mining Summary (Final)
LC1
Data Mining (DM), AKA Knowledge Discovery in Databases (KDD): the extraction of useful patterns from huge data sources; the patterns must be VALID, USEFUL, and UNDERSTANDABLE.
Data object: represents an entity and is the main component of a data set.
Types of attributes:
- Nominal.
- Binary.
- Numeric: quantitative (interval-scaled, ratio-scaled).
Median: the middle value if the number of values is odd, the average of the two middle values if it is even, or estimated by interpolation for grouped data. For the ordered values $x_{(1)}, \dots, x_{(n)}$ of a data set with $n$ values:
$\text{median} = x_{\left(\frac{n+1}{2}\right)}$ if $n$ is odd, and $\text{median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$ if $n$ is even.
Mode: the value that occurs most frequently in the data. Empirical relation: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$.
Quartiles: $Q_1$ lies at position $\frac{1}{4}(n+1)$ and $Q_3$ at position $\frac{3}{4}(n+1)$ in the ordered list; the interquartile range is $IQR = Q_3 - Q_1$.
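A minimal sketch of these order statistics in Python (the $(n+1)/4$ positions follow the convention above; libraries such as numpy use slightly different conventions):

```python
# Median and quartile positions per the formulas above.
def median(values):
    x = sorted(values)
    n = len(x)
    mid = n // 2
    # Odd count: middle value; even count: average of the two middle values.
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

def quartiles(values):
    x = sorted(values)
    n = len(x)
    def at(pos):  # 1-based position, linear interpolation between neighbors
        k = int(pos) - 1
        frac = pos - int(pos)
        return x[k] if frac == 0 else x[k] + frac * (x[k + 1] - x[k])
    q1, q3 = at((n + 1) / 4), at(3 * (n + 1) / 4)
    return q1, q3, q3 - q1  # Q1, Q3, IQR

print(median([3, 1, 4, 1, 5]))           # 3
print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2.0, 6.0, 4.0)
```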
Why data visualization?
- Gain insight into an information space by mapping data onto graphical primitives.
- Provide a qualitative overview of large data sets.
- Search for patterns, trends, structure, irregularities, and relationships among the data.
- Help find interesting regions and suitable parameters for further quantitative analysis.
- Provide visual proof of computer representations derived.
Categorization of visualization methods:
Pixel-oriented (see the sketch after this table):
- For a data set of m dimensions, create m windows on the screen, one for each dimension.
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
- The colors of the pixels reflect the corresponding values.
Geometric projection:
- Direct visualization.
- Scatterplot and scatterplot matrices.
- Landscapes.
- Prosection views.
- Hyperslice.
- Parallel coordinates.
Icon-based:
- Visualization of the data values as features of icons (Chernoff faces, stick figures).
- General techniques:
  Shape coding: use shape to represent certain information encodings.
  Color icons: use color icons to encode more information.
  Tile bars: use small icons to represent the relevant feature vectors in document retrieval.
Hierarchical:
- Visualization of the data using a hierarchical partitioning into subspaces.
- Methods: dimensional stacking, worlds-within-worlds, tree-maps, cone trees, info cube.
Visualizing complex data and relations:
- Visualizing non-numerical data: text and social networks.
- Tag cloud: visualizing user-generated tags.
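As an illustration of the pixel-oriented idea, a hedged matplotlib sketch (assumes numpy and matplotlib are installed; the random data, grid size, and colormap are arbitrary choices):

```python
# Pixel-oriented visualization: one window per dimension, one pixel per
# record, color reflecting the value.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((400, 3))        # 400 records, m = 3 dimensions
order = np.argsort(data[:, 0])     # sort records by one attribute

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for dim, ax in enumerate(axes):
    # Lay the 400 sorted values of this dimension out as a 20x20 pixel
    # grid whose colors reflect the values.
    window = data[order, dim].reshape(20, 20)
    ax.imshow(window, cmap="viridis")
    ax.set_title(f"dimension {dim}")
    ax.axis("off")
plt.show()
```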
Similarity: a numerical measure of how alike two data objects are; the value is higher when the objects are more alike, and often falls in the range [0, 1].
Dissimilarity: a numerical measure of how different two data objects are; the value is lower when the objects are more alike. The minimum dissimilarity is often 0; the upper limit varies.
Ordinal variable: a categorical variable for which the possible values are ordered.
Cosine similarity: measures the similarity between two vectors of an inner-product space by the cosine of the angle between them; often used to measure document similarity in text analysis.
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
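A small sketch of this formula for two term-frequency vectors (the vectors here are made-up toy documents):

```python
# Cosine similarity between two vectors over the same vocabulary.
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Toy document term-count vectors.
d1 = [5, 0, 3, 0, 2]
d2 = [3, 0, 2, 0, 1]
print(round(cosine_similarity(d1, d2), 3))  # ~0.997 (nearly same direction)
```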
LC2
Major tasks in data preprocessing:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data reduction (dimensionality reduction, numerosity reduction, data compression).
- Data transformation (normalization, concept hierarchy generation).
Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
How to handle noisy data:
- Binning (see the sketch after this list).
- Regression.
- Clustering.
- Combined computer and human inspection.
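A minimal sketch of smoothing by bin means with equal-depth bins (the bin size and data are illustrative):

```python
# Sort the values, split them into equal-depth bins, and replace each
# value by the mean of its bin.
def smooth_by_bin_means(values, bin_size):
    x = sorted(values)
    smoothed = []
    for i in range(0, len(x), bin_size):
        bin_vals = x[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))  # each value -> bin mean
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(data, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```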
Data integration: combines data from multiple sources into a coherent store.
Entity identification problem: identify real-world entities from multiple data sources.
Correlation analysis (chi-square test):
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
The larger the $\chi^2$ value, the more likely the variables are related.
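A hedged sketch of this computation for a contingency table (the observed counts are made up for illustration):

```python
# Chi-square statistic for a contingency table of observed counts.
def chi_square(observed):
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Toy 2x2 table: rows = likes sci-fi yes/no, cols = plays chess yes/no.
print(round(chi_square([[250, 200], [50, 1000]]), 1))  # ~507.9: strongly related
```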
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
1. Dimensionality reduction
   • Wavelet transforms.
   • Principal Components Analysis (PCA): find a projection that captures the largest amount of variation in the data (see the sketch after this list).
   • Feature selection.
   • Feature extraction.
2. Numerosity reduction: reduce data volume by choosing alternative, smaller forms of data representation.
   • Parametric methods: regression and log-linear models.
     Linear regression: data modeled to fit a straight line.
     Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
     Log-linear model: approximates discrete multidimensional probability distributions.
   • Non-parametric methods: histograms, clustering, sampling.
     Clustering: partition the data set into clusters based on similarity, and store the cluster representations.
     Sampling: obtain a small sample s to represent the whole data set N.
     • Simple random sampling: there is an equal probability of selecting any particular item.
     • Sampling without replacement.
     • Sampling with replacement.
     • Stratified sampling: partition the data set, and draw samples from each partition.
   • Data cube aggregation (the lowest level of the data cube).
3. Data compression:
   a. String compression.
   b. Audio/Video compression.
   c. Dimensionality and numerosity reduction.
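As referenced under item 1, a minimal PCA sketch (assumes numpy is available; a didactic sketch, not a production implementation): project the data onto the top-k directions of largest variance.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by variance, descending
    components = eigvecs[:, order[:k]]      # top-k principal directions
    return Xc @ components                  # reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
print(pca(X, 1))  # each 2-D point reduced to 1 coordinate
```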
Data transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values.
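As one illustration of such a mapping, a min-max normalization sketch that rescales an attribute to a new range (the sample values are made up):

```python
# Min-max normalization: map each value v of an attribute to the range
# [new_min, new_max]. Assumes the attribute is not constant.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

print(min_max_normalize([12000, 73600, 98000]))
# [0.0, 0.716..., 1.0]
```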
Concept hierarchy: organizes concepts (i.e., attribute values) hierarchically, and is usually associated with each dimension in a data warehouse (e.g., street < city < state < country for a location dimension). Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers, and can be formed automatically for both numeric and nominal data.
Feature extraction: a technique used to reduce a large input data set into relevant features.
Data compression: the process of reducing the amount of data required to represent a given quantity of information. Advantages: reduced storage space and reduced transmission time.
Image compression: a type of data compression applied to digital images to reduce their cost of storage or transmission.
Data: the means for conveying information; not the same thing as information.
1. Coding redundancy: associated with the representation of information in the form of codes.
   • If the gray levels of an image are coded in a way that uses more code symbols than absolutely necessary to represent each gray level, the resulting image is said to contain coding redundancy.
   • Coding redundancy is caused by poor selection of the coding technique.
   • A coding technique assigns a unique code to each symbol of a message.
   • A wrong choice of coding technique creates unnecessary additional bits; these extra bits are called redundancy.
   • Basic concept of coding redundancy:
     Variable-length coding: use the probability of occurrence of each gray level (the histogram) to determine the length of the code representing that particular gray level.
     The shortest code words are assigned to the most frequent, high-probability gray levels.
     The longest code words are assigned to the least frequent, low-probability gray levels.
2. Interpixel redundancy (spatial/temporal redundancy): due to the correlation between neighboring pixels in an image.
   • Spatial redundancy: a correlation between neighboring pixel values.
   • Temporal redundancy: a correlation between adjacent frames in a sequence of images.
3. Psychovisual redundancy: exists because human perception does not involve quantitative analysis of every pixel or luminance value in the image. Example: quantization.
Lossless compression algorithms:
3. Pattern substitution.
4. Shannon-Fano algorithm.
5. Huffman coding (see the sketch after this list):
   • A variable-length coding technique.
   • The simplest approach to error-free compression is to reduce coding redundancy only.
   • The most popular method to yield the smallest possible number of code symbols per source symbol.
6. Truncated Huffman.
7. Arithmetic coding.
8. Lempel-Ziv-Welch (LZW) algorithm.
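As referenced under Huffman coding, a compact sketch using Python's heapq (the symbol frequencies are illustrative): repeatedly merge the two least-frequent subtrees and read codes off the branches.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()  # keeps heap comparisons away from the dict payloads
    heap = [[f, next(tiebreak), {sym: ""}] for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, [f1 + f2, next(tiebreak), merged])
    return heap[0][2]

# Frequent symbols get the shortest codes, rare ones the longest.
print(huffman_codes({"a": 0.4, "b": 0.3, "c": 0.1, "d": 0.1, "e": 0.1}))
```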
- Mean square error (MSE): the cumulative squared error between the original and reconstructed pixel values, where $f(i,j)$ and $\hat{f}(i,j)$ are the pixel values in the $i$th row and $j$th column of an $M \times N$ image:
  $$MSE = \frac{1}{MN} \sum_{i} \sum_{j} \left( f(i,j) - \hat{f}(i,j) \right)^2$$
- Peak signal-to-noise ratio (PSNR): used to evaluate image quality, where $B$ is the dynamic range (in bits) of the original image:
  $$PSNR = 10 \log_{10} \frac{(2^B - 1)^2}{MSE}$$
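A hedged numpy sketch of both definitions for 8-bit images (B = 8; the two tiny arrays are made up):

```python
import numpy as np

def mse(original, reconstructed):
    diff = original.astype(float) - reconstructed.astype(float)
    return np.mean(diff ** 2)  # average of squared pixel errors

def psnr(original, reconstructed, bits=8):
    peak = 2 ** bits - 1  # maximum pixel value for this bit depth
    return 10 * np.log10(peak ** 2 / mse(original, reconstructed))

orig = np.array([[255, 128], [64, 0]], dtype=np.uint8)
recon = np.array([[250, 130], [60, 5]], dtype=np.uint8)
print(round(psnr(orig, recon), 2))  # in dB; higher means closer to the original
```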