
Data Mining Summary (Final)

LC1

Data Mining (DM) / AKA: Knowledge Discovery in Databases (KDD): the extraction of useful patterns from huge data sources; the patterns must be VALID, USEFUL, and UNDERSTANDABLE.

Data Mining Methods


Descriptive Mining:
- Tasks characterize the general properties of the data.
- Methods: Association, Clustering, Summarization.

Predictive Mining:
- Tasks perform inferences on the current data in order to make predictions.
- Methods: Classification, Prediction, Time-series analysis.

Importance of Data Mining:

 Rapid computerization of businesses produces huge amounts of data.
 Makes the best use of data.
 Used for competitive advantage.

Data Mining (KDD) process (Important):

1. Understand the application domain.
2. Identify data sources and select target data.
3. Pre-process: cleaning, attribute selection.
4. Data mining to extract patterns or models.
5. Post-process: identify interesting or useful patterns.
6. Incorporate patterns in real-world tasks.

Data Mining vs. Big Data

Data Mining:
- Identifies and extracts relevant information from large sets of data.
- Uses different techniques based on statistics and Artificial Intelligence.
- Delivers specific and concrete results.
- Creates predictive, classification, or segmentation models.
- Transforms information into knowledge.

Big Data:
- Refers to the collection and storage of large amounts of data.
- Due to the volume, it is impossible to process it with conventional software.
- Special tools are needed to capture, manage, and process the information.
- These data groups have a reduced volume of information to make predictions.
- The quality of the information can vary considerably and affect the result of the analysis.

Challenges in data mining (Important):

- Noisy and incomplete data.
- Distributed data.
- Complex data.
- Performance.
- Data visualization.
- Data privacy and security.


Types of Data Sets

- Record: relational records, data matrix, document data, transaction data.
- Graph and Network: World Wide Web, molecular structures.
- Ordered: video data, temporal data (time series), genetic sequence data.
- Spatial, Image, Multimedia: spatial data (maps).

Data object: represents an entity and is the main component of a data set.

Attribute /AKA: (Dimension, Feature, Variable): a data field representing a characteristic or feature of a data object.

Types of Attributes:

- Nominal.
- Binary.
- Numeric: quantitative (Interval-Scaled, Ratio-Scaled).

Discrete Attribute: has only a finite or countably infinite set of values.

Binary Attribute: a special case of a discrete attribute.

Continuous Attribute: has real numbers as attribute values.


Mean: the average value of the data. μ = (Σx) / N, where x ranges over the values of the data objects and N is the number of objects.

Median: the middle value if there is an odd number of values, the average of the middle two values otherwise, or estimated by interpolation for grouped data. For an ordered list x(1), ..., x(n): median = x((n+1)/2) if n is odd, and median = (x(n/2) + x(n/2+1)) / 2 if n is even.

Mode: the value that occurs most frequently in the data. Empirical relation: mean − mode ≈ 3 × (mean − median).

Quartiles: Q1 lies at position (1/4)(n + 1) in the ordered data, Q3 at position (3/4)(n + 1), and Q2 is the median, at position (2/4)(n + 1).

Inter-Quartile Range (IQR): IQR = Q3 − Q1.

Outlier: usually, a value lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR.


Variance: S² = Σ(Xi − X̄)² / (n − 1), where Xi is the value of one object and n is the number of objects.

Standard deviation (s/σ): the square root of the variance.
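
A minimal numpy sketch computing the descriptive statistics above; the sample values (and the deliberately extreme value 95 that shows up as an outlier) are illustrative assumptions.

```python
# Descriptive statistics sketch for a small illustrative sample.
import numpy as np

x = np.array([15, 18, 21, 21, 22, 24, 25, 28, 30, 95], dtype=float)

mean = x.mean()
median = np.median(x)
values, counts = np.unique(x, return_counts=True)
mode = values[counts.argmax()]                      # most frequent value

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

variance = x.var(ddof=1)                            # sample variance, n - 1 denominator
std_dev = np.sqrt(variance)

print(mean, median, mode, iqr, outliers, round(std_dev, 2))
```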


Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of the five-number summary.

Histogram: the x-axis shows values; the y-axis represents frequencies.

Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi.

Quantile-Quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane.
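
A hedged matplotlib sketch of several of these displays on synthetic data; the sample, figure layout, and bin count are illustrative assumptions.

```python
# Basic statistical displays (sketch; matplotlib/numpy assumed, data synthetic).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)   # one numeric attribute
y = 0.8 * x + rng.normal(scale=5, size=200)  # a second, correlated attribute

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].boxplot(x)                        # boxplot: five-number summary
axes[0, 0].set_title("Boxplot")

axes[0, 1].hist(x, bins=20)                  # histogram: values vs. frequencies
axes[0, 1].set_title("Histogram")

xs = np.sort(x)                              # quantile plot: f_i vs. ordered values
f = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)
axes[1, 0].plot(xs, f, marker=".", linestyle="none")
axes[1, 0].set_title("Quantile plot")

axes[1, 1].scatter(x, y, s=10)               # scatter plot: pairs as points
axes[1, 1].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```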

Why do we visualize data:

- Gain insight into an information space by mapping data onto graphical primitives.
- Provide a qualitative overview of large data sets.
- Search for patterns, trends, structure, irregularities, and relationships among data.
- Help find interesting regions and suitable parameters for further quantitative analysis.
- Provide visual proof of the computer representations derived.
Categorization of visualization methods

Pixel-oriented:
- For a data set of m dimensions, create m windows on the screen, one for each dimension.
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
- The colors of the pixels reflect the corresponding values.

Geometric projection:
- Direct visualization.
- Scatterplot and scatterplot matrices.
- Landscapes.
- Prosection views.
- Hyperslice.
- Parallel coordinates.

Icon-based: visualization of the data values as features of icons (Chernoff faces, stick figures).
General techniques:
- Shape coding: use shape to represent certain information encodings.
- Color icons: use color icons to encode more information.
- Tile bars: use small icons to represent the relevant feature vectors in document retrieval.

Hierarchical: visualization of the data using a hierarchical partitioning into subspaces.
Methods: dimensional stacking, worlds-within-worlds, tree-map, cone trees, info cube.

Visualizing complex data and relations: visualizing non-numerical data such as text and social networks.
- Tag cloud: visualizing user-generated tags.

Similarity: a numerical measure of how alike two data objects are; the value is higher when objects are more alike, and it often falls in the range [0, 1].

Dissimilarity: a numerical measure of how different two data objects are; the value is lower when objects are more alike. The minimum dissimilarity is often 0; the upper limit varies.

Proximity: refers to similarity or dissimilarity.


Z-Score: z = (x − μ) / σ, where x is the raw score to be standardized, μ is the mean of the population, and σ is the standard deviation.
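
A minimal sketch of z-score standardization with numpy; the sample values are illustrative assumptions.

```python
# Z-score standardization: z = (x - mu) / sigma (illustrative data).
import numpy as np

x = np.array([12.0, 15.0, 9.0, 20.0, 14.0])
z = (x - x.mean()) / x.std()   # population standard deviation
print(z.round(2))
```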

Ordinal Variables: categorical variables for which the possible values are ordered.

Document: can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords).

Cosine Similarity: measures the similarity between two vectors of an inner product space by the cosine of the angle between them; it is often used to measure document similarity in text analysis.

cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)
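
A minimal numpy sketch of cosine similarity between two term-frequency vectors; the vectors are illustrative assumptions.

```python
# Cosine similarity between two term-frequency vectors (illustrative values).
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 3))   # ~0.94: the documents have similar word frequencies
```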
LC2

Measures for data quality:

1. Accuracy: correct or wrong, accurate or not.
2. Completeness.
3. Consistency.
4. Timeliness.
5. Integrity: how trustworthy (correct) the data are.
6. Conformity: the data values of the same attribute must be represented in a uniform format and data type.
7. Uniqueness.
8. Validity.
9. Currency.
10. Precision.

Major Tasks in Data Preprocessing:

- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data reduction: dimensionality reduction, numerosity reduction, data compression.
- Data transformation: normalization, concept hierarchy generation.

Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.

Noisy data: meaningless data / data containing noise, errors, or outliers.

Inconsistent data: containing discrepancies in codes or names.

How to handle missing data:

1. Ignore the tuple.
2. Fill in the missing values manually.
3. Fill in the missing values automatically with:
   a. a global constant.
   b. the attribute mean.
   c. the attribute mean for all samples belonging to the same class.
   d. the most probable value.
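
A minimal pandas sketch of the automatic fill strategies above; the column names and values are illustrative assumptions.

```python
# Handling missing values (sketch; the toy DataFrame is an assumption).
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 45.0, None, 55.0],
})

# (a) a global constant
filled_const = df["income"].fillna(0)

# (b) the attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# (c) the attribute mean for all samples of the same class
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())
```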

Incorrect attribute values may be due to:

- faulty data collection instruments.
- data entry problems.
- data transmission problems.
- technology limitations.
- inconsistency in naming conventions.
How to handle noisy data:

- Binning.
- Regression.
- Clustering.
- Combined computer and human inspection.

Data integration: combines data from multiple sources into a coherent store.

Schema integration: integrate metadata from different sources.

Entity identification problem: identify real-world entities from multiple data sources.

Correlation analysis (chi-square test): χ² = Σ (observed − expected)² / expected. The larger the χ² value, the more likely the variables are related.
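
A hedged sketch of the chi-square test on a small contingency table, using scipy.stats as an assumed dependency; the counts are illustrative.

```python
# Chi-square correlation analysis on a 2x2 contingency table (illustrative counts).
import numpy as np
from scipy.stats import chi2_contingency

# rows: plays chess / does not; columns: likes science fiction / does not
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)   # a large chi2 (small p) suggests the attributes are related
```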

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

Data reduction strategies:

1. Dimensionality reduction
   • Wavelet transforms.
   • Principal Components Analysis (PCA): find a projection that captures the largest amount of variation in the data (see the sketch after this list).
   • Feature selection.
   • Feature extraction.
2. Numerosity reduction: reduce data volume by choosing alternative, smaller forms of data representation.
   • Parametric methods: regression and log-linear models.
      	Linear regression: data modeled to fit a straight line.
      	Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
      	Log-linear model: approximates discrete multidimensional probability distributions.
   • Non-parametric methods: histograms, clustering, sampling.
      	Clustering: partition the data set into clusters based on similarity and store the cluster representation.
      	Sampling: obtain a small sample s to represent the whole data set N.
         • Simple random sampling: there is an equal probability of selecting any particular item.
         • Sampling without replacement.
         • Sampling with replacement.
         • Stratified sampling: partition the data set, and draw samples from each partition.
   • Data cube aggregation (the lowest level of the data cube).
3. Data compression.
   a. String compression.
   b. Audio/video compression.
   c. Dimensionality and numerosity reduction.
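
A hedged sketch of two of the strategies above, PCA (dimensionality reduction) and simple random / stratified sampling (numerosity reduction), using scikit-learn and pandas as assumed dependencies on illustrative data.

```python
# Data reduction sketch: PCA and sampling (illustrative data, assumed libraries).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # 1000 records, 10 dimensions

# Dimensionality reduction: keep the 3 components capturing the most variation.
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)                    # (1000, 3)

# Numerosity reduction: simple random sampling without replacement.
df = pd.DataFrame(X).assign(label=rng.choice(["a", "b"], size=1000, p=[0.2, 0.8]))
simple_sample = df.sample(n=100, replace=False, random_state=0)

# Stratified sampling: draw 10% from each label partition.
stratified_sample = df.groupby("label", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(len(simple_sample), len(stratified_sample))
```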
Data Transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values so that each old value can be identified with one of the new values.

Concept hierarchy: organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse. Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers, and can be formed automatically for both numeric and nominal data.

Methods of data transformation (a normalization/binning sketch follows the list):

1. Smoothing: remove noise from the data.
2. Attribute/feature construction: new attributes constructed from the given ones.
3. Aggregation: summarization, data cube construction.
4. Normalization: scale values to fall within a smaller, specified range, by:
   • min-max normalization
   • z-score normalization
   • decimal scaling
5. Discretization: divide the range of a continuous attribute into intervals.
   • Types of attributes:
      o Nominal: values from an unordered set, e.g., color, profession.
      o Ordinal: values from an ordered set, e.g., military or academic rank.
      o Numeric: real numbers, e.g., integer or real values.
   • Methods of data discretization:
      o Binning (top-down split, unsupervised), two types:
         	Equal-width (distance) partitioning: divides the range into N intervals of equal size.
         	Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples.
      o Histogram analysis (top-down split, unsupervised).
      o Clustering analysis (top-down split or bottom-up merge, unsupervised).
      o Decision tree analysis (top-down split, supervised).
      o Correlation analysis (bottom-up merge, unsupervised).
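
A hedged pandas sketch of min-max / z-score normalization and equal-width / equal-depth binning; the values and interval count are illustrative assumptions.

```python
# Normalization and discretization sketch (illustrative data, pandas assumed).
import pandas as pd

x = pd.Series([12.0, 15.0, 9.0, 20.0, 14.0, 30.0, 18.0, 25.0])

# Min-max normalization to [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization.
z_score = (x - x.mean()) / x.std()

# Equal-width binning: 3 intervals of equal size over the value range.
equal_width = pd.cut(x, bins=3)

# Equal-depth (frequency) binning: 3 intervals with ~equal numbers of samples.
equal_depth = pd.qcut(x, q=3)

print(min_max.round(2).tolist())
print(z_score.round(2).tolist())
print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```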
LC3

Data reduction has two types: dimensionality reduction and data compression.

Major techniques of dimensionality reduction:

1. Feature selection: select a subset of M features out of the N available (M < N) that leads to the smallest classification/clustering error, which results in:
   a. Training a machine learning algorithm faster.
   b. Reducing the complexity of a model and making it easier to interpret.
   c. Building a sensible model with better prediction power.
   d. Reducing overfitting by selecting the right set of features.

Feature selection methods (a scikit-learn sketch follows the comparison):

Filter-based:
- Evaluation is independent of the classification algorithm.
- Pros:
  - Computationally cheaper compared with wrappers.
  - Fastest running time.
  - Lower risk of overfitting.
  - Easily scales to high-dimensional datasets.
- Cons:
  - No interaction with the classification model during feature selection.
  - Mostly ignores feature dependencies and considers each feature separately.
  - May lead to lower predictive performance compared with wrapper-based methods.
- Ex: PCA, ANOVA, etc.

Wrapper method:
- Evaluation uses criteria related to the classification algorithm.
- Pros:
  - Interacts with the classifier for feature selection.
  - More comprehensive search of the feature-set space.
  - Considers feature dependencies.
  - Better generalization than the filter approach.
- Cons:
  - High computational cost.
  - No guarantee of optimality of the solution if predictions are made with a different classifier.
  - Becomes computationally infeasible as the number of features increases.
  - Higher risk of overfitting compared to filter-based methods.
- Ex: Forward Selection, Backward Elimination, etc.
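
A hedged scikit-learn sketch contrasting a filter method (ANOVA F-test via SelectKBest) with a wrapper method (recursive feature elimination around a classifier); the dataset and parameter choices are illustrative assumptions.

```python
# Filter vs. wrapper feature selection sketch (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter: score each feature independently of any classifier (ANOVA F-test).
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter keeps:", filter_sel.get_support(indices=True))

# Wrapper: repeatedly fit a classifier and drop the weakest features.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("wrapper keeps:", wrapper_sel.get_support(indices=True))
```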

2. Feature extraction: a technique used to reduce a large input data set into relevant features.

Data Compression: the process of reducing the amount of data required to represent a given quantity of information. Advantages: reduces storage space and reduces transmission time.

Image Compression: a type of data compression applied to digital images to reduce their cost for storage or transmission.

Data: the means for conveying information; not the same thing as information.

Image: information + redundant data.


Lossless compression:
- No data are lost (no loss in image quality).
- The original image can be recreated exactly from the compressed data.
- The Graphics Interchange Format (GIF) is an image format that uses lossless compression.
- Applications: generally used for text or spreadsheet files, where losing words or financial data could pose a problem.

Lossy compression:
- The class of data encoding methods that allows a loss of some of the image data.
- The uncompressed image cannot be the same as the original image file.
- Lossy methods can provide high degrees of compression and result in smaller compressed files, but some of the original pixels, sound waves, or video frames are removed forever.
- JPEG is an example of lossy compression.
- Applications: broadcast television, videoconferencing.
Redundancy: means repetitive data.

Redundant data: data that provide no relevant information.

Compression ratio: CR = (original file size) / (size after compression).

Relative data redundancy: RD = 1 − 1/CR = 1 − (size after compression) / (original file size).
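
A tiny worked sketch of these two formulas; the file sizes are illustrative assumptions.

```python
# Compression ratio and relative data redundancy (illustrative sizes).
original_size = 1_048_576      # bytes, e.g. an uncompressed 1 MiB image
compressed_size = 262_144      # bytes after compression

cr = original_size / compressed_size      # compression ratio
rd = 1 - 1 / cr                           # relative data redundancy
print(cr, rd)                             # 4.0 and 0.75: 75% of the data is redundant
```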

Types of data redundancies:

1. Coding redundancy: associated with the representation of information in the form of codes.
   • If the gray levels of an image are coded in a way that uses more code symbols than absolutely necessary to represent each gray level, the resulting image is said to contain coding redundancy.
   • Coding redundancy is caused by poor selection of the coding technique.
   • Coding techniques assign a unique code to each symbol of a message.
   • A wrong choice of coding technique creates unnecessary additional bits; these extra bits are called redundancy.
   • Basic concept of coding redundancy:
      	Variable-length coding: use the probability of occurrence of each gray level (histogram) to determine the length of the code representing that particular gray level.
      	The shortest code words are assigned to the most frequent (high-probability) gray levels.
      	The longest code words are assigned to the least frequent (low-probability) gray levels.
2. Interpixel redundancy (spatial/temporal redundancy): due to the correlation between neighboring pixels in an image.
   • Spatial redundancy: there is a correlation between neighboring pixel values.
   • Temporal redundancy: there is a correlation between adjacent frames in a sequence of images.
3. Psychovisual redundancy: exists because human perception does not involve quantitative analysis of every pixel or luminance value in the image. Ex: quantization.
Lossless compression algorithms:

1. Repetitive Sequence Suppression
   • If a run of n successive identical tokens appears, replace the run with a flag followed by the count of occurrences. Ex: 8940000000 -> 894f7.
   • Applications: silence in audio, bitmaps, blanks in text, sparse matrices.

2. Run-Length Encoding (see the sketch after this list)
   • Given a sequence of image elements [x1, x2, …, xn] (row by row), the output is pairs (c1, l1), (c2, l2), …, (ck, lk), where c represents an image intensity or color and l is the length of that run of pixels. Ex: 1111222333333 -> (1,4),(2,3),(3,6).

3. Pattern Substitution
4. Shannon-Fano Algorithm
5. Huffman Coding
   • A variable-length coding technique.
   • The simplest approach to error-free compression is to reduce coding redundancy only.
   • The most popular method to yield the smallest possible number of code symbols per source symbol.

6. Truncated Huffman
7. Arithmetic Coding
8. Lempel-Ziv-Welch (LZW) Algorithm
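
A minimal run-length encoding sketch that produces the (value, run length) pairs described above; the input string is the document's own example.

```python
# Run-length encoding: emit (value, run length) pairs for each run of equal symbols.
from itertools import groupby

def run_length_encode(seq):
    return [(value, len(list(group))) for value, group in groupby(seq)]

print(run_length_encode("1111222333333"))
# [('1', 4), ('2', 3), ('3', 6)]
```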

Compression algorithm evaluation metrics:

- Compression ratio: CR = (original file size) / (size after compression).

- Mean Squared Error (MSE): the cumulative squared error between the original and reconstructed pixel values in the i-th row and j-th column.

- Peak Signal-to-Noise Ratio (PSNR): used to evaluate the image quality, where B is the dynamic range (in bits) of the original image.

- Structural Similarity Index Measure (SSIM).
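
A hedged numpy sketch of MSE and PSNR using their standard definitions (MSE averaged over all pixels; PSNR = 10·log10(peak² / MSE) with peak = 2^B − 1); the 8-bit test images are illustrative assumptions.

```python
# MSE and PSNR between an original and a reconstructed image (illustrative 8-bit data).
import numpy as np

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
reconstructed = np.clip(original + rng.normal(scale=5, size=original.shape), 0, 255)

mse = np.mean((original - reconstructed) ** 2)        # mean squared error over all pixels
B = 8                                                 # dynamic range of the original image in bits
psnr = 10 * np.log10(((2 ** B - 1) ** 2) / mse)       # peak signal-to-noise ratio in dB

print(round(mse, 2), round(psnr, 2))
```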
