Data Mining Summary (Final)
LC1
Data Mining (DM), AKA Knowledge Discovery in Databases (KDD): the extraction of useful patterns from huge data sources; the patterns must be VALID, USEFUL, and UNDERSTANDABLE.
Data object: represents an entity and is the main component of a data set.
Types of attributes:
- Nominal.
- Binary.
- Numeric: quantitative (interval-scaled, ratio-scaled).
Median: the middle value if the number of values is odd, the average of the two middle values if it is even, or estimated by interpolation for grouped data. For the ordered values $x_{(1)}, \dots, x_{(n)}$ of a data set with $n$ values:
$\text{median} = x_{\left(\frac{n+1}{2}\right)}$ if $n$ is odd, and $\text{median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$ if $n$ is even.
Mode: the value that occurs most frequently in the data. Empirical relation: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$.
Quartiles: $Q_1$ lies at position $\frac{1}{4}(n+1)$ and $Q_3$ at position $\frac{3}{4}(n+1)$ in the ordered list; the interquartile range is $IQR = Q_3 - Q_1$.
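A minimal sketch of these order statistics in Python (the $(n+1)/4$ positions follow the convention above; libraries such as numpy use slightly different conventions):

```python
# Median and quartile positions per the formulas above.
def median(values):
    x = sorted(values)
    n = len(x)
    mid = n // 2
    # Odd count: middle value; even count: average of the two middle values.
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

def quartiles(values):
    x = sorted(values)
    n = len(x)
    def at(pos):  # 1-based position, linear interpolation between neighbors
        k = int(pos) - 1
        frac = pos - int(pos)
        return x[k] if frac == 0 else x[k] + frac * (x[k + 1] - x[k])
    q1, q3 = at((n + 1) / 4), at(3 * (n + 1) / 4)
    return q1, q3, q3 - q1  # Q1, Q3, IQR

print(median([3, 1, 4, 1, 5]))           # 3
print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2.0, 6.0, 4.0)
```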
Why data visualization?
- Gain insight into an information space by mapping data onto graphical primitives.
- Provide a qualitative overview of large data sets.
- Search for patterns, trends, structure, irregularities, and relationships among the data.
- Help find interesting regions and suitable parameters for further quantitative analysis.
- Provide visual proof of computer representations derived.
Categorization of visualization methods:
Pixel-oriented (see the sketch after this table):
- For a data set of m dimensions, create m windows on the screen, one for each dimension.
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
- The colors of the pixels reflect the corresponding values.
Geometric projection:
- Direct visualization.
- Scatterplot and scatterplot matrices.
- Landscapes.
- Prosection views.
- Hyperslice.
- Parallel coordinates.
Icon-based:
- Visualization of the data values as features of icons (Chernoff faces, stick figures).
- General techniques:
  Shape coding: use shape to represent certain information encodings.
  Color icons: use color icons to encode more information.
  Tile bars: use small icons to represent the relevant feature vectors in document retrieval.
Hierarchical:
- Visualization of the data using a hierarchical partitioning into subspaces.
- Methods: dimensional stacking, worlds-within-worlds, tree-maps, cone trees, info cube.
Visualizing complex data and relations:
- Visualizing non-numerical data: text and social networks.
- Tag cloud: visualizing user-generated tags.
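As an illustration of the pixel-oriented idea, a hedged matplotlib sketch (assumes numpy and matplotlib are installed; the random data, grid size, and colormap are arbitrary choices):

```python
# Pixel-oriented visualization: one window per dimension, one pixel per
# record, color reflecting the value.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((400, 3))        # 400 records, m = 3 dimensions
order = np.argsort(data[:, 0])     # sort records by one attribute

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for dim, ax in enumerate(axes):
    # Lay the 400 sorted values of this dimension out as a 20x20 pixel
    # grid whose colors reflect the values.
    window = data[order, dim].reshape(20, 20)
    ax.imshow(window, cmap="viridis")
    ax.set_title(f"dimension {dim}")
    ax.axis("off")
plt.show()
```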
Similarity: a numerical measure of how alike two data objects are; the value is higher when the objects are more alike, and often falls in the range [0, 1].
Dissimilarity: a numerical measure of how different two data objects are; the value is lower when the objects are more alike. The minimum dissimilarity is often 0; the upper limit varies.
Ordinal variable: a categorical variable for which the possible values are ordered.
Cosine similarity: measures the similarity between two vectors of an inner-product space by the cosine of the angle between them; often used to measure document similarity in text analysis.
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
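A small sketch of this formula for two term-frequency vectors (the vectors here are made-up toy documents):

```python
# Cosine similarity between two vectors over the same vocabulary.
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Toy document term-count vectors.
d1 = [5, 0, 3, 0, 2]
d2 = [3, 0, 2, 0, 1]
print(round(cosine_similarity(d1, d2), 3))  # ~0.997 (nearly same direction)
```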
LC2
Major tasks in data preprocessing:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data reduction (dimensionality reduction, numerosity reduction, data compression).
- Data transformation (normalization, concept hierarchy generation).
Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
How to handle noisy data:
- Binning (see the sketch after this list).
- Regression.
- Clustering.
- Combined computer and human inspection.
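A minimal sketch of smoothing by bin means with equal-depth bins (the bin size and data are illustrative):

```python
# Sort the values, split them into equal-depth bins, and replace each
# value by the mean of its bin.
def smooth_by_bin_means(values, bin_size):
    x = sorted(values)
    smoothed = []
    for i in range(0, len(x), bin_size):
        bin_vals = x[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))  # each value -> bin mean
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(data, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```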
Data integration: combines data from multiple sources into a coherent store.
Entity identification problem: identify real-world entities from multiple data sources.
Correlation analysis (chi-square test):
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
The larger the $\chi^2$ value, the more likely the variables are related.
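A hedged sketch of this computation for a contingency table (the observed counts are made up for illustration):

```python
# Chi-square statistic for a contingency table of observed counts.
def chi_square(observed):
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Toy 2x2 table: rows = likes sci-fi yes/no, cols = plays chess yes/no.
print(round(chi_square([[250, 200], [50, 1000]]), 1))  # ~507.9: strongly related
```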
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
1. Dimensionality reduction
   • Wavelet transforms.
   • Principal Components Analysis (PCA): find a projection that captures the largest amount of variation in the data (see the sketch after this list).
   • Feature selection.
   • Feature extraction.
2. Numerosity reduction: reduce data volume by choosing alternative, smaller forms of data representation.
   • Parametric methods: regression and log-linear models.
     Linear regression: data modeled to fit a straight line.
     Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
     Log-linear model: approximates discrete multidimensional probability distributions.
   • Non-parametric methods: histograms, clustering, sampling.
     Clustering: partition the data set into clusters based on similarity, and store the cluster representations.
     Sampling: obtain a small sample s to represent the whole data set N.
     • Simple random sampling: there is an equal probability of selecting any particular item.
     • Sampling without replacement.
     • Sampling with replacement.
     • Stratified sampling: partition the data set, and draw samples from each partition.
   • Data cube aggregation (the lowest level of the data cube).
3. Data compression:
   a. String compression.
   b. Audio/Video compression.
   c. Dimensionality and numerosity reduction.
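As referenced under item 1, a minimal PCA sketch (assumes numpy is available; a didactic sketch, not a production implementation): project the data onto the top-k directions of largest variance.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by variance, descending
    components = eigvecs[:, order[:k]]      # top-k principal directions
    return Xc @ components                  # reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
print(pca(X, 1))  # each 2-D point reduced to 1 coordinate
```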
Data transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values.
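As one illustration of such a mapping, a min-max normalization sketch that rescales an attribute to a new range (the sample values are made up):

```python
# Min-max normalization: map each value v of an attribute to the range
# [new_min, new_max]. Assumes the attribute is not constant.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

print(min_max_normalize([12000, 73600, 98000]))
# [0.0, 0.716..., 1.0]
```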
Concept hierarchy: organizes concepts (i.e., attribute values) hierarchically, and is usually associated with each dimension in a data warehouse (e.g., street < city < state < country for a location dimension). Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers, and can be formed automatically for both numeric and nominal data.
Feature extraction: a technique used to reduce a large input data set into relevant features.
Data compression: the process of reducing the amount of data required to represent a given quantity of information. Advantages: reduced storage space and reduced transmission time.
Image compression: a type of data compression applied to digital images to reduce their cost of storage or transmission.
Data: the means for conveying information; not the same thing as information.
1. Coding redundancy: associated with the representation of information in the form of codes.
   • If the gray levels of an image are coded in a way that uses more code symbols than absolutely necessary to represent each gray level, the resulting image is said to contain coding redundancy.
   • Coding redundancy is caused by poor selection of the coding technique.
   • A coding technique assigns a unique code to each symbol of a message.
   • A wrong choice of coding technique creates unnecessary additional bits; these extra bits are called redundancy.
   • Basic concept of coding redundancy:
     Variable-length coding: use the probability of occurrence of each gray level (the histogram) to determine the length of the code representing that particular gray level.
     The shortest code words are assigned to the most frequent, high-probability gray levels.
     The longest code words are assigned to the least frequent, low-probability gray levels.
2. Interpixel redundancy (spatial/temporal redundancy): due to the correlation between neighboring pixels in an image.
   • Spatial redundancy: a correlation between neighboring pixel values.
   • Temporal redundancy: a correlation between adjacent frames in a sequence of images.
3. Psychovisual redundancy: exists because human perception does not involve quantitative analysis of every pixel or luminance value in the image. Example: quantization.
Lossless compression algorithms:
3. Pattern substitution.
4. Shannon-Fano algorithm.
5. Huffman coding (see the sketch after this list):
   • A variable-length coding technique.
   • The simplest approach to error-free compression is to reduce coding redundancy only.
   • The most popular method to yield the smallest possible number of code symbols per source symbol.
6. Truncated Huffman.
7. Arithmetic coding.
8. Lempel-Ziv-Welch (LZW) algorithm.
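As referenced under Huffman coding, a compact sketch using Python's heapq (the symbol frequencies are illustrative): repeatedly merge the two least-frequent subtrees and read codes off the branches.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()  # keeps heap comparisons away from the dict payloads
    heap = [[f, next(tiebreak), {sym: ""}] for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, [f1 + f2, next(tiebreak), merged])
    return heap[0][2]

# Frequent symbols get the shortest codes, rare ones the longest.
print(huffman_codes({"a": 0.4, "b": 0.3, "c": 0.1, "d": 0.1, "e": 0.1}))
```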
- Mean square error (MSE): the cumulative squared error between the original and reconstructed pixel values, where $f(i,j)$ and $\hat{f}(i,j)$ are the pixel values in the $i$th row and $j$th column of an $M \times N$ image:
  $$MSE = \frac{1}{MN} \sum_{i} \sum_{j} \left( f(i,j) - \hat{f}(i,j) \right)^2$$
- Peak signal-to-noise ratio (PSNR): used to evaluate image quality, where $B$ is the dynamic range (in bits) of the original image:
  $$PSNR = 10 \log_{10} \frac{(2^B - 1)^2}{MSE}$$
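A hedged numpy sketch of both definitions for 8-bit images (B = 8; the two tiny arrays are made up):

```python
import numpy as np

def mse(original, reconstructed):
    diff = original.astype(float) - reconstructed.astype(float)
    return np.mean(diff ** 2)  # average of squared pixel errors

def psnr(original, reconstructed, bits=8):
    peak = 2 ** bits - 1  # maximum pixel value for this bit depth
    return 10 * np.log10(peak ** 2 / mse(original, reconstructed))

orig = np.array([[255, 128], [64, 0]], dtype=np.uint8)
recon = np.array([[250, 130], [60, 5]], dtype=np.uint8)
print(round(psnr(orig, recon), 2))  # in dB; higher means closer to the original
```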