IDS Unit 2

Uploaded by

AI&DS VGNT

Data-Related Issues for Successful Data Mining

Type of Data:
– Data sets differ in a number of ways.
– Type of data determines which techniques can be used to analyze the data.
Quality of Data:
– Data is often far from perfect.
– Improving data quality improves the quality of the resulting analysis.
Preprocessing Steps to Make Data More Suitable for Data Mining:
– Raw data must be processed in order to make it suitable for analysis.
• Improve data quality,
• Modify data so that it better fits a specified data mining technique.
Analyzing Data in Terms of its Relationships:
– Find relationships among data objects, then perform the remaining analysis using these
relationships rather than the data objects themselves.
– There are many similarity or distance measures, and the proper choice depends on the
type of data and application.
Data Mining 3
What is Data?
• Data sets are made up of data objects.
• A data object represents an entity.
– Also called sample, example, instance, data point, object, tuple.
• Data objects are described by attributes.
• An attribute is a property or characteristic of a data object.
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object.
• Attribute values are numbers or symbols assigned to an attribute.

A Data Object

• database rows → data objects

• database columns → attributes

Attributes
• Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values

– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different; ID has no limit but age has a
maximum and minimum value

Attribute Types
Four main types of attributes
Nominal: Categorical (Qualitative)
– categories, states, or “names of things”
• Hair color, marital status, occupation, ID numbers, zip codes
– An important nominal attribute: Binary
• Nominal attribute with only 2 states (0 and 1)
Ordinal: Categorical (Qualitative)
– Values have a meaningful order (ranking) but magnitude between successive values is
not known.
• Size = {small, medium, large}, grades, army rankings
Interval: Numeric (Quantitative)
– Measured on a scale of equal-sized units
– Values have order:
• temperature in °C or °F, calendar dates
– No true zero-point: ratios are not meaningful
Ratio: Numeric (Quantitative)
– Inherent zero-point: ratios are meaningful
• temperature in Kelvin, length, counts, monetary quantities
Attribute Types
Four main types of attributes: Nominal Attributes
• The values of a nominal attribute are symbols or names of things.
– Each value represents some kind of category, code, or state,
• Nominal attributes are also referred to as categorical attributes.
• The values of nominal attributes do not have any meaningful order.
• Example: The attribute marital_status can take on the values single, married,
divorced, and widowed.

• Because nominal attribute values have no meaningful order and are not quantitative:
– It makes no sense to find the mean (average) value or median (middle) value for such an
attribute.
– However, we can find the attribute’s most commonly occurring value (mode).

Attribute Types
Four main types of attributes: Nominal Attributes
• A binary attribute is a special nominal attribute with only two states: 0 or 1.

• A binary attribute is symmetric if both of its states are equally valuable and carry the
same weight.
– Example: the attribute gender having the states male and female.

• A binary attribute is asymmetric if the outcomes of the states are not equally
important.
– Example: Positive and negative outcomes of a medical test for HIV.
– By convention, we code the most important outcome, which is usually the rarest one, by 1
(e.g., HIV positive) and the other by 0 (e.g., HIV negative).

Attribute Types
Four main types of attributes: Ordinal Attributes
• An ordinal attribute is an attribute with possible values that have a meaningful order
or ranking among them, but the magnitude between successive values is not known.
• Example: An ordinal attribute drink_size corresponds to the size of drinks available at
a fast-food restaurant.
– This attribute has three possible values: small, medium, and large.
– The values have a meaningful sequence (which corresponds to increasing drink size);
however, we cannot tell from the values how much bigger, say, a large is than a medium.
• The central tendency of an ordinal attribute can be represented by its mode and its
median (middle value in an ordered sequence), but the mean cannot be defined.

Attribute Types
Four main types of attributes: Interval Attributes
• Interval attributes are measured on a scale of equal-size units.
– We can compare and quantify the difference between values of interval attributes.
• Example: A temperature attribute is an interval attribute.
– We can quantify the difference between values. For example, a temperature of 20°C is five
degrees higher than a temperature of 15°C.
– Temperatures in Celsius do not have a true zero-point, that is, 0°C does not indicate “no
temperature.”
– Although we can compute the difference between temperature values, we cannot talk of
one temperature value as being a multiple of another.
• Without a true zero, we cannot say, for instance, that 10°C is twice as warm as 5°C. That is, we
cannot speak of the values in terms of ratios.

• The central tendency of an interval attribute can be represented by its mode, its
median (middle value in an ordered sequence), and its mean.
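
A small Python sketch of why ratios are not meaningful on an interval scale: the same two temperatures are converted to Kelvin, a ratio scale with a true zero. The helper name `celsius_to_kelvin` and the sample temperatures are illustrative choices, not part of the slides.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a ratio scale with a true zero)."""
    return c + 273.15

t_low, t_high = 5.0, 10.0

# Differences are meaningful on an interval scale and survive the conversion.
diff_celsius = t_high - t_low
diff_kelvin = celsius_to_kelvin(t_high) - celsius_to_kelvin(t_low)

# The ratio of the Celsius readings suggests "twice as warm", which is misleading.
naive_ratio = t_high / t_low
# On the Kelvin scale the ratio is only slightly above 1.
true_ratio = celsius_to_kelvin(t_high) / celsius_to_kelvin(t_low)

print(diff_celsius, round(diff_kelvin, 2), naive_ratio, round(true_ratio, 3))
```

The difference of five degrees is the same on both scales, while the ratio changes completely once a true zero-point is used.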

Attribute Types
Four main types of attributes: Ratio Attributes
• A ratio attribute is a numeric attribute with an inherent zero-point.
• Example: A number_of_words attribute is a ratio attribute.
– If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of
another value.
• The central tendency of a ratio attribute can be represented by its mode, its median
(middle value in an ordered sequence), and its mean.

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

• Nominal attribute: distinctness

• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Properties of Attribute Values
Attribute Type | Description | Examples
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female}
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length

Attribute Types
Categorical (Qualitative) and Numeric (Quantitative)
• Nominal and Ordinal attributes are collectively referred to as categorical or
qualitative attributes.
– Qualitative attributes, such as employee ID, lack most of the properties of numbers.
– Even if they are represented by numbers, i.e., integers, they should be treated more like
symbols.
– Mean of values does not have any meaning.

• Interval and Ratio are collectively referred to as quantitative or numeric attributes.

– Quantitative attributes are represented by numbers and have most of the properties of
numbers.
– Note that quantitative attributes can be integer-valued or continuous.
– Numeric operations such as mean and standard deviation are meaningful.

Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• zip codes, profession, or the set of words in a collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
– Binary attributes where only non-zero values are important are called asymmetric
binary attributes.

• Continuous Attribute
– Has real numbers as attribute values
• temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of
digits
– Continuous attributes are typically represented as floating-point variables

Types of data sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data:
– Video data:

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of
attributes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can
be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for
each object, and n columns, one for each attribute.
• A data matrix is a variation of record data, but because it consists of numeric
attributes, standard matrix operations can be applied to transform and manipulate the
data.

Projection of x Load | Projection of y load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
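
As a rough illustration of treating record data as an m by n matrix, the sketch below stores the two rows shown above as a list of lists and applies two simple matrix-style operations. Plain Python lists are used here as an assumption for brevity; a numerical library would be the usual choice in practice.

```python
# Each inner list is one data object (row); each position is one attribute (column).
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)        # number of objects (rows)
n = len(data_matrix[0])     # number of attributes (columns)

# Column-wise mean: one value per attribute.
column_means = [sum(row[j] for row in data_matrix) / m for j in range(n)]

# Transpose: rows become attributes, columns become objects.
transposed = [list(column) for column in zip(*data_matrix)]

print(m, n)
print([round(value, 2) for value in column_means])
print(transposed[0])
```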

Document (Text) Data
• Each document becomes a term vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the
document
• Convert text documents to record data by counting word frequencies (document-term
matrix).
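
A minimal sketch of building a document-term matrix by counting word frequencies. The two toy documents and the whitespace tokenizer are assumptions made for illustration; real text would also need punctuation handling and similar cleanup.

```python
from collections import Counter

# Two toy documents (hypothetical text, for illustration only).
docs = [
    "data mining finds patterns in data",
    "text mining counts words in text documents",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The sorted vocabulary fixes the column order of the matrix.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per term, each cell a term count.
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Each row is the term vector of one document, so the text collection can now be handled like any other record data.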

Transaction Data
• Transaction data is a special type of record data, where
– each record (transaction) involves a set of items.
– Example: The set of products purchased by a customer constitutes a transaction, while the
individual products that were purchased are the items.

Transaction Data
Convert to Record Data
• Transaction (item list) form: requires less space
• Record (binary matrix) form: asymmetric attributes, requires more space

• In real-world data, the table would contain hundreds or thousands of columns,
depending on the number of items to be considered.
• The number of items bought in a transaction, say 5, is very small in comparison
to the number of columns.
• Most values in this matrix are “0”. Such a matrix is called a sparse matrix.
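
The conversion from transactions to record data can be sketched as follows. The item names and baskets are made up for illustration; each item becomes one asymmetric binary column.

```python
# Hypothetical market-basket transactions; item names are made up.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# One column per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))

# Record form: 1 if the item is in the transaction, else 0.
binary_matrix = [
    [1 if item in basket else 0 for item in items] for basket in transactions
]

# Share of zero cells: a rough measure of how sparse the matrix is.
cells = len(transactions) * len(items)
zeros = sum(row.count(0) for row in binary_matrix)
sparsity = zeros / cells

print(items)
for row in binary_matrix:
    print(row)
print(f"sparsity = {sparsity:.2f}")
```

Even with only six items, close to half of the cells are zero; with thousands of items the share of zeros grows much larger.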

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Measuring Data Similarity and Dissimilarity

Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.

• For data preprocessing tasks, we want to learn about data characteristics regarding
both central tendency and dispersion of the data.
• Measures of central tendency include mean, median, mode, and midrange.
• Measures of data dispersion include quartiles, interquartile range (IQR), and
variance.

• These descriptive statistics are of great help in understanding the distribution of the
data.

Measuring Central Tendency: Mean
• The most common and most effective numerical measure of the “center” of a set of
data is the arithmetic mean.
Arithmetic Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Sometimes, each value $x_i$ in a set may be associated with a weight $w_i$.

– The weights reflect the significance and importance attached to their respective values.

Weighted Arithmetic Mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
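
A short sketch of both formulas in Python. The values and the weights are hypothetical; the last weight is set lower purely to show how a weight reduces the influence of one value.

```python
values = [2, 4, 4, 6, 8, 24]
weights = [1, 1, 1, 1, 1, 0.5]   # hypothetical weights; the extreme value counts half

# Arithmetic mean: sum of the values divided by their count.
mean = sum(values) / len(values)

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(mean, round(weighted_mean, 3))
```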

Measuring Central Tendency: Mean
• Although the mean is the single most useful quantity for describing a data set, it is not
always the best way of measuring the center of the data.
– A major problem with the mean is its sensitivity to extreme (outlier) values.
– Even a small number of extreme values can corrupt the mean.
• To offset the effect caused by a small number of extreme values, we can instead use
the trimmed mean,
• Trimmed mean can be obtained after chopping off values at the high and low
extremes.
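
One possible way to compute a trimmed mean is sketched below. The function name `trimmed_mean` and the choice of trimming a fixed proportion from each end are assumptions; the slides do not prescribe how many values to drop.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2, 4, 4, 6, 8, 24]

plain = sum(data) / len(data)       # pulled upward by the extreme value 24
trimmed = trimmed_mean(data, 0.2)   # drops 2 and 24, averages 4, 4, 6, 8

print(plain, trimmed)
```

Trimming too much discards useful information, so the proportion is usually kept small.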

Measuring Central Tendency: Median
• Another measure of the center of data is the median.
• Suppose that a given data set of N distinct values is sorted in numerical order.
– If N is odd, the median is the middle value of the ordered set;
– If N is even, the median is the average of the middle two values.

• In probability and statistics, the median generally applies to numeric data; however,
we may extend the concept to ordinal data.
– Suppose that a given data set of N values for an attribute X is sorted in increasing order.
– If N is odd, then the median is the middle value of the ordered set.
– If N is even, then the median is not unique.
• In this case, the median is the two middlemost values and any value in between.
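
The sketch below computes the median for numeric data and the two middlemost values for ordinal data, using Python's standard `statistics` module. The rating labels and their ordering are assumptions chosen for illustration.

```python
import statistics

# Numeric data: for even N the median is the average of the two middle values.
numeric = [1, 3, 5, 7]
numeric_median = statistics.median(numeric)

# Ordinal data: labels cannot be averaged, so work with ranks and report
# the two middlemost labels instead. The ordering below is an assumption.
order = ["poor", "fair", "good", "excellent"]
ratings = ["good", "poor", "excellent", "fair"]
ranks = [order.index(label) for label in ratings]

low_label = order[statistics.median_low(ranks)]
high_label = order[statistics.median_high(ranks)]

print(numeric_median, low_label, high_label)
```

Mapping labels to ranks first matters: sorting the labels directly would order them alphabetically rather than by their meaning.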

Measuring Central Tendency: Mode
• Another measure of central tendency is the mode.

• The mode for a set of data is the value that occurs most frequently in the set.
– It is possible for the greatest frequency to correspond to several different values, which
results in more than one mode.
– Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively.
– At the other extreme, if each data value occurs only once, then there is no mode.

• Central Tendency Measures for Numerical Attributes: Mean, Median, Mode

• Central Tendency Measures for Categorical Attributes: Mode (Median?)

– Central Tendency Measures for Nominal Attributes: Mode
– Central Tendency Measures for Ordinal Attributes: Mode, Median
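
A sketch of finding all modes of a data set, including the multimodal and no-mode cases described above. The helper name `modes` is hypothetical, and returning an empty list when every value occurs once follows the convention stated in these slides.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if every value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                    # no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == top]

print(modes([1, 1, 2, 2, 3]))                # two modes (bimodal)
print(modes([5, 6, 7]))                      # no mode
print(modes(["red", "blue", "red"]))         # works for nominal values too
```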

Measuring Central Tendency: Mean, Median, Mode
[Figure: median, mean and mode of symmetric, positively skewed, and negatively skewed data]

Measuring Central Tendency: Example
What are the central tendency measures (mean, median, mode) for the following attributes?
attr1 = {2,4,4,6,8,24}
mean = (2+4+4+6+8+24)/6 = 8 (average of all values)
median = (4+6)/2 = 5 (average of the two middle values)
mode = 4 (most frequent value)
attr2 = {2,4,7,10,12}
mean = (2+4+7+10+12)/5 = 7 (average of all values)
median = 7 (middle value)
mode = none (every value occurs exactly once)
attr3 = {xs,s,s,s,m,m,l}
mean is meaningless for categorical attributes.
median = s (middle value)
mode = s (most frequent value)

Measuring Dispersion of Data
• The degree to which numerical data tend to spread is called the dispersion, or
variance of the data.

The most common measures of data dispersion:

• Range: Difference between the largest and smallest values.
• Interquartile Range (IQR): range of middle 50%
– quartiles: Q1 (25th percentile), Q3 (75th percentile) IQR=Q3-Q1
– five number summary: Minimum, Q1, Median, Q3, Maximum
• Variance and Standard Deviation: (sample: s, population: σ)
– variance of n observations, where μ is the mean value of the observations:
population: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$

– standard deviation σ (s) is the square root of variance σ² (s²)
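
A direct translation of the two variance formulas into Python, using a small data set chosen for illustration. The only difference between the population and sample versions is the divisor, n versus n - 1.

```python
import math

data = [2, 3, 4, 5, 6, 7, 8, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n         # divide by n
sample_variance = squared_deviations / (n - 1)       # divide by n - 1

population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(round(population_std, 3), round(sample_std, 3))
```

The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities.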

Measuring Dispersion of Data: Quartiles
• Suppose that set of observations for numeric attribute X is sorted in increasing order.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
– The kth q-quantile for a given data distribution is the value x such that at most k/q of the
data values are less than x and at most (q-k)/q of the data values are more than x, where k
is an integer such that 0<k<q. There are q-1 q-quantiles.
– The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets.
• Quartiles: The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution.
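
The sketch below uses `statistics.quantiles` from the Python standard library (available since Python 3.8) to show that there are q - 1 cut points for q-quantiles. The data set is hypothetical, and the `"inclusive"` interpolation method is only one of several conventions, so the values may differ slightly from hand calculations that use another rule.

```python
import statistics

data = list(range(1, 12))   # hypothetical sorted observations 1..11

# q-quantiles are the q - 1 cut points that split the data into q parts.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
deciles = statistics.quantiles(data, n=10, method="inclusive")

print(len(quartiles), len(deciles))
print(quartiles)            # Q1, Q2 (the median), Q3
```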

Measuring Dispersion of Data: Outliers
• Outliers can be identified with the help of the interquartile range or the standard
deviation.
– Suspected outliers are values falling at least 1.5 × IQR above the third quartile or below the
first quartile.
– Suspected outliers are values that fall outside the range from μ–Nσ to μ+Nσ, where μ is the
mean and σ is the standard deviation. N can be chosen as 2.5.

• The normal distribution curve: (μ: mean, σ: standard deviation)

– From μ–σ to μ+σ: contains about 68% of the measurements
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
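
A sketch of the standard deviation rule with N = 2.5, followed by a check of the normal-curve percentages using `statistics.NormalDist`. The measurements are hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the slides do not say which version to use.

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 50]   # hypothetical measurements

mu = statistics.mean(data)
sigma = statistics.pstdev(data)      # population standard deviation
n_sigmas = 2.5                       # the multiplier N suggested above

# Values outside [mu - N*sigma, mu + N*sigma] are suspected outliers.
lower, upper = mu - n_sigmas * sigma, mu + n_sigmas * sigma
sigma_outliers = [x for x in data if x < lower or x > upper]
print(sigma_outliers)

# Share of a normal distribution within k standard deviations of the mean.
standard_normal = statistics.NormalDist()
coverage = [standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)]
print([round(c, 3) for c in coverage])
```

Note that an extreme value inflates σ itself, so this rule can miss outliers that the IQR rule would flag.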

Measuring Dispersion of Data: Boxplot Analysis
• Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
• Boxplots are a popular way of visualizing a distribution and a boxplot incorporates
five-number summary:
– The ends of the box are at the quartiles Q1 and Q3, so that the box length is the
interquartile range, IQR.
– The median is marked by a line within the box. (This is the median of all values, which always lies between Q1 and Q3.)
– Two lines outside the box extend to the smallest and largest observations (outliers are
excluded). Outliers are marked separately.
• If there are no outliers, lower extreme line is the smallest observation (Minimum) and upper
extreme line is the largest observation (Maximum).

[Figure: boxplot with outliers marked separately beyond both whiskers]

Measuring Dispersion of Data: Example
Consider the following two attributes:
attr1: {2,3,4,5,6,7,8,9} attr2: {1,5,9,10,11,12,18,30}

Which attribute has the largest standard deviation? Do not compute the standard deviations.
attr2

Give the interquartile ranges of the attributes.

attr1: Q1: (3+4)/2 = 3.5, Q3: (7+8)/2 = 7.5, IQR: 7.5 - 3.5 = 4
attr2: Q1: (5+9)/2 = 7, Q3: (12+18)/2 = 15, IQR: 15 - 7 = 8

Are there any outliers (with respect to IQR) in these data sets?

Yes: 30 in attr2, since 30 > 15 + 1.5*IQR = 27.

Give a 4-element data set whose standard deviation is zero. {1,1,1,1}
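
The quartiles in the answers above are the medians of the lower and upper halves of the sorted data. The sketch below reproduces them under that convention; the function names are hypothetical, and leaving the middle value out of both halves for odd N is one common choice among several.

```python
import statistics

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2          # for odd N the middle value is skipped
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

def iqr_outliers(values):
    """Values at least 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]

attr1 = [2, 3, 4, 5, 6, 7, 8, 9]
attr2 = [1, 5, 9, 10, 11, 12, 18, 30]

for name, values in (("attr1", attr1), ("attr2", attr2)):
    q1, q3 = quartiles_by_halves(values)
    print(name, q1, q3, q3 - q1, iqr_outliers(values))
```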

Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary

• Bar Chart: compare data across different categories.

• Histogram: x-axis are values, y-axis represent frequencies

• Quantile plot: each value xi is paired with fi indicating that approximately 100 fi %
of data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution
against the corresponding quantiles of another.

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
plane.

Boxplot: Example in R
Boxplot: graphic display of five-number summary

Bar Chart: Example in R

• A bar chart represents data in rectangular bars with length of the bar
proportional to the value of the variable.

Histogram: Example in R

• A histogram represents the frequencies of values of a variable
bucketed into ranges.
• A histogram is similar to a bar chart, but
the difference is that it groups the values
into continuous ranges.
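
The bucketing step behind a histogram can be sketched without any plotting library. The values and the bucket width of 5 are assumptions for illustration; the text output stands in for the bars.

```python
from collections import Counter

values = [1, 2, 2, 3, 7, 8, 9, 12, 14, 15, 15, 19]   # hypothetical data
bin_width = 5                                        # assumed bucket width

# Map each value to the lower edge of its bucket: [0,5), [5,10), [10,15), ...
counts = Counter((v // bin_width) * bin_width for v in values)

# Print one text "bar" per bucket, in increasing order of the bucket edge.
for lower in sorted(counts):
    print(f"[{lower:>2}, {lower + bin_width:>2}): {'#' * counts[lower]}")
```

Changing `bin_width` changes the shape of the histogram, which is why the bucket size is usually reported along with the plot.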

Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same
boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data values xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi
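
A sketch of computing the points of a quantile plot. The slides do not give a formula for fi; the sketch assumes the common convention fi = (i - 0.5)/N for the i-th smallest of N values, and the data are hypothetical.

```python
data = [74, 40, 117, 43, 78, 47, 120, 75, 115]   # hypothetical values

ordered = sorted(data)
n = len(ordered)

# Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention).
quantile_points = [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_points:
    print(f"{f:.3f}  {x}")
```

Plotting f on the horizontal axis against x on the vertical axis gives the quantile plot.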

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another
• A Q-Q plot plots the quantiles of one variable against the corresponding quantiles of
another, which lets us check whether the two distributions are similar, for example
whether one is shifted in location relative to the other.

[Figure: Q-Q plot of unit price at two branches] The straight line represents the
case where, for each given quantile, the unit price at each branch is the same.

Scatter Plot
• A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
– To construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.

• The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation relationships.

• Two attributes, X and Y, are correlated if the values of one tend to change systematically with the values of the other.
• Correlations can be positive, negative, or null (uncorrelated).
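
One common way to put a number on the three cases is the Pearson correlation coefficient, which the slides do not name; it is used here as an assumption. The sketch computes it from scratch on small made-up data sets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson(x, [2, 4, 6, 8, 10])   # y rises with x: close to +1
r_negative = pearson(x, [10, 8, 6, 4, 2])   # y falls as x rises: close to -1
r_null = pearson(x, [3, 1, 4, 1, 3])        # no linear trend: close to 0

print(round(r_positive, 3), round(r_negative, 3), round(r_null, 3))
```

The coefficient only measures linear association, so a scatter plot remains useful for spotting patterns that it would miss.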

Scatter Plot: Example in R

• Scatterplots show many points plotted in the Cartesian plane.
Each point represents the values of two variables.
• One variable is chosen for the horizontal axis and another for
the vertical axis.

Scatter Plot: Positively and Negatively Correlated Data
[Figure: scatter plots of positively correlated, negatively correlated, and uncorrelated data]

Scatter Plot: Positively and Negatively Correlated Data
• The left half of the plot is positively correlated.
• The right half of the plot is negatively correlated.
