DWDM Unit-2

The document discusses data objects and attributes, detailing qualitative and quantitative types, including nominal, ordinal, binary, discrete, and continuous attributes. It also covers statistical representation of data, including measures of central tendency and dispersion, as well as various graphical representation techniques for data visualization. Additionally, it explains data preprocessing techniques and the importance of data quality, including data cleaning, integration, and transformation.


DWDM

UNIT-2

1) Write about data objects and attributes? What are the different types of attributes?

ANS) Type of attributes :


Distinguishing between attribute types is the first step of data preprocessing: we identify the type of each attribute and then preprocess the data accordingly. Attributes fall into two broad groups:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Discrete, Continuous)

Qualitative Attributes
1. Nominal Attributes (related to names): The values of a nominal attribute are names of things or symbols. Each value represents some category or state, which is why nominal attributes are also referred to as categorical attributes. There is no order (rank or position) among the values of a nominal attribute (e.g., eye color, zip codes).
2. Binary Attributes: A binary attribute has only two values or states, for example yes/no, affected/unaffected, true/false.
i) Symmetric: both values are equally important (e.g., gender).
ii) Asymmetric: the two values are not equally important (e.g., a test result, where one outcome matters more).
3. Ordinal Attributes: An ordinal attribute has values with a meaningful sequence or ranking (order) between them, but the magnitude of the difference between successive values is not known. The order shows what is more important, but not how much more important it is (e.g., {good, better, best}, grades).

Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types: interval and ratio.
i) An interval-scaled attribute has values whose differences are interpretable, but there is no true reference point (zero point). Interval data can be added and subtracted, but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature reading is twice that of another day, we cannot say that the first day is twice as hot.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute differences between values, and the mean, median, mode, quantile range, and five-number summary can all be given.
2. Discrete: A discrete attribute has a finite or countably infinite set of values, which may be numeric or categorical.
3. Continuous: A continuous attribute has infinitely many possible states and is typically represented as a floating-point value; there can be many values between 2 and 3.
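The interval/ratio distinction can be checked numerically. A minimal Python sketch (the temperature values are illustrative, not from the notes): ratios of Celsius readings are misleading because the zero point is arbitrary, while Kelvin has a true zero.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
# Ratio scale: Kelvin has a true zero point, so ratios are meaningful.
c1, c2 = 10.0, 20.0          # two daily temperatures in Celsius (illustrative)
naive_ratio = c2 / c1        # 2.0, but the second day is NOT "twice as hot"

k1, k2 = c1 + 273.15, c2 + 273.15   # the same temperatures in Kelvin
true_ratio = k2 / k1                # only about a 3.5% increase in absolute terms

print(naive_ratio, round(true_ratio, 3))
```

This is why multiplication and division are listed as valid operations only for ratio attributes.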
The following properties (operations) of numbers are typically used to describe attributes:
1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: * and /

Different attribute types:

Nominal (operations: =, ≠)
  Description: The values of a nominal attribute are just different names; they provide only enough information to distinguish one object from another.
  Examples: zip codes, employee ID numbers, eye color, gender.
  Statistical operations: mode, entropy, contingency correlation, chi-square test.

Ordinal (operations: <, >)
  Description: The values of an ordinal attribute provide enough information to order objects.
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Statistical operations: median, percentiles, rank correlation, run tests, sign tests.

Interval (operations: +, −)
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Statistical operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio (operations: *, /)
  Description: For ratio attributes, both differences and ratios are meaningful.
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
  Statistical operations: geometric mean, harmonic mean, percent variation.

2) Explain in detail the statistical representation of data with examples.

ANS) Statistical Descriptions of Data

• Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.

• We start with measures of central tendency

• The most common data dispersion measures are the range, quartiles, and
interquartile range; the five-number summary and boxplots; and the variance
and standard deviation of the data

• Measuring the Central Tendency: Mean, Median, and Mode

Mean: for a set of N observed values x1, x2, ..., xN, the mean is

  mean = (x1 + x2 + ... + xN) / N

Median: suppose the data from the example are already sorted in increasing order and there is an even number of observations (i.e., 12). The median is then not unique: it can be any value within the two middlemost values of 52 and 56 (that is, within the sixth and seventh values in the list). By convention, we assign the average of the two middlemost values as the median:

  median = (52 + 56) / 2 = 108 / 2 = 54

Thus, the median is $54,000.

Mode: the value in the data set that occurs most often.

Range: the difference between the largest (max) and smallest (min) values in the data set; it describes how well the central tendency represents the data.

Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.

The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles. The 100-quantiles, more commonly referred to as percentiles, divide the data distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms of quantiles.

The first quartile, denoted Q1, is the 25th percentile; it cuts off the lowest 25% of the data. The third quartile, denoted Q3, is the 75th percentile; it cuts off the lowest 75% (or highest 25%) of the data. The second quartile (Q2) is the 50th percentile; as the median, it gives the center of the data distribution.

The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.

Five-Number Summary, Boxplots, and Outliers


Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by also providing the lowest and highest data values. This is known as the five-number summary.

The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum.

Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the

five-number summary as follows:

• Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.

• The median is marked by a line within the box.

• Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
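The five-number summary a boxplot visualizes can be sketched in a few lines of Python; the input values are illustrative, not from the notes.

```python
import statistics

def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) for a list of numbers."""
    s = sorted(values)
    # quantiles(n=4) yields Q1, median, Q3 via linear interpolation.
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return (s[0], q1, med, q3, s[-1])

summary = five_number_summary([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(summary)   # order: Minimum, Q1, Median, Q3, Maximum
```

The box would span Q1 to Q3, the median line would sit at the middle value, and the whiskers would reach the minimum and maximum.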
3) Write the types of graphical representations of data and data visualization
techniques.

ANS) Graphic Displays of Basic Statistical descriptions of Data

• These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful for the visual inspection of data, which is useful for data preprocessing.
• Quantile Plot: A quantile plot is a simple and effective way to get a first look at a univariate data distribution. First, it displays all of the data for the given attribute, allowing the user to assess both the overall behavior and any unusual occurrences. Second, it plots quantile information.
Quantile–Quantile Plot

A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.
• Histograms

• Histograms (or frequency histograms) are at least a century old and are widely used. "Histos" means pole or mast, and "gram" means chart, so a histogram is a chart of poles. Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
• If X is numeric, the term histogram is preferred. The range of values for X is
partitioned into disjoint consecutive sub-ranges.

• The sub-ranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.

• The range of a bucket is known as the width. Typically, the buckets are of
equal width.
Although histograms are widely used, they may not be as effective as the quantile
plot, q-q plot, and boxplot methods in comparing groups of uni-variate observations.
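Equal-width bucketing, the core of histogram construction, can be sketched as follows (the data values and bucket count are illustrative assumptions):

```python
def histogram_counts(values, num_buckets):
    """Count how many values fall into each of num_buckets equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the maximum value into the last bucket instead of overflowing.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 7, 8, 9, 9]
print(histogram_counts(data, 4))
```

Each count would be drawn as the height of one "pole" in the chart.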

Scatter Plots and Data Correlation

A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes. To
construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.

The scatter plot is a useful method for providing a first look at bivariate data
to see clusters of points and outliers, or to explore the possibility of correlation
relationships. Two attributes, X, and Y, are correlated if one attribute implies the
other. Correlations can be positive, negative, or null (uncorrelated).
Data Visualization
• Data visualization aims to communicate data clearly and effectively through
graphical representation

1) Pixel-Oriented Visualization Techniques:

A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value. For a data set of m dimensions,
pixel-oriented techniques create m windows on the screen, one for each dimension.

2) Geometric Projection Visualization Techniques

A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space. Geometric projection techniques help users find interesting projections of multidimensional data sets. The central challenge these techniques try to address is how to visualize a high-dimensional space on a 2-D display.

3) Icon-Based Visualization Techniques

Icon-based visualization techniques use small icons to represent multidimensional data values. Two popular icon-based techniques are Chernoff faces and stick figures. Chernoff faces were introduced in 1973 by statistician Herman Chernoff; they display multidimensional data of up to 18 variables (or dimensions) as a cartoon human face.

4) Hierarchical Visualization Techniques

Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner.
4) How to measure data similarity and dissimilarity between different types of attributes? Explain.

ANS) Measures of Similarity and Dissimilarity

• Similarity and dissimilarity are important because they are used by a


number of data mining techniques, such as clustering, nearest neighbor
classification, and anomaly detection

• Informally, the similarity between two objects is a numerical measure of the


degree to which the two objects are alike. Consequently, similarities are
higher for pairs of objects that are more alike. Similarities are
usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).

The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects.

• Similarity and Dissimilarity between Simple Attributes

• Dissimilarities between Data Objects(multiple attributes)

Distances

• We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the familiar formula

  d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )

where n is the number of dimensions and xk and yk are, respectively, the kth attributes (components) of x and y.

• The Euclidean distance measure given in the above equation is generalized by the Minkowski distance metric:

  d(x, y) = ( |x1 − y1|^r + |x2 − y2|^r + ... + |xn − yn|^r )^(1/r)

where r is a parameter. The following are the three most common examples of Minkowski distances.

• r = 1. City block (Manhattan) distance (L1). A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.

• r = 2. Euclidean distance (L2).

• r = ∞. Supremum distance (L∞). This is the maximum difference between any attribute of the objects: d(x, y) = max over k of |xk − yk|.
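The three Minkowski distances can be sketched in a few lines of Python; the two points used below are illustrative.

```python
def minkowski(x, y, r):
    """Minkowski distance between two equal-length vectors; r is the order."""
    if r == float("inf"):                      # supremum (L-infinity) distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 2), (5, 1)
print(minkowski(x, y, 1))             # city block (L1) distance
print(minkowski(x, y, 2))             # Euclidean (L2) distance
print(minkowski(x, y, float("inf")))  # supremum distance
```

Setting r = 1, 2, or infinity recovers the three special cases listed above.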

Distances, such as the Euclidean distance, have some well-known properties.


If d(x, y) is the distance between two points x and y, then the following properties hold.

1. Positivity
(a) d(x, y) ≥ 0 for all x and y;
(b) d(x, y) = 0 only if x = y.

Ex: Find the dissimilarity among the points p1(0, 2), p2(2, 0), p3(3, 1), p4(5, 1).

Euclidean distance matrix:

      p1    p2    p3    p4
p1   0.0   2.8   3.2   5.1
p2   2.8   0.0   1.4   3.2
p3   3.2   1.4   0.0   2.0
p4   5.1   3.2   2.0   0.0
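The distance matrix above can be reproduced with a short Python sketch using the four points from the example:

```python
def euclidean(p, q):
    """Euclidean (L2) distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Symmetric matrix of pairwise distances, rounded to one decimal place
# to match the matrix in the notes.
matrix = {a: {b: round(euclidean(pa, pb), 1) for b, pb in points.items()}
          for a, pa in points.items()}

for name, row in matrix.items():
    print(name, row)
```

The diagonal is all zeros (positivity property (b)), and the matrix is symmetric.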

Questions 5, 6, 7, 8: see notes.

9) What is the data preprocessing? Explain about data preprocessing techniques in


detail.

ANS) Data Preprocessing

Why preprocess the data? The main preprocessing tasks are data cleaning, data integration, data reduction, data transformation, and data discretization.

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
• Data Preprocessing is a broad area and consists of a number of different
strategies and techniques that are interrelated in complex ways.

• Data Cleaning
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• variable Transformation
1) Data Cleaning

i) Missing Values:
• Ignore the tuple: usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values.
• Fill in the missing value manually: in general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
• Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.
• Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
• Use the attribute mean or median for all samples belonging to the same class as the given tuple.
• Use the most probable value to fill in the missing value.
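Filling missing values with a measure of central tendency can be sketched as follows; the ages list and its None placeholders are illustrative assumptions, not data from the notes.

```python
import statistics

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None, 40]   # None marks a missing value
print(fill_missing_with_mean(ages))
```

Swapping statistics.mean for statistics.median gives the median-based variant, which is more robust to outliers.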

ii)Noisy Data

Noise is a random error or variance in a measured variable

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
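Smoothing by bin means can be sketched as below; the nine sorted prices and the bin size of 3 are illustrative choices, not data from the notes.

```python
def smooth_by_bin_means(sorted_values, bin_size):
    """Equal-frequency binning: replace each value by the mean of its bin."""
    smoothed = []
    for i in range(0, len(sorted_values), bin_size):
        bin_ = sorted_values[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
print(smooth_by_bin_means(prices, 3))
```

Each bin of three neighbors is replaced by its mean, reducing random variation while preserving the overall trend.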

2) Aggregation
Sometimes "less is more" and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various store
locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
One way to aggregate transactions for this data set is to replace all the
transactions of a single store with a single storewide transaction. This reduces the
hundreds or thousands of transactions that occur daily at a specific store to a single
daily transaction, and the number of data objects is reduced to the number of stores.
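The storewide aggregation described above can be sketched as a grouped sum. The store names follow the text; the transaction amounts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical daily sales transactions: (store, amount).
transactions = [
    ("Minneapolis", 120.0), ("Chicago", 80.0), ("Minneapolis", 45.5),
    ("Paris", 200.0), ("Chicago", 60.0), ("Minneapolis", 30.0),
]

# Aggregate: one storewide total per store replaces the individual transactions,
# reducing the number of data objects to the number of stores.
totals = defaultdict(float)
for store, amount in transactions:
    totals[store] += amount

print(dict(totals))
```

Six transactions collapse into three aggregated objects, one per store.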
3) Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. In statistics, it has long been used for both the preliminary investigation of the data and the final data analysis. Sampling can also be very useful in data mining.
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set if the sample is representative.
Sampling Approaches
Simple random sampling: there is an equal probability of selecting any particular item. There are several variations on random sampling (and other sampling techniques as well):
i) sampling without replacement
ii) sampling with replacement
iii) stratified sampling, which draws objects from each predefined group (stratum) of the data
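Sampling with and without replacement can be sketched with Python's random module; the population size, sample size, and seed are arbitrary choices for illustration.

```python
import random

random.seed(42)                      # fixed seed so the sketch is reproducible
population = list(range(100))

# Sampling WITHOUT replacement: each item can be picked at most once.
without = random.sample(population, 10)

# Sampling WITH replacement: the same item may be picked more than once.
with_repl = random.choices(population, k=10)

print(without)
print(with_repl)
```

random.sample guarantees distinct picks, while random.choices draws independently each time, so duplicates are possible.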
4) Dimensionality Reduction
A key benefit is that many data mining algorithms work better if the dimensionality (the number of attributes in the data) is lower. This is partly because dimensionality reduction can eliminate irrelevant features and reduce noise, and partly because of the curse of dimensionality.
5)Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the features. While it might seem that such an approach would lose information, this is not the case if redundant and irrelevant features are present. Redundant features duplicate much or all of the information contained in one or more other attributes; for example, the purchase price of a product and the amount of sales tax paid contain much of the same information. Irrelevant features contain almost no useful information for the data mining task at hand.
There are three standard approaches to feature selection: embedded, filter, and wrapper.

Flowchart of a feature subset selection process


6) Feature Creation
It is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively. Furthermore, the number of new attributes can be smaller than the number of original attributes.
Three related methodologies for creating new attributes are: feature extraction, mapping the data to a new space, and feature construction.
7) Discretization and Binarization
It is often necessary to transform a continuous attribute into a categorical
attribute (discretization), and both continuous and discrete attributes may need to
be transformed into one or more binary attributes (binarization). Additionally, if a
categorical attribute has a large number of values (categories), or some values occur
infrequently, then it may be beneficial for certain data mining tasks to reduce the
number of categories by combining some of the values.
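Discretization and binarization can be sketched as below; the cut points, category labels, and one-hot encoding scheme are illustrative assumptions, not prescribed by the notes.

```python
def discretize(value, cut_points, labels):
    """Map a continuous value to a category using sorted cut points."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

def binarize(label, categories):
    """One-hot encode a categorical label as a list of 0/1 binary attributes."""
    return [1 if label == c else 0 for c in categories]

labels = ["low", "medium", "high"]
age_category = discretize(34, [30, 60], labels)   # cut points chosen for illustration
print(age_category, binarize(age_category, labels))
```

A continuous value is first mapped to one of a small number of categories, and the category is then expanded into binary attributes for algorithms that require them.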
8) Variable Transformation:
A variable transformation is a transformation that is applied to all the values of a variable. Two important types of variable transformations are simple functional transformations and normalization.
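Min-max normalization, one common form of normalization, rescales every value of a variable linearly into a new range. A minimal sketch (the input values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]:
    v' = new_min + (v - min) * (new_max - new_min) / (max - min)."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

print(min_max_normalize([10, 20, 30, 40, 50]))
```

Because the transformation is applied to all values of the variable, the relative ordering and spacing of the data are preserved.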

10) Explain the types of data cleaning methods and how to overcome noisy data?

ANS) Covered in the answer to question 9 (data cleaning: missing values and noisy data).

11) Write about aggregation. Explain the different types of aggregation techniques in detail.

ANS) Aggregation
Sometimes "less is more" and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various store
locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
One way to aggregate transactions for this data set is to replace all the
transactions of a single store with a single storewide transaction. This reduces the
hundreds or thousands of transactions that occur daily at a specific store to a single
daily transaction, and the number of data objects is reduced to the number of stores.
There are several motivations for aggregation. First, the smaller data sets
resulting from data reduction require less memory and processing time, and hence,
aggregation may permit the use of more expensive data mining algorithms. Second,
aggregation can act as a change of scope or scale by providing a high-level view of
the data instead of a low-level view.
Types of Aggregation Techniques:
1) Summation (Sum)
Summation is the simplest form of aggregation, where all the values in a dataset are
added together to get a total.
Example: Summing up the sales figures for each day to get the total sales for a month.
2) Averaging (Mean)
Averaging calculates the central tendency of a dataset by summing all the values
and then dividing by the number of values. The mean provides a general idea of the
dataset's typical value.
Example: Calculating the average score of students in a class.
3) Counting
Counting involves determining the number of occurrences of a specific value or the
total number of records in a dataset.
Example: Counting the number of customers who made a purchase in a store.
4) Median
The median is the middle value in a dataset when the values are sorted in
ascending or descending order. It’s a robust measure of central tendency, especially
useful when dealing with outliers.
Example: Finding the median income of a population to understand the typical
income level.
5) Mode
Mode is the value that appears most frequently in a dataset. It’s useful for
categorical data where you want to know the most common category.
Example: Determining the most common age group among participants in a survey.
6) Maximum and Minimum (Max/Min)
These aggregations find the highest and lowest values in a dataset, respectively.
They are used to understand the range of data.
Example: Identifying the highest and lowest temperatures recorded in a year.
7) Range
The range is the difference between the maximum and minimum values in a
dataset. It provides insight into the spread or variability of the data.
Example: The range of prices for a particular product across different stores.
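All of the aggregation measures above can be computed with Python's standard library; the daily_sales values are illustrative, not from the notes.

```python
import statistics

daily_sales = [120, 80, 80, 150, 200, 60, 110]   # illustrative daily totals

print("sum:    ", sum(daily_sales))                          # summation
print("mean:   ", statistics.mean(daily_sales))              # averaging
print("count:  ", len(daily_sales))                          # counting
print("median: ", statistics.median(daily_sales))            # middle value
print("mode:   ", statistics.mode(daily_sales))              # most frequent value
print("max/min:", max(daily_sales), min(daily_sales))        # extremes
print("range:  ", max(daily_sales) - min(daily_sales))       # spread
```

Each line corresponds to one of the seven aggregation techniques listed above.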

12) Explain in detail about data reduction ,data discretization, and binarization.
ANS) in notes..
