DWDM Unit-2
UNIT-2
1) Write about data objects and attributes? What are the different types of attributes?
Qualitative Attributes
1. Nominal Attributes – related to names: The values of a nominal attribute are
names of things or symbols. Each value represents a category or state, which is
why nominal attributes are also called categorical attributes. There is no order
(rank, position) among the values of a nominal attribute.
Example :
2. Binary Attributes: A binary attribute has only two values or states, for example
yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: The two values are not equally important (Result).
3. Ordinal Attributes: An ordinal attribute has values with a meaningful order or
ranking among them, but the magnitude of the difference between successive values
is not known. The order shows which value ranks higher, but not by how much.
Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable
quantity, represented by integer or real values. Numeric attributes are of two
types: interval-scaled and ratio-scaled.
i) An interval-scaled attribute has values whose differences are interpretable,
but the scale has no true reference point (zero point). Values on an interval
scale can be added and subtracted but cannot be meaningfully multiplied or
divided. Consider temperature in degrees Centigrade: if the temperature on one
day is twice the temperature on another day, we cannot say that the first day is
twice as hot as the other.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio)
of another value. The values are ordered, and we can also compute the difference
between values, as well as the mean, median, mode, interquartile range, and
five-number summary.
2. Discrete: A discrete attribute has a finite or countably infinite set of values.
The values can be numeric or categorical.
Example
4. Multiplication: * and /
• Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
• The most common data dispersion measures are the range, quartiles, and
interquartile range; the five-number summary and boxplots; and the variance
and standard deviation of the data.
Therefore, the median is not unique. It can be any value between the two middlemost
values of 52 and 56 (that is, between the sixth and seventh values in the list).
By convention, we assign the average of the two middlemost values as the median;
that is, $\frac{52 + 56}{2} = \frac{108}{2} = 54$.
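The averaging convention can be checked with a short Python sketch. The list below is an assumed example of 12 values whose sixth and seventh entries are 52 and 56; the notes do not reproduce the original data, so treat it as illustrative only.

```python
from statistics import median

# Assumed data: 12 values whose sixth and seventh entries are 52 and 56.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

ordered = sorted(values)
n = len(ordered)

# With an even number of values there are two middlemost values.
lower_middle = ordered[n // 2 - 1]  # sixth value
upper_middle = ordered[n // 2]      # seventh value

manual_median = (lower_middle + upper_middle) / 2
library_median = median(values)     # applies the same averaging convention

print(manual_median, library_median)
```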
The range of the set is the difference between the largest (max()) and smallest
(min()) values.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles. The 100-quantiles are more commonly referred
to as percentiles; they divide the data distribution into 100 equal-sized consecutive
sets. The median, quartiles, and percentiles are the most widely used forms of
quantiles.
(Figure: a data distribution with the quartiles Q1 (25th percentile), Q2 (median), and Q3 (75th percentile) marked.)
The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of
the data. The third quartile, denoted by Q3, is the 75th percentile; it cuts off the
lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile.
As the median, it gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as IQR = Q3 - Q1.
• Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
• Two lines (called whiskers) outside the box extend to the smallest (Minimum)
and largest (Maximum) observations.
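The dispersion measures described above (range, quartiles, interquartile range, and the five-number summary that a boxplot draws) can be computed as in the following sketch. The data are hypothetical, and the quartile rule used here (`method="inclusive"` in Python's `statistics.quantiles`) is only one of several conventions, so Q1 and Q3 may differ slightly from values obtained by hand.

```python
from statistics import quantiles

# Hypothetical sample; not taken from the notes.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

data_range = max(values) - min(values)  # largest minus smallest value
iqr = q3 - q1                           # spread of the middle half of the data
five_number_summary = (min(values), q1, q2, q3, max(values))

print("range:", data_range)
print("IQR:", iqr)
print("five-number summary:", five_number_summary)
```

In a boxplot, the box would span Q1 to Q3 and the whiskers would reach the first and last entries of the five-number summary.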
3) Write the types of graphical representations of data and data visualization
techniques.
Quantile–Quantile Plot
• Histograms (or frequency histograms) are at least a century old and are
widely used. “Histos” means pole or mast, and “gram” means chart, so a
histogram is a chart of poles. Plotting histograms is a graphical method for
summarizing the distribution of a given attribute, X.
• If X is numeric, the term histogram is preferred. The range of values for X is
partitioned into disjoint consecutive sub-ranges.
• The sub-ranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width. Typically, the buckets are of
equal width.
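A minimal sketch of equal-width bucketing follows. The function name `equal_width_histogram` and the sample values are assumptions made for illustration, and the code assumes the values are not all identical (otherwise the bucket width would be zero).

```python
def equal_width_histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # every bucket covers the same sub-range
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bucket instead of a new one.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical values of a numeric attribute X.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 20, 25, 28, 30]
edges, counts = equal_width_histogram(prices, num_bins=3)

for i, count in enumerate(counts):
    print(f"bucket {i + 1}: {edges[i]:.1f} to {edges[i + 1]:.1f} -> {'#' * count}")
```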
Although histograms are widely used, they may not be as effective as the quantile
plot, q-q plot, and boxplot methods in comparing groups of univariate observations.
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes. To
construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.
The scatter plot is a useful method for providing a first look at bivariate data
to see clusters of points and outliers, or to explore the possibility of correlation
relationships. Two attributes, X and Y, are correlated if one attribute implies the
other. Correlations can be positive, negative, or null (uncorrelated).
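One common way to put a number on what a scatter plot shows is the Pearson correlation coefficient. The notes do not name a specific measure, so the sketch below is an assumption: it captures linear relationships only, and it assumes neither attribute is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attribute values.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # increases with x: positive correlation
falling = [10, 8, 6, 4, 2]    # decreases as x increases: negative correlation
symmetric = [1, 2, 3, 2, 1]   # no linear trend: uncorrelated

print(pearson(x, rising), pearson(x, falling), pearson(x, symmetric))
```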
Data Visualization
• Data visualization aims to communicate data clearly and effectively through
graphical representation.
A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value. For a data set of m dimensions,
pixel-oriented techniques create m windows on the screen, one for each dimension.
The dissimilarity between two objects is a numerical measure of the degree to which
the two objects are different. Dissimilarities are lower for more similar
pairs of objects.
Distances
• We first present some examples, and then offer a more formal description of
distances in terms of the properties common to all distances. The Euclidean
distance, d, between two points, x and y, in one-, two-, three-, or
higher-dimensional space, is given by the following familiar formula:
$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$
where n is the number of dimensions and x_k and y_k are, respectively, the kth
attributes (components) of x and y.
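The Euclidean distance is generalized by the Minkowski distance:
$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$$
where r is a parameter. The following are the three most common examples
of Minkowski distances.
• r = 1. City block (Manhattan, L1) distance
• r = 2. Euclidean distance (L2)
• r = ∞. Supremum (Lmax, L∞) distance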
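The distance formulas can be tried out with the following sketch. The function name `minkowski` and the two sample points are assumptions made for illustration. Setting r = 2 gives the Euclidean distance; r = 1 and r = infinity are included as the other commonly used settings of the parameter.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance between two equal-length points for a parameter r >= 1."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit case: the largest difference in any single dimension.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


# Hypothetical two-dimensional points.
p1, p2 = (0, 2), (3, 6)

print(minkowski(p1, p2, 1))         # city block distance
print(minkowski(p1, p2, 2))         # Euclidean distance
print(minkowski(p1, p2, math.inf))  # supremum distance
```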
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
Questions 5, 6, 7, and 8: in notes.
Data have quality if they satisfy the requirements of the intended use. There are
many factors comprising data quality, including accuracy, completeness, consistency,
timeliness, believability, and interpretability.
• Data Preprocessing is a broad area and consists of a number of different
strategies and techniques that are interrelated in complex ways.
• Data Cleaning
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Variable Transformation
1) Data Cleaning
i) Missing Values:
Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values.
Fill in the missing value manually: In general, this approach is time consuming and
may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or -∞.
Use a measure of central tendency for the attribute (e.g., the mean or median) to fill
in the missing value.
Use the attribute mean or median for all samples belonging to the same class as the
given tuple.
Use the most probable value to fill in the missing value.
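Two of the strategies above, filling with the overall mean and filling with the mean of the tuple's own class, can be sketched as follows. The records, class labels, and function names are hypothetical, and `None` is used as the assumed marker for a missing value.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("low_risk", 50_000), ("low_risk", None), ("low_risk", 60_000),
    ("high_risk", 20_000), ("high_risk", 30_000), ("high_risk", None),
]


def fill_with_global_mean(rows):
    """Replace each missing value with the mean of all known values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]


def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for label, value in rows:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [(label, class_means[label] if value is None else value)
            for label, value in rows]


print(fill_with_global_mean(records))
print(fill_with_class_mean(records))
```

The class-based fill keeps the two groups apart, whereas the global mean pulls both missing values toward the same figure.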
ii) Noisy Data
2) Aggregation
Sometimes "less is more" and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various store
locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
One way to aggregate transactions for this data set is to replace all the
transactions of a single store with a single storewide transaction. This reduces the
hundreds or thousands of transactions that occur daily at a specific store to a single
daily transaction, and the number of data objects is reduced to the number of stores.
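As an illustration of the store example, the sketch below collapses individual transactions into one total per store per day, and then into one total per store. The transaction tuples, dates, and amounts are made up for the example.

```python
from collections import defaultdict

# Hypothetical transactions: (store, date, product, amount in whole currency units).
transactions = [
    ("Chicago", "2024-01-01", "watch", 26),
    ("Chicago", "2024-01-01", "battery", 5),
    ("Chicago", "2024-01-02", "shoes", 75),
    ("Minneapolis", "2024-01-01", "shoes", 75),
]

# One aggregated "storewide transaction" per store and day.
daily_totals = defaultdict(int)
for store, date, _product, amount in transactions:
    daily_totals[(store, date)] += amount

# Aggregating further over all days leaves one data object per store.
store_totals = defaultdict(int)
for (store, _date), amount in daily_totals.items():
    store_totals[store] += amount

print(dict(daily_totals))
print(dict(store_totals))
```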
3) Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to
be analyzed. In statistics, it has long been used for both the preliminary
investigation of the data and the final data analysis. Sampling can also be very useful
in data mining.
The key principle for effective sampling is the following: Using a sample will work
almost as well as using the entire data set if the sample is representative.
Sampling Approaches
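Simple random sampling: For this type of sampling, there is an equal probability of
selecting any particular item. There are two variations on random sampling (and
other sampling techniques as well):
i) Sampling without replacement: as each item is selected, it is removed from the
set of all objects, so it cannot be selected again.
ii) Sampling with replacement: selected items are not removed, so the same item
can be picked more than once.
Stratified sampling: The data are first split into prespecified groups (strata), and
objects are then sampled from each group.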
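The sampling approaches listed above can be sketched with Python's standard `random` module. The helper names, the population, and the grouping rule are assumptions; the stratified variant shown draws the same number of objects from each group, which is only one possible way to stratify.

```python
import random


def sample_without_replacement(items, k, rng):
    """Each item can be chosen at most once."""
    return rng.sample(items, k)


def sample_with_replacement(items, k, rng):
    """The same item may be chosen more than once."""
    return rng.choices(items, k=k)


def stratified_sample(items, group_of, per_group, rng):
    """Draw up to per_group items from every group defined by group_of."""
    groups = {}
    for item in items:
        groups.setdefault(group_of(item), []).append(item)
    picked = []
    for members in groups.values():
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked


rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(100))     # hypothetical data objects

without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
strata = stratified_sample(population, lambda v: "even" if v % 2 == 0 else "odd", 3, rng)

print(without, with_repl, strata, sep="\n")
```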
4) Dimensionality Reduction
A key benefit of dimensionality reduction is that many data mining algorithms work
better if the dimensionality (the number of attributes in the data) is lower. This is
partly because dimensionality reduction can eliminate irrelevant features and reduce
noise, and partly because of the curse of dimensionality.
5) Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the
features. While it might seem that such an approach would lose information, this is
not the case if redundant and irrelevant features are present. Redundant features
duplicate much or all of the information contained in one or more other attributes.
For example, the purchase price of a product and the amount of sales tax paid
contain much of the same information. Irrelevant features contain almost no useful
information for the data mining task at hand.
There are three standard approaches to feature selection: embedded, filter, and
wrapper.
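As a rough illustration of the filter idea applied to redundancy, the sketch below drops any feature that is almost perfectly correlated with a feature already kept. This heuristic, the 0.95 threshold, the function names, and the sample columns are all assumptions rather than a method prescribed in these notes, and the code assumes no feature is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return list(kept)


# Hypothetical feature columns; sales_tax is 8% of purchase_price.
features = {
    "purchase_price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales_tax": [0.8, 1.6, 2.4, 3.2, 4.0],
    "store_floor": [3.0, 1.0, 2.0, 1.0, 3.0],
}

print(drop_redundant(features))
```

Because the sales tax is a fixed fraction of the purchase price, it adds no new information and is dropped, while the unrelated column is kept.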
10) Explain the types of data cleaning methods and how to overcome noisy data?
11) Write about aggregation? Explain briefly the different types of aggregation
techniques in detail.
ANS) Aggregation
Sometimes "less is more" and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various store
locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
One way to aggregate transactions for this data set is to replace all the
transactions of a single store with a single storewide transaction. This reduces the
hundreds or thousands of transactions that occur daily at a specific store to a single
daily transaction, and the number of data objects is reduced to the number of stores.
There are several motivations for aggregation. First, the smaller data sets
resulting from data reduction require less memory and processing time, and hence,
aggregation may permit the use of more expensive data mining algorithms. Second,
aggregation can act as a change of scope or scale by providing a high-level view of
the data instead of a low-level view.
Types of Aggregation Techniques:
1) Summation (Sum)
Summation is the simplest form of aggregation, where all the values in a dataset are
added together to get a total.
Example: Summing up the sales figures for each day to get the total sales for a month.
2) Averaging (Mean)
Averaging calculates the central tendency of a dataset by summing all the values
and then dividing by the number of values. The mean provides a general idea of the
dataset's typical value.
Example: Calculating the average score of students in a class.
3) Counting
Counting involves determining the number of occurrences of a specific value or the
total number of records in a dataset.
Example: Counting the number of customers who made a purchase in a store.
4) Median
The median is the middle value in a dataset when the values are sorted in
ascending or descending order. It’s a robust measure of central tendency, especially
useful when dealing with outliers.
Example: Finding the median income of a population to understand the typical
income level.
5) Mode
Mode is the value that appears most frequently in a dataset. It’s useful for
categorical data where you want to know the most common category.
Example: Determining the most common age group among participants in a survey.
6) Maximum and Minimum (Max/Min)
These aggregations find the highest and lowest values in a dataset, respectively.
They are used to understand the range of data.
Example: Identifying the highest and lowest temperatures recorded in a year.
7) Range
The range is the difference between the maximum and minimum values in a
dataset. It provides insight into the spread or variability of the data.
Example: The range of prices for a particular product across different stores.
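The aggregation techniques listed above map directly onto Python built-ins and the standard `statistics` module, as in this sketch. The daily sales figures are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical daily sales figures for one week.
daily_sales = [120, 95, 130, 95, 300, 110, 95]

summary = {
    "sum": sum(daily_sales),                       # 1) summation
    "mean": mean(daily_sales),                     # 2) averaging
    "count": len(daily_sales),                     # 3) counting
    "median": median(daily_sales),                 # 4) median
    "mode": mode(daily_sales),                     # 5) mode
    "max": max(daily_sales),                       # 6) maximum
    "min": min(daily_sales),                       # 6) minimum
    "range": max(daily_sales) - min(daily_sales),  # 7) range
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the single large value of 300 pulls the mean above the median, which shows why the median is described as the more robust measure when outliers are present.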
12) Explain in detail about data reduction, data discretization, and binarization.
ANS) In notes.