Unit 2 Data Preprocessing
Types of Attributes
The initial phase of data preprocessing involves categorizing attributes into different types,
which serves as a foundation for subsequent data processing steps. Attributes can be broadly classified
into two main types:
• Qualitative: Nominal (N), Ordinal (O), Binary (B)
• Quantitative: Numeric, Discrete, Continuous
Qualitative Attributes
1. Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are categories or labels without
any inherent order or ranking. These attributes are often used to represent names or labels associated
with objects, entities, or concepts.
Attribute          Values
Colors             Black, Brown, White
Categorical Data   Lecturer, Professor, Assistant Professor
2. Binary Attributes
Binary attributes are a type of qualitative attribute where the data can take on only two distinct values
or states. These attributes are often used to represent yes/no, presence/absence, or true/false conditions
within a dataset. They are particularly useful for representing categorical data where there are only two
possible outcomes. For instance, in a medical study, a binary attribute could represent whether a patient
is affected or unaffected by a particular condition.
Symmetric: In a symmetric attribute, both values or states are considered equally important or
interchangeable. For example, in the attribute “Gender” with values “Male” and “Female,” neither
value holds precedence over the other, and they are considered equally significant for analysis
purposes.
Asymmetric: An asymmetric attribute indicates that the two values or states are not equally important
or interchangeable. For instance, in the attribute “Result” with values “Pass” and “Fail,” the states are
not of equal importance; passing may hold greater significance than failing in certain contexts, such as
academic grading or certification exams.
Type         Attribute         Values
Symmetric    Gender            Male, Female
Asymmetric   Cancer Detected   Yes, No
Asymmetric   Result            Pass, Fail
3. Ordinal Attributes
Ordinal attributes are a type of qualitative attribute where the values possess a meaningful order or
ranking, but the magnitude between values is not precisely quantified. In other words, while the order
of values indicates their relative importance or precedence, the numerical difference between them is
not standardized or known.
Attribute         Values
Grade             A, B, C, D, E, F
Basic Pay Scale   16, 17, 18
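The difference between nominal and ordinal attributes can be illustrated with a minimal Python sketch. The rank mapping `GRADE_RANK` and the sample lists are assumptions made for illustration; the ranks encode only position (A highest, F lowest), not how far apart the grades are.

```python
from statistics import median_low

# Nominal: only equality and counting are meaningful; there is no order.
colors = ["Black", "Brown", "White", "Brown"]  # hypothetical sample
color_counts = {c: colors.count(c) for c in set(colors)}

# Ordinal: an explicit rank (an assumption of this sketch) makes the order usable.
GRADE_RANK = {"F": 0, "E": 1, "D": 2, "C": 3, "B": 4, "A": 5}
RANK_GRADE = {rank: grade for grade, rank in GRADE_RANK.items()}

grades = ["B", "A", "D", "C", "B"]  # hypothetical sample

# Sorting and the median are meaningful for ordinal data ...
sorted_grades = sorted(grades, key=GRADE_RANK.__getitem__)
median_grade = RANK_GRADE[median_low(GRADE_RANK[g] for g in grades)]

# ... but an arithmetic mean of the ranks would assume equal spacing between
# grades, which an ordinal scale does not guarantee.
print(color_counts["Brown"])  # 2
print(sorted_grades)          # ['D', 'C', 'B', 'B', 'A']
print(median_grade)           # B
```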
Quantitative Attributes
1. Numeric
A numeric attribute is quantitative: it is a measurable quantity, represented by integer or real
values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but it has no true zero
point (no absolute reference point). Interval-scaled values can be added and subtracted, but they cannot
be meaningfully multiplied or divided. Consider temperature in degrees Celsius: if the temperature on
one day is twice that of another day, we cannot say that the first day is twice as hot as the second.
A ratio-scaled attribute is a numeric attribute with a fixed (true) zero point. If a measurement is
ratio-scaled, we can speak of one value as being a multiple (or ratio) of another value. The values are
ordered, and we can also compute the difference between values, as well as the mean, median, mode,
interquartile range, and five-number summary.
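The temperature example can be made concrete with a short Python sketch. The two daily temperatures are hypothetical values chosen for illustration; the Kelvin scale is used here as a ratio-scaled counterpart to Celsius because its zero point is absolute.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius reading to the ratio-scaled Kelvin scale."""
    return celsius + 273.15

day_a_c, day_b_c = 20.0, 10.0  # hypothetical readings in degrees Celsius

# Differences are meaningful on an interval scale.
difference_c = day_a_c - day_b_c

# The Celsius ratio is 2.0, but it depends on where the scale puts its zero.
naive_ratio = day_a_c / day_b_c

# On the Kelvin scale (true zero) the ratio is only about 1.035.
kelvin_ratio = celsius_to_kelvin(day_a_c) / celsius_to_kelvin(day_b_c)

print(difference_c)            # 10.0
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.035
```

The ratio changes with the choice of zero point, which is why "twice as hot" is not a valid statement for Celsius readings.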
2. Discrete
Discrete data refer to information that can take on specific, separate values rather than a continuous
range. These values are often distinct and separate from one another, and they can be either numerical
or categorical in nature.
Attribute    Values
Profession   Teacher, Manager, Peon
ZIP code     44200, 21020
3. Continuous
Continuous data, unlike discrete data, can take on an infinite number of possible values within a given
range. It is characterized by being able to assume any value within a specified interval, often including
fractional or decimal values.
Attribute   Values
Height      5.4, 5.8, 6.0, etc.
Weight      68.0, 55.0, 45.5, etc.
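1. Mean
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$ (sum of the values of the n observations, divided by the number of observations in the sample)
Population mean: $\mu = \frac{\sum x_i}{N}$ (sum of the values of the N observations, divided by the number of observations in the population)
2. Median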
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
For an odd number of observations:
7 observations = 26, 18, 27, 12, 14, 29, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29
The median is the middle value.
Median = 19
For an even number of observations:
8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
The median is the average of the middle two values.
Median = (19 + 26) / 2 = 22.5
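The two worked examples above can be checked with a minimal Python sketch. The helper `median_by_hand` is a hypothetical name introduced here; it follows the sort-then-pick-the-middle procedure and is compared against the standard library's `statistics.median`. The list `with_outlier` is an assumed variation of the data, added only to show why the median is preferred when extreme values are present.

```python
from statistics import mean, median

odd_sample = [26, 18, 27, 12, 14, 29, 19]
even_sample = [26, 18, 29, 12, 14, 27, 30, 19]

def median_by_hand(values):
    """Sort the values, then take the middle one (or the average of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(odd_sample))   # 19
print(median_by_hand(even_sample))  # 22.5

# Assumed variation: replace the largest value (30) with an extreme value (300).
with_outlier = [26, 18, 29, 12, 14, 27, 300, 19]
print(mean(even_sample), mean(with_outlier))      # 21.875 55.625
print(median(even_sample), median(with_outlier))  # 22.5 22.5
```

The extreme value more than doubles the mean but leaves the median unchanged.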
3. Mode
The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can
occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the
data have more than two modes, the data are multimodal.
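As a minimal sketch, the standard library's `statistics.multimode` (Python 3.8+) returns every value that shares the greatest frequency. The helper `describe_modes` and the three sample lists are hypothetical, introduced only to illustrate the terminology.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label based on how many there are."""
    modes = multimode(values)
    if len(modes) == 1:
        kind = "unimodal"
    elif len(modes) == 2:
        kind = "bimodal"
    else:
        kind = "multimodal"
    return modes, kind

print(describe_modes([2, 3, 3, 5, 7]))     # ([3], 'unimodal')
print(describe_modes([1, 1, 2, 4, 4]))     # ([1, 4], 'bimodal')
print(describe_modes([1, 1, 2, 2, 3, 3]))  # ([1, 2, 3], 'multimodal')
```

Note that when every value occurs exactly once, `multimode` returns all of them, so this simple labeling would call such data multimodal; in practice such data are usually described as having no mode.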
Weighted mean: Sometimes each value in a set is associated with a weight. The weights reflect the
significance, importance, or occurrence frequency attached to their respective values.
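A weighted mean is commonly computed as the sum of each value times its weight, divided by the sum of the weights. The sketch below assumes that definition; the function name `weighted_mean` and the sample scores, weights, and frequencies are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Weights as importance: three hypothetical scores weighted 3, 2, and 1.
importance_mean = weighted_mean([80, 90, 70], [3, 2, 1])
print(round(importance_mean, 2))  # 81.67

# Weights as occurrence frequency: the value 1 occurs twice, 2 and 3 once each,
# which matches the ordinary mean of [1, 1, 2, 3].
frequency_mean = weighted_mean([1, 2, 3], [2, 1, 1])
print(frequency_mean)  # 1.75
```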
First quartile (Q1): The first quartile is the value for which 25% of the values are smaller than Q1
and 75% are larger.
Third quartile (Q3): The third quartile is the value for which 75% of the values are smaller than Q3
and 25% are larger.
The box plot is a useful graphical display for describing the behavior of the data in the middle as well
as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles. If
the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the
interquartile range (IQR).
Range: Difference between highest and lowest observed values
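The quantities a box plot is built from can be computed with a short sketch using `statistics.quantiles` (Python 3.8+). The data are the seven observations from the median example. Quartile conventions differ between textbooks and libraries; this sketch assumes the function's default "exclusive" method, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

data = [26, 18, 27, 12, 14, 29, 19]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1                       # interquartile range
data_range = max(data) - min(data)  # highest minus lowest observed value
five_number_summary = (min(data), q1, q2, q3, max(data))

print(q1, q2, q3)           # 14.0 19.0 27.0
print(iqr)                  # 13.0
print(data_range)           # 17
print(five_number_summary)  # (12, 14.0, 19.0, 27.0, 29)
```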
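Variance: The variance is a measure of variability that utilizes all the data. It is based on the difference
between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
The variance is the average of the squared differences between each data value and the mean. For a
population, the sum of squared differences is divided by N; for a sample, it is divided by n - 1:
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$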
Standard Deviation
The standard deviation of a data set is the positive square root of the variance. It is measured in the
same units as the data, making it more easily interpreted than the variance.
The standard deviation is computed as follows:
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
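A minimal Python sketch of the variance and standard deviation follows, assuming the usual convention of dividing by n - 1 for a sample and by N for a population. The helper names `sample_variance` and `population_variance` are hypothetical; the results are compared against the standard library's `statistics` functions. The data are the seven observations from the median example.

```python
from math import isclose, sqrt
from statistics import mean, pstdev, pvariance, stdev, variance

data = [26, 18, 27, 12, 14, 29, 19]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

# The standard deviation is the positive square root of the variance.
sample_sd = sqrt(sample_variance(data))
population_sd = sqrt(population_variance(data))

# The hand-written versions agree with the standard library.
print(isclose(sample_variance(data), variance(data)))       # True
print(isclose(population_variance(data), pvariance(data)))  # True
print(isclose(sample_sd, stdev(data)))                      # True
print(isclose(population_sd, pstdev(data)))                 # True
```

Because the sample version divides by a smaller number, the sample variance is always at least as large as the population variance computed from the same values.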
Record Linkage is the process of identifying and matching records from different datasets that refer to
the same entity, even if they are represented differently. It helps in combining data from various sources
by finding corresponding records based on common identifiers or attributes.
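One simple way to sketch record linkage is exact matching on normalized identifiers. The example below is an illustration under stated assumptions, not a complete method: the field names (`name`, `birth_date`), the `normalize` and `link_records` helpers, and both record lists are hypothetical, and real systems often add approximate (fuzzy) matching.

```python
def normalize(text: str) -> str:
    """Lowercase, drop periods and commas, and collapse repeated spaces."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def link_records(left, right, keys=("name", "birth_date")):
    """Pair records from two datasets whose normalized key fields are equal."""
    index = {}
    for record in right:
        key = tuple(normalize(record[k]) for k in keys)
        index.setdefault(key, []).append(record)
    matches = []
    for record in left:
        key = tuple(normalize(record[k]) for k in keys)
        for candidate in index.get(key, []):
            matches.append((record, candidate))
    return matches

# Hypothetical datasets that write the same person differently.
hr_records = [
    {"name": "Dr. Ayesha Khan", "birth_date": "1985-03-12", "department": "CS"},
]
payroll_records = [
    {"name": "dr ayesha  khan", "birth_date": "1985-03-12", "pay_scale": "18"},
    {"name": "Ali Raza", "birth_date": "1990-07-01", "pay_scale": "17"},
]

links = link_records(hr_records, payroll_records)
print(len(links))                # 1
print(links[0][1]["pay_scale"])  # 18
```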
Data Fusion involves combining data from multiple sources to create a more comprehensive and
accurate dataset. It integrates information that may be inconsistent or incomplete from different
sources, ensuring a unified and richer dataset for analysis.
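Data fusion can be sketched as merging several records of the same entity into one, resolving gaps and conflicts by a rule. The rule assumed here is "take the first non-missing value, trying sources in a fixed priority order"; that rule, the `fuse` helper, the `source` field, and the sample records are all hypothetical choices for illustration.

```python
def fuse(records, source_priority):
    """Merge records of one entity, preferring values from higher-priority sources."""
    ordered = sorted(records, key=lambda r: source_priority.index(r["source"]))
    fused = {}
    for record in ordered:
        for field, value in record.items():
            if field == "source":
                continue
            # Keep the first non-missing value seen for each field.
            if value not in (None, "") and field not in fused:
                fused[field] = value
    return fused

# Hypothetical records for the same customer from two systems.
records = [
    {"source": "crm", "name": "A. Khan", "email": None, "city": "Lahore"},
    {"source": "billing", "name": "Ayesha Khan", "email": "a.khan@example.com", "city": ""},
]

fused = fuse(records, source_priority=["billing", "crm"])
print(fused)
# {'name': 'Ayesha Khan', 'email': 'a.khan@example.com', 'city': 'Lahore'}
```

The fused record takes the name and email from the preferred source and fills the missing city from the other one, giving a more complete record than either source alone.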
• Better Model Performance: Reduces noise and irrelevant data, leading to more accurate
predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better business
decisions.