
Unit 2. Data Preprocessing


2.1. Data Objects and Attribute Types
Data objects are the essential parts of a database. A data object represents an entity and can be viewed as a group of attributes of that entity. For example, a sales database may contain objects representing customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
Data attributes refer to the specific characteristics or properties that describe individual data objects
within a dataset. These attributes provide meaningful information about the objects and are used to
analyze, classify, or manipulate the data.

Types of Attributes
The initial phase of data preprocessing involves categorizing attributes into different types, which serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two main types:
• Qualitative: Nominal (N), Ordinal (O), Binary(B)
• Quantitative: Numeric, Discrete, Continuous

Qualitative Attributes
1. Nominal Attributes
Nominal attributes, as the name suggests, relate to names: they are categorical data whose values represent different categories or labels without any inherent order or ranking. These attributes are often used to represent names or labels associated with objects, entities, or concepts.

Attribute       Values
Colors          Black, Brown, White
Designation     Lecturer, Professor, Assistant Professor
2. Binary Attributes
Binary attributes are a type of qualitative attribute where the data can take on only two distinct values
or states. These attributes are often used to represent yes/no, presence/absence, or true/false conditions
within a dataset. They are particularly useful for representing categorical data where there are only two
possible outcomes. For instance, in a medical study, a binary attribute could represent whether a patient
is affected or unaffected by a particular condition.
Symmetric: In a symmetric binary attribute, both values or states are considered equally important and interchangeable. For example, in the attribute "Gender" with values "Male" and "Female," neither value holds precedence over the other, and both are considered equally significant for analysis purposes.
Asymmetric: In an asymmetric binary attribute, the two values or states are not equally important or interchangeable. For instance, in the attribute "Result" with values "Pass" and "Fail," the states are not of equal importance; passing may hold greater significance than failing in certain contexts, such as academic grading or certification exams.

Type          Attribute          Values
Symmetric     Gender             Male, Female
Asymmetric    Cancer Detected    Yes, No
Asymmetric    Result             Pass, Fail
3. Ordinal Attributes
Ordinal attributes are a type of qualitative attribute where the values possess a meaningful order or
ranking, but the magnitude between values is not precisely quantified. In other words, while the order
of values indicates their relative importance or precedence, the numerical difference between them is
not standardized or known.

Attribute          Values
Grade              A, B, C, D, E, F
Basic Pay Scale    16, 17, 18

Quantitative Attributes
1. Numeric
A numeric attribute is quantitative: it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but it lacks a true reference point, or zero point. Interval-scaled data can be added and subtracted but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice that of another in degrees, we cannot say that one day is twice as hot as the other.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value; for example, a height of 180 cm is twice a height of 90 cm. The values are ordered, and we can compute the difference between values as well as the mean, median, mode, quantile range, and five-number summary.


2. Discrete
Discrete data refer to information that can take on specific, separate values rather than a continuous
range. These values are often distinct and separate from one another, and they can be either numerical
or categorical in nature.

Attribute     Values
Profession    Teacher, Manager, Peon
ZIP Code      44200, 21020
3. Continuous
Continuous data, unlike discrete data, can take on an infinite number of possible values within a given
range. It is characterized by being able to assume any value within a specified interval, often including
fractional or decimal values.

Attribute    Values
Height       5.4, 5.8, 6.0, ...
Weight       68.0, 55.0, 45.5, ...
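
To make the classification concrete, here is a minimal pandas sketch (the column names and values are illustrative, drawn from the tables above) showing how each attribute type might be declared in code:

```python
# Toy dataset illustrating the attribute types above (values are made up).
import pandas as pd

df = pd.DataFrame({
    "color": ["Black", "Brown", "White"],      # nominal: labels, no order
    "grade": ["A", "C", "B"],                  # ordinal: ordered, gaps unknown
    "cancer_detected": [False, True, False],   # binary: two states
    "zip_code": ["44200", "21020", "44200"],   # discrete
    "height": [5.4, 5.8, 6.0],                 # continuous (ratio-scaled)
})

# Declare the ordinal ordering explicitly so comparisons respect rank.
df["grade"] = pd.Categorical(df["grade"], categories=list("FEDCBA"), ordered=True)
print(df.dtypes)
```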

2.2. Statistical Description of Data


• For data preprocessing to be successful, it is essential to have an overall picture of our data.
Basic statistical descriptions can be used to identify properties of the data and highlight which
data values should be treated as noise or outliers.
• For data preprocessing tasks, we want to learn about data characteristics regarding both central
tendency and dispersion of the data.
• Measures of central tendency include mean, median, mode and midrange.
• Measures of data dispersion include quartiles, interquartile range (IQR) and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.

Measuring the Central Tendency


1. Mean
The mean of a data set is the average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.


$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (the sum of the values of the n observations divided by the number of observations in the sample)

$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (the sum of the values of the N observations divided by the number of observations in the population)

2. Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
For an odd number of observations:
7 observations= 26, 18, 27, 12, 14, 29, 19.
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29
The median is the middle value.
Median=19
For an even number of observations
8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order =12, 14, 18, 19, 26, 27, 29, 30
The median is the average of the middle two values.
Median = (19 + 26) / 2 = 22.5

3. Mode
The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
Weighted mean: Sometimes each value in a set is associated with a weight; the weights reflect the significance, importance, or occurrence frequency attached to their respective values. The weighted mean is

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
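
As a quick illustration, here is a minimal Python sketch of these measures of central tendency, reusing the observations from the median example above (the values and weights in the weighted-mean part are made up):

```python
# Central tendency on the observations from the median example above.
from statistics import mean, median, mode

data = [26, 18, 27, 12, 14, 29, 19]
print(mean(data))          # 145 / 7 ≈ 20.71
print(median(data))        # 19, matching the worked example
print(mode([2, 2, 3, 5]))  # 2: the value with the greatest frequency

# Weighted mean: each value weighted by its importance.
values, weights = [70, 80, 90], [1, 2, 3]
print(sum(w * x for w, x in zip(values, weights)) / sum(weights))  # ≈ 83.33
```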

Measuring the Dispersion of Data


An outlier is an observation that lies an abnormal distance from other values in a random sample from
a population.


First quartile (Q1): the value below which 25% of the values fall and above which 75% fall.
Third quartile (Q3): the value below which 75% of the values fall and above which 25% fall.
The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles. If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 − Q1) is called the interquartile range, or IQR.
Range: the difference between the highest and lowest observed values.
Variance: The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). The variance is the average of the squared differences between each data value and the mean:

$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$ (sample) $\qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$ (population)

Standard Deviation
The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it more easily interpreted than the variance. The standard deviation is computed as follows:

$s = \sqrt{s^2}$ (sample) $\qquad \sigma = \sqrt{\sigma^2}$ (population)
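
A short Python sketch of these dispersion measures on the same sample as before (note that statistics.variance uses the sample formula with n − 1):

```python
# Dispersion measures on the sorted observations from the median example.
from statistics import variance, stdev, quantiles

data = [12, 14, 18, 19, 26, 27, 29]
q1, q2, q3 = quantiles(data, n=4)    # quartiles; q2 is the median
print(q3 - q1)                       # interquartile range, IQR = Q3 - Q1
print(max(data) - min(data))         # range
print(variance(data))                # sample variance s^2
print(stdev(data))                   # standard deviation, square root of s^2
```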

Graphic Displays of Basic Statistical Descriptions


• There are many types of graphs for the display of data summaries and distributions, such as Bar
charts, Pie charts, Line graphs, Boxplot, Histograms, Quantile plots and Scatter plots.
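
For instance, a minimal matplotlib sketch (assuming matplotlib is installed) producing two of these displays for the sample used above:

```python
# A boxplot and a histogram of the sample data, side by side.
import matplotlib.pyplot as plt

data = [12, 14, 18, 19, 26, 27, 29]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # median, quartiles, and outliers at a glance
ax1.set_title("Boxplot")
ax2.hist(data, bins=5)   # shape of the distribution
ax2.set_title("Histogram")
plt.show()
```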


2.3. Data Preprocessing Concepts


Data preprocessing is the process of preparing raw data for analysis by cleaning and transforming it into a usable format. In data mining, it refers to preparing raw data for mining by performing tasks like cleaning, transforming, and organizing it into a format suitable for mining algorithms.
• The goal is to improve the quality of the data.
• Helps in handling missing values, removing duplicates, and normalizing data.
• Ensures the accuracy and consistency of the dataset.

2.3.1. Data Cleaning


It is the process of identifying and correcting errors or inconsistencies in the dataset. It involves
handling missing values, removing duplicates, and correcting incorrect or outlier data to ensure the
dataset is accurate and reliable. Clean data is essential for effective analysis, as it improves the quality
of results and enhances the performance of data models.
Missing Values: These occur when data is absent from a dataset. You can either ignore the rows with missing data or fill the gaps manually, with the attribute mean, or with the most probable value. This ensures the dataset remains accurate and complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to interpret, often
caused by errors in data collection or entry. It can be handled in several ways:
Binning Method: The sorted data is partitioned into equal-frequency segments (bins), and each bin is smoothed by replacing its values with the bin mean or the bin boundary values.
Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to
predict values.
Clustering: This method groups similar data points together, with outliers either being undetected or
falling outside the clusters. These techniques help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating repeated data entries to ensure accuracy
and consistency in the dataset. This process prevents errors and ensures reliable analysis by keeping
only unique records.
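
A hypothetical pandas sketch of these cleaning steps (the data and column names are made up):

```python
# Data cleaning: fill missing values, remove duplicates, smooth by bin means.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 31, 120],
                   "city": ["KTM", "PKR", "KTM", "KTM", "PKR"]})

df["age"] = df["age"].fillna(df["age"].mean())   # fill missing with attribute mean
df = df.drop_duplicates()                        # keep only unique records

# Binning: partition values into equal-frequency bins, replace by the bin mean.
bins = pd.qcut(df["age"], q=2)
df["age_smoothed"] = df["age"].groupby(bins).transform("mean")
print(df)
```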

2.3.2. Data Integration


It involves merging data from various sources into a single, unified dataset. It can be challenging due to
differences in data formats, structures, and meanings. Techniques like record linkage and data fusion
help in combining data efficiently, ensuring consistency and accuracy.


Record Linkage is the process of identifying and matching records from different datasets that refer to
the same entity, even if they are represented differently. It helps in combining data from various sources
by finding corresponding records based on common identifiers or attributes.
Data Fusion involves combining data from multiple sources to create a more comprehensive and
accurate dataset. It integrates information that may be inconsistent or incomplete from different
sources, ensuring a unified and richer dataset for analysis.
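
A minimal sketch of record linkage and fusion with pandas (the sources, keys, and values are hypothetical):

```python
# Link records from two sources on a shared identifier, then fuse attributes.
import pandas as pd

crm   = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Bikash"]})
sales = pd.DataFrame({"cust_id": [2, 3], "total": [250.0, 90.5]})

# An outer join keeps records from both sources, even without a match.
unified = crm.merge(sales, on="cust_id", how="outer")
print(unified)
```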

2.3.3. Data Reduction


Data reduction shrinks the dataset's size while maintaining key information. This can be done through feature selection, which chooses the most relevant features, and feature extraction, which transforms the data into a lower-dimensional space while preserving important details. Common reduction techniques include:
Dimensionality Reduction (e.g., Principal Component Analysis): A technique that reduces the number
of variables in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points by methods like sampling to simplify the
dataset without losing critical patterns.
Data Compression: Reducing the size of data by encoding it in a more compact form, making it easier
to store and process.
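
A brief sketch of two of these techniques, assuming scikit-learn and NumPy are available (the data is random and purely illustrative):

```python
# Dimensionality reduction with PCA, and numerosity reduction by sampling.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 10)                  # 1000 objects, 10 attributes

X_pca = PCA(n_components=3).fit_transform(X)  # keep 3 principal components
idx = np.random.choice(len(X), size=100, replace=False)
X_sample = X[idx]                             # a 10% random sample

print(X_pca.shape, X_sample.shape)            # (1000, 3) (100, 10)
```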

2.3.4. Data Transformation


It involves converting data into a format suitable for analysis. Common techniques include
normalization, which scales data to a common range; standardization, which adjusts data to have zero
mean and unit variance; and discretization, which converts continuous data into discrete categories.
These techniques help prepare the data for more accurate analysis.
Data Normalization: The process of scaling data to a common range to ensure consistency across
variables.
Discretization: Converting continuous data into discrete categories for easier analysis.
Data Aggregation: Combining multiple data points into a summary form, such as averages or totals, to
simplify analysis.
Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to provide a higher-
level view for better understanding and analysis.
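
A minimal NumPy sketch of normalization, standardization, and discretization (the values and bin edges are made up):

```python
# Data transformation: min-max normalization, z-score standardization,
# and discretization of a continuous attribute into categories.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

x_norm = (x - x.min()) / (x.max() - x.min())  # scaled to the range [0, 1]
x_std  = (x - x.mean()) / x.std()             # zero mean, unit variance
x_disc = np.digitize(x, [15, 35])             # bins: low (0), mid (1), high (2)

print(x_norm, x_std, x_disc, sep="\n")
```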

Advantages of Data Preprocessing


• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.


• Better Model Performance: Reduces noise and irrelevant data, leading to more accurate
predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better business
decisions.

Disadvantages of Data Preprocessing


• Time-Consuming: Requires significant time and effort to clean, transform, and organize data.
• Resource-Intensive: Demands computational power and skilled personnel for complex
preprocessing tasks.
• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.

Uses of Data Preprocessing


Data Warehousing: In data warehousing, preprocessing is essential for cleaning, integrating, and
structuring data before it is stored in a centralized repository. This ensures the data is consistent and
reliable for future queries and reporting.
Data Mining: Data preprocessing in data mining involves cleaning and transforming raw data to make
it suitable for analysis. This step is crucial for identifying patterns and extracting insights from large
datasets.
Machine Learning: In machine learning, preprocessing prepares raw data for model training. This
includes handling missing values, normalizing features, encoding categorical variables, and splitting
datasets into training and testing sets to improve model performance and accuracy.
Data Science: Data preprocessing is a fundamental step in data science projects, ensuring that the data
used for analysis or building predictive models is clean, structured, and relevant. It enhances the overall
quality of insights derived from the data.
Web Mining: In web mining, preprocessing helps analyze web usage logs to extract meaningful user
behavior patterns. This can inform marketing strategies and improve user experience through
personalized recommendations.
Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning data to create
dashboards and reports that provide actionable insights for decision-makers.
Deep Learning: Similar to machine learning, deep learning applications require
preprocessing to normalize or enhance features of the input data, optimizing model training processes.
