
Unit 2. Data Preprocessing


2.1. Data Objects and Attribute Types
Data objects are the essential parts of a database. A data object represents an entity and can be viewed as a group of attributes of that entity. For example, a sales database may contain objects representing customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
Data attributes refer to the specific characteristics or properties that describe individual data objects
within a dataset. These attributes provide meaningful information about the objects and are used to
analyze, classify, or manipulate the data.

Types of Attributes
The initial phase of data preprocessing involves categorizing attributes into different types, which serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two main types:
• Qualitative: Nominal (N), Ordinal (O), Binary(B)
• Quantitative: Numeric, Discrete, Continuous

Qualitative Attributes
1. Nominal Attributes
Nominal attributes, as the name suggests, relate to names: they are categorical data whose values represent different categories or labels without any inherent order or ranking. These attributes are often used to represent names or labels associated with objects, entities, or concepts.

Attribute       Values
Colors          Black, Brown, White
Designation     Lecturer, Professor, Assistant Professor
2. Binary Attributes
Binary attributes are a type of qualitative attribute where the data can take on only two distinct values
or states. These attributes are often used to represent yes/no, presence/absence, or true/false conditions
within a dataset. They are particularly useful for representing categorical data where there are only two
possible outcomes. For instance, in a medical study, a binary attribute could represent whether a patient
is affected or unaffected by a particular condition.
Symmetric: In a symmetric binary attribute, both values or states are considered equally important and interchangeable. For example, in the attribute "Gender" with values "Male" and "Female," neither value holds precedence over the other, and both are considered equally significant for analysis purposes.
Asymmetric: In an asymmetric binary attribute, the two values or states are not equally important or interchangeable. For instance, in the attribute "Result" with values "Pass" and "Fail," the states are not of equal importance; passing may hold greater significance than failing in certain contexts, such as academic grading or certification exams.

Type          Attribute          Values
Symmetric     Gender             Male, Female
Asymmetric    Cancer Detected    Yes, No
Asymmetric    Result             Pass, Fail
3. Ordinal Attributes
Ordinal attributes are a type of qualitative attribute where the values possess a meaningful order or
ranking, but the magnitude between values is not precisely quantified. In other words, while the order
of values indicates their relative importance or precedence, the numerical difference between them is
not standardized or known.

Attribute          Values
Grade              A, B, C, D, E, F
Basic Pay Scale    16, 17, 18

Quantitative Attributes
1. Numeric
A numeric attribute is quantitative: it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but it lacks a true reference point, or zero point. Interval-scaled data can be added and subtracted but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice that of another in degrees, we cannot say that one day is twice as hot as the other.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value; for example, a height of 180 cm is twice a height of 90 cm. The values are ordered, and we can compute the difference between values as well as the mean, median, mode, quantile range, and five-number summary.


2. Discrete
Discrete data refer to information that can take on specific, separate values rather than a continuous
range. These values are often distinct and separate from one another, and they can be either numerical
or categorical in nature.

Attribute     Values
Profession    Teacher, Manager, Peon
ZIP Code      44200, 21020
3. Continuous
Continuous data, unlike discrete data, can take on an infinite number of possible values within a given
range. It is characterized by being able to assume any value within a specified interval, often including
fractional or decimal values.

Attribute    Values
Height       5.4, 5.8, 6.0, ...
Weight       68.0, 55.0, 45.5, ...
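
To make the classification concrete, here is a minimal pandas sketch (the column names and values are illustrative, drawn from the tables above) showing how each attribute type might be declared in code:

```python
# Toy dataset illustrating the attribute types above (values are made up).
import pandas as pd

df = pd.DataFrame({
    "color": ["Black", "Brown", "White"],      # nominal: labels, no order
    "grade": ["A", "C", "B"],                  # ordinal: ordered, gaps unknown
    "cancer_detected": [False, True, False],   # binary: two states
    "zip_code": ["44200", "21020", "44200"],   # discrete
    "height": [5.4, 5.8, 6.0],                 # continuous (ratio-scaled)
})

# Declare the ordinal ordering explicitly so comparisons respect rank.
df["grade"] = pd.Categorical(df["grade"], categories=list("FEDCBA"), ordered=True)
print(df.dtypes)
```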

2.2. Statistical Description of Data


• For data preprocessing to be successful, it is essential to have an overall picture of our data.
Basic statistical descriptions can be used to identify properties of the data and highlight which
data values should be treated as noise or outliers.
• For data preprocessing tasks, we want to learn about data characteristics regarding both central
tendency and dispersion of the data.
• Measures of central tendency include mean, median, mode and midrange.
• Measures of data dispersion include quartiles, interquartile range (IQR) and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.

Measuring the Central Tendency


1. Mean
The mean of a data set is the average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.


$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (the sum of the values of the n observations divided by the number of observations in the sample)

$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (the sum of the values of the N observations divided by the number of observations in the population)

2. Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
For an odd number of observations:
7 observations= 26, 18, 27, 12, 14, 29, 19.
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29
The median is the middle value.
Median=19
For an even number of observations
8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order =12, 14, 18, 19, 26, 27, 29, 30
The median is the average of the middle two values.
Median = (19 + 26) / 2 = 22.5

3. Mode
The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
Weighted mean: Sometimes each value in a set is associated with a weight; the weights reflect the significance, importance, or occurrence frequency attached to their respective values. The weighted mean is

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
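
As a quick illustration, here is a minimal Python sketch of these measures of central tendency, reusing the observations from the median example above (the values and weights in the weighted-mean part are made up):

```python
# Central tendency on the observations from the median example above.
from statistics import mean, median, mode

data = [26, 18, 27, 12, 14, 29, 19]
print(mean(data))          # 145 / 7 ≈ 20.71
print(median(data))        # 19, matching the worked example
print(mode([2, 2, 3, 5]))  # 2: the value with the greatest frequency

# Weighted mean: each value weighted by its importance.
values, weights = [70, 80, 90], [1, 2, 3]
print(sum(w * x for w, x in zip(values, weights)) / sum(weights))  # ≈ 83.33
```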

Measuring the Dispersion of Data


An outlier is an observation that lies an abnormal distance from other values in a random sample from
a population.


First quartile (Q1): the value below which 25% of the values fall and above which 75% fall.
Third quartile (Q3): the value below which 75% of the values fall and above which 25% fall.
The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles. If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 − Q1) is called the interquartile range, or IQR.
Range: the difference between the highest and lowest observed values.
Variance: The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). The variance is the average of the squared differences between each data value and the mean:

$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$ (sample) $\qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$ (population)

Standard Deviation
The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it more easily interpreted than the variance. The standard deviation is computed as follows:

$s = \sqrt{s^2}$ (sample) $\qquad \sigma = \sqrt{\sigma^2}$ (population)
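
A short Python sketch of these dispersion measures on the same sample as before (note that statistics.variance uses the sample formula with n − 1):

```python
# Dispersion measures on the sorted observations from the median example.
from statistics import variance, stdev, quantiles

data = [12, 14, 18, 19, 26, 27, 29]
q1, q2, q3 = quantiles(data, n=4)    # quartiles; q2 is the median
print(q3 - q1)                       # interquartile range, IQR = Q3 - Q1
print(max(data) - min(data))         # range
print(variance(data))                # sample variance s^2
print(stdev(data))                   # standard deviation, square root of s^2
```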

Graphic Displays of Basic Statistical Descriptions


• There are many types of graphs for the display of data summaries and distributions, such as Bar
charts, Pie charts, Line graphs, Boxplot, Histograms, Quantile plots and Scatter plots.
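
For instance, a minimal matplotlib sketch (assuming matplotlib is installed) producing two of these displays for the sample used above:

```python
# A boxplot and a histogram of the sample data, side by side.
import matplotlib.pyplot as plt

data = [12, 14, 18, 19, 26, 27, 29]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # median, quartiles, and outliers at a glance
ax1.set_title("Boxplot")
ax2.hist(data, bins=5)   # shape of the distribution
ax2.set_title("Histogram")
plt.show()
```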


2.3. Data Preprocessing Concepts


Data preprocessing is the process of preparing raw data for analysis by cleaning and transforming it into a usable format. In data mining, it refers to preparing raw data for mining by performing tasks like cleaning, transforming, and organizing it into a format suitable for mining algorithms.
• The goal is to improve the quality of the data.
• Helps in handling missing values, removing duplicates, and normalizing data.
• Ensures the accuracy and consistency of the dataset.

2.3.1. Data Cleaning


It is the process of identifying and correcting errors or inconsistencies in the dataset. It involves
handling missing values, removing duplicates, and correcting incorrect or outlier data to ensure the
dataset is accurate and reliable. Clean data is essential for effective analysis, as it improves the quality
of results and enhances the performance of data models.
Missing Values: These occur when data is absent from a dataset. You can either ignore the rows with missing data or fill the gaps manually, with the attribute mean, or with the most probable value. This ensures the dataset remains accurate and complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to interpret, often
caused by errors in data collection or entry. It can be handled in several ways:
Binning Method: The sorted data is partitioned into equal-frequency segments (bins), and each bin is smoothed by replacing its values with the bin mean or the bin boundary values.
Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to
predict values.
Clustering: This method groups similar data points together, with outliers either being undetected or
falling outside the clusters. These techniques help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating repeated data entries to ensure accuracy
and consistency in the dataset. This process prevents errors and ensures reliable analysis by keeping
only unique records.
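
A hypothetical pandas sketch of these cleaning steps (the data and column names are made up):

```python
# Data cleaning: fill missing values, remove duplicates, smooth by bin means.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 31, 120],
                   "city": ["KTM", "PKR", "KTM", "KTM", "PKR"]})

df["age"] = df["age"].fillna(df["age"].mean())   # fill missing with attribute mean
df = df.drop_duplicates()                        # keep only unique records

# Binning: partition values into equal-frequency bins, replace by the bin mean.
bins = pd.qcut(df["age"], q=2)
df["age_smoothed"] = df["age"].groupby(bins).transform("mean")
print(df)
```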

2.3.2. Data Integration


It involves merging data from various sources into a single, unified dataset. It can be challenging due to
differences in data formats, structures, and meanings. Techniques like record linkage and data fusion
help in combining data efficiently, ensuring consistency and accuracy.


Record Linkage is the process of identifying and matching records from different datasets that refer to
the same entity, even if they are represented differently. It helps in combining data from various sources
by finding corresponding records based on common identifiers or attributes.
Data Fusion involves combining data from multiple sources to create a more comprehensive and
accurate dataset. It integrates information that may be inconsistent or incomplete from different
sources, ensuring a unified and richer dataset for analysis.
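
A minimal sketch of record linkage and fusion with pandas (the sources, keys, and values are hypothetical):

```python
# Link records from two sources on a shared identifier, then fuse attributes.
import pandas as pd

crm   = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Bikash"]})
sales = pd.DataFrame({"cust_id": [2, 3], "total": [250.0, 90.5]})

# An outer join keeps records from both sources, even without a match.
unified = crm.merge(sales, on="cust_id", how="outer")
print(unified)
```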

2.3.3. Data Reduction


Data reduction shrinks the dataset's size while maintaining key information. This can be done through feature selection, which chooses the most relevant features, and feature extraction, which transforms the data into a lower-dimensional space while preserving important details. Common reduction techniques include:
Dimensionality Reduction (e.g., Principal Component Analysis): A technique that reduces the number
of variables in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points by methods like sampling to simplify the
dataset without losing critical patterns.
Data Compression: Reducing the size of data by encoding it in a more compact form, making it easier
to store and process.
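
A brief sketch of two of these techniques, assuming scikit-learn and NumPy are available (the data is random and purely illustrative):

```python
# Dimensionality reduction with PCA, and numerosity reduction by sampling.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 10)                  # 1000 objects, 10 attributes

X_pca = PCA(n_components=3).fit_transform(X)  # keep 3 principal components
idx = np.random.choice(len(X), size=100, replace=False)
X_sample = X[idx]                             # a 10% random sample

print(X_pca.shape, X_sample.shape)            # (1000, 3) (100, 10)
```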

2.3.4. Data Transformation


It involves converting data into a format suitable for analysis. Common techniques include
normalization, which scales data to a common range; standardization, which adjusts data to have zero
mean and unit variance; and discretization, which converts continuous data into discrete categories.
These techniques help prepare the data for more accurate analysis.
Data Normalization: The process of scaling data to a common range to ensure consistency across
variables.
Discretization: Converting continuous data into discrete categories for easier analysis.
Data Aggregation: Combining multiple data points into a summary form, such as averages or totals, to
simplify analysis.
Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to provide a higher-
level view for better understanding and analysis.
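
A minimal NumPy sketch of normalization, standardization, and discretization (the values and bin edges are made up):

```python
# Data transformation: min-max normalization, z-score standardization,
# and discretization of a continuous attribute into categories.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

x_norm = (x - x.min()) / (x.max() - x.min())  # scaled to the range [0, 1]
x_std  = (x - x.mean()) / x.std()             # zero mean, unit variance
x_disc = np.digitize(x, [15, 35])             # bins: low (0), mid (1), high (2)

print(x_norm, x_std, x_disc, sep="\n")
```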

Advantages of Data Preprocessing


• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.


• Better Model Performance: Reduces noise and irrelevant data, leading to more accurate
predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better business
decisions.

Disadvantages of Data Preprocessing


• Time-Consuming: Requires significant time and effort to clean, transform, and organize data.
• Resource-Intensive: Demands computational power and skilled personnel for complex
preprocessing tasks.
• Potential Data Loss: Incorrect handling may result in losing valuable information.
• Complexity: Handling large datasets or diverse formats can be challenging.

Uses of Data Preprocessing


Data Warehousing: In data warehousing, preprocessing is essential for cleaning, integrating, and
structuring data before it is stored in a centralized repository. This ensures the data is consistent and
reliable for future queries and reporting.
Data Mining: Data preprocessing in data mining involves cleaning and transforming raw data to make
it suitable for analysis. This step is crucial for identifying patterns and extracting insights from large
datasets.
Machine Learning: In machine learning, preprocessing prepares raw data for model training. This
includes handling missing values, normalizing features, encoding categorical variables, and splitting
datasets into training and testing sets to improve model performance and accuracy.
Data Science: Data preprocessing is a fundamental step in data science projects, ensuring that the data
used for analysis or building predictive models is clean, structured, and relevant. It enhances the overall
quality of insights derived from the data.
Web Mining: In web mining, preprocessing helps analyze web usage logs to extract meaningful user
behavior patterns. This can inform marketing strategies and improve user experience through
personalized recommendations.
Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning data to create
dashboards and reports that provide actionable insights for decision-makers.
Deep Learning: Similar to machine learning, deep learning applications require
preprocessing to normalize or enhance features of the input data, optimizing model training processes.
