FDS CH 3
• Data preprocessing is the method of collecting raw data and translating it into usable/meaningful information.
• Data preprocessing is required to improve the quality of data.
DATA OBJECTS :
• Data is a collection of data objects and their attributes.
• A collection of attributes describes an object.
• Data objects can also be referred to as samples, examples, instances, cases, entities, data points or objects.
Data Attributes :
• A data attribute is a single-value descriptor for a data object.
• An attribute is a property or characteristic of an object.
• There are broadly four types of attributes: nominal, binary, ordinal and numeric.
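As a small illustration of the four attribute types, a data object can be sketched as a record in which each field holds a single value. The `Patient` class, its field names and the `SEVERITY_ORDER` list are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass

# Allowed categories for the ordinal attribute, listed from lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high"]


@dataclass
class Patient:
    blood_group: str   # nominal: named categories with no order
    smoker: bool       # binary: only two possible states
    severity: str      # ordinal: categories with a meaningful order
    age: int           # numeric: a measurable quantity


def severity_rank(patient: Patient) -> int:
    """Return the position of the patient's severity in SEVERITY_ORDER."""
    return SEVERITY_ORDER.index(patient.severity)


p1 = Patient(blood_group="O+", smoker=False, severity="low", age=34)
p2 = Patient(blood_group="AB-", smoker=True, severity="high", age=58)

# Ordinal values can be ranked; nominal values can only be tested for equality.
print(severity_rank(p1) < severity_rank(p2))  # True
print(p1.blood_group == p2.blood_group)       # False
```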
DATA QUALITY :
• Data quality can be defined as, “the ability of a given data set to serve an intended purpose”.
• Data preprocessing is responsible for maintaining the quality of data.
• There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness,
believability, and interpretability.
• There are many reasons for inaccurate, incomplete, and inconsistent data in real-world databases and data
warehouses.
Inaccuracy:
• Inaccurate data means having incorrect attribute values.
Data Cleaning :
• Data cleaning is used to handle missing data.
• Data cleaning is also known as data scrubbing.
• Data cleaning is the process of correcting or removing incorrect, incomplete or duplicate data within a dataset.
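A minimal sketch of removing incomplete and duplicate records is given here. It assumes each record is a dictionary and that a missing value is stored as `None`. The function name `clean_records` and the sample data are hypothetical.

```python
def clean_records(records, required_fields):
    """Drop records with a missing required field, then drop records that
    repeat the same values for the required fields."""
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete: a required field is absent or set to None.
        if any(record.get(field) is None for field in required_fields):
            continue
        # Duplicate: the same required-field values were already kept.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"id": 1, "name": "Asha", "age": 21},
    {"id": 1, "name": "Asha", "age": 21},    # duplicate of the first record
    {"id": 2, "name": "Ravi", "age": None},  # incomplete: age is missing
    {"id": 3, "name": "Meera", "age": 23},
]

print(clean_records(raw, ["id", "name", "age"]))
# [{'id': 1, 'name': 'Asha', 'age': 21}, {'id': 3, 'name': 'Meera', 'age': 23}]
```

Removing records is only one option. Incorrect values can instead be corrected when the right value is known.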
Missing Values :
• Some values in the data may not be filled up for various reasons and hence are considered missing.
• There can be three cases of missing data:
Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR).
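One common way to handle a missing numeric value is to fill it with the mean of the observed values (mean imputation). These notes do not name this technique, so it is shown only as an illustration. It is a simple choice that can bias results when values are not missing completely at random. The sketch assumes missing values are stored as `None`, and the name `impute_mean` is hypothetical.

```python
from statistics import fmean


def impute_mean(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


marks = [4.0, None, 6.0, None, 8.0]
print(impute_mean(marks))  # [4.0, 6.0, 6.0, 6.0, 8.0]
```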
Data Transformation :
• Data transformation is the process of converting raw data into structured data.
• Data transformation is a data preprocessing technique that transforms the data into alternate forms.
• Data transformation is a process of converting raw data into a single and easy-to-read format.
• Data transformation is the process of changing the format, structure, or values of data.
Rescaling:
• Rescaling means transforming the data so that it fits within a specific scale, like 0-100 or 0-1.
• Rescaling of data allows scaling all data values to lie between a specified minimum and maximum value.
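Rescaling to a chosen minimum and maximum can be sketched as below. The function name `rescale` and its parameters `new_min` and `new_max` are hypothetical. Mapping every value to `new_min` when all inputs are equal is an assumption made to avoid division by zero.

```python
def rescale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale (assumption).
        return [float(new_min) for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


scores = [10, 20, 30, 40, 50]
print(rescale(scores))          # [0.0, 0.25, 0.5, 0.75, 1.0]
print(rescale(scores, 0, 100))  # [0.0, 25.0, 50.0, 75.0, 100.0]
```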
Normalizing:
• To avoid dependence on the choice of measurement units, the data should be normalized.
• Normalization scales the data to fall within a smaller range, such as 0.0 to 1.0 or -1.0 to 1.0.
• Normalizing the data attempts to give all attributes an equal weight.
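The point about measurement units can be checked with a short sketch. It assumes min-max normalization to the range 0.0 to 1.0, which is one possible method and is not fixed by these notes. The function name and the height data are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to the range 0.0 to 1.0 using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread in the data, so map everything to 0.0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 200.0]
heights_m = [h / 100 for h in heights_cm]  # same people, different unit

norm_cm = min_max_normalize(heights_cm)
norm_m = min_max_normalize(heights_m)

# The normalized values agree (up to floating-point rounding) in both units.
for a, b in zip(norm_cm, norm_m):
    print(round(a, 6), round(b, 6))
```

Because both versions end up on the same 0.0 to 1.0 scale, an attribute recorded in centimetres no longer outweighs one recorded in metres.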
Binarizing:
• It is the process of converting data to either 0 or 1 based on a threshold value.
• All the data values above the threshold value are marked 1 whereas all the data values equal to or below
the threshold value are marked as 0.
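The thresholding rule above can be sketched as follows. The function name `binarize` and the sample values are hypothetical.

```python
def binarize(values, threshold):
    """Return 1 for values strictly above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


readings = [0.2, 0.5, 0.8, 0.5, 1.3]
print(binarize(readings, threshold=0.5))  # [0, 0, 1, 0, 1]
```

Values equal to the threshold are marked 0, matching the rule stated above.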
Standardizing:
• Standardization is also called mean removal.
• Standardization is another scaling technique in which the values are centered around the mean
(mean 0) with unit standard deviation.
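Standardization can be sketched as subtracting the mean from each value and dividing by the standard deviation, that is z = (x - mean) / std. The sketch below assumes the population standard deviation is used. The function name `standardize` and the sample data are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Center values on the mean and scale them to unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        # All values are identical; after mean removal they are all zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, population std is 2
print(standardize(data))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```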
Data Discretization :
• Data discretization is a method of translating the values of a continuous attribute into a finite set of
intervals with minimal information loss.
• The data discretization technique is used to divide continuous attributes into intervals.
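One simple way to discretize a continuous attribute is equal-width binning, shown here as an illustrative sketch. These notes do not prescribe a specific method. The function name `equal_width_bins` and the age data are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread in the data, so everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [22, 25, 31, 38, 45, 52, 60, 67]
# With 3 bins the intervals are [22, 37), [37, 52) and [52, 67].
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 1, 2, 2, 2]
```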