Data Preprocessing
Submitted by: Minha
Roll No: 21-CS-02
Submitted to: Ma’am Aziza
What is data preprocessing?
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning
model. It is the first and most crucial step in creating a machine learning model.
When working on a machine learning project, we do not always come across clean, well-formatted
data. Before performing any operation on the data, it is essential to clean it and put it into a consistent
format. This is the task that data preprocessing accomplishes.
Why do we need data preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be fed directly to a machine learning model. Data preprocessing is the set of tasks required to
clean the data and make it suitable for a machine learning model, which also increases the model's
accuracy and efficiency.
Best practices:
Data Cleaning
Data cleaning, or data cleansing, is the process of detecting and correcting errors and inconsistencies in
a dataset to enhance its quality and reliability. This involves tasks such as removing duplicate records,
handling missing values through methods like imputation, correcting inconsistent data formats or
representations, standardizing data, removing outliers, transforming data into a suitable format for
analysis, and validating the accuracy of the data. It also includes correcting typos, misspellings, or
inconsistent naming conventions to ensure uniformity. Data cleaning is a vital step in the data
preparation process, ensuring the accuracy and quality of the data for meaningful analysis and
decision-making.
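The cleaning tasks listed above can be sketched with pandas. The dataset, column names, and imputation choice below are hypothetical, chosen only to illustrate duplicate removal, standardization of inconsistent representations, and missing-value imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with duplicates, inconsistent text, and a missing value.
df = pd.DataFrame({
    "name": ["Ali", "ali ", "Sara", "Omar", "Omar"],
    "age":  [25,    25,     np.nan, 31,     31],
    "city": ["Lahore", "lahore", "Karachi", "Multan", "Multan"],
})

# Standardize inconsistent representations (stray whitespace, casing).
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Remove duplicate records that became identical after standardization.
df = df.drop_duplicates().reset_index(drop=True)

# Impute the missing age with the median (one common imputation method).
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

After standardization the two `Ali` rows and the two `Omar` rows collapse into one each, and Sara's missing age is filled with the median of the remaining ages.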
Data Reduction
Data reduction is the process of reducing the volume of data by eliminating irrelevant or redundant
information while preserving the integrity and meaningfulness of the data. The primary objective is to
obtain a smaller representation of the dataset that retains the essential characteristics of the original
data. Techniques for data reduction include dimensionality reduction, such as Principal Component
Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), data sampling methods
like random sampling and stratified sampling, data aggregation methods like averaging or summing
values, feature selection techniques, and data compression methods like wavelet transformation and
Singular Value Decomposition (SVD). Data reduction is essential for managing and analyzing large
datasets efficiently, making them more manageable and less computationally expensive.
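As one concrete instance of dimensionality reduction, PCA can be sketched in plain NumPy using the SVD mentioned above (the data here is randomly generated for illustration; scikit-learn's `PCA` implements the same idea):

```python
import numpy as np

# Hypothetical dataset: 100 samples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA via SVD: center the data, then project onto the
# top-k right singular vectors (the principal components).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

print(X_reduced.shape)  # reduced from 5 features to 2
```

The singular values in `S` come out in descending order, so the first `k` components capture the most variance, which is what makes the smaller representation retain the essential characteristics of the original data.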
Data Integration
Data integration combines the data into a single dataset and prepares the raw material for processing
by the machine learning algorithm.
In the table above, we can see that there are three variables, namely Name, Age, and Gender. We can
also see that rows #2 and #3 have been assigned the wrong gender.
We can use data cleaning here to remove the inappropriate rows, since we know this data is corrupt.
Alternatively, we can transform the data manually, correcting the wrong gender values in place.
Once the issue is fixed, the next step is to perform data reduction by sorting the records by age in
descending order.
Now the issue is fixed, and the dataset is complete and ready to be used with machine learning models
and algorithms.
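The whole workflow above can be sketched in pandas. The Name/Age/Gender values below are hypothetical stand-ins for the original table, with rows #2 and #3 assumed to carry the wrong gender:

```python
import pandas as pd

# Hypothetical version of the Name / Age / Gender table.
df = pd.DataFrame({
    "Name":   ["Alice", "Bob", "Carol", "Dave"],
    "Age":    [24, 31, 29, 35],
    "Gender": ["F", "F", "M", "M"],  # rows #2 and #3 assumed corrupted
})

# Option 1 (data cleaning): drop the corrupted rows outright.
cleaned = df.drop(index=[1, 2]).reset_index(drop=True)

# Option 2 (manual data transformation): correct the values in place.
fixed = df.copy()
fixed.loc[1, "Gender"] = "M"
fixed.loc[2, "Gender"] = "F"

# Final step from the text: sort by Age in descending order.
fixed = fixed.sort_values("Age", ascending=False).reset_index(drop=True)

print(fixed)
```

Either option yields a consistent table; the manual transformation preserves all four records, while dropping rows is simpler when the corrupted values cannot be recovered.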