Data Preprocessing in Python Pandas (With Code)
1. Introduction
Definition of data pre-processing
Data preprocessing is the process of preparing data for analysis by cleaning, transforming, and
selecting relevant features. It involves identifying and handling missing or duplicate data, scaling
features, encoding categorical data, reducing dimensionality, and splitting data into training and
testing sets.
Proper data preprocessing helps to ensure data accuracy and consistency and leads to more
accurate and reliable results.
Python is a popular programming language used in data analysis and machine learning. It offers a
wide range of libraries and tools that can be used for data preprocessing tasks, such as data cleaning,
feature scaling, encoding categorical data, and reducing dimensionality.
Some of the popular libraries used for data preprocessing in Python include NumPy, Pandas,
Scikit-learn, and Matplotlib. These libraries provide various functions and methods that make it
easier to perform data preprocessing tasks efficiently and effectively.
Data preprocessing is an essential step in data analysis and machine learning. It helps to ensure
data accuracy, consistency, and suitability for downstream analysis. Some of the reasons why data
preprocessing is important are:
1. Improves data accuracy: By identifying and handling missing or duplicate data, data scientists
can improve data accuracy, reducing the risk of errors and inaccuracies in the results.
2. Handles outliers: Outliers can skew the results of data analysis or machine learning
models. Techniques such as capping extreme values (winsorization) or robust scaling can
limit their influence. Plain normalization or standardization rescales outliers but does not
remove them.
3. Enables feature scaling: Scaling features is an important step in data preprocessing that
helps to ensure that all features are on a comparable scale. This is important for machine
learning algorithms that are sensitive to the scale of features, such as distance-based methods.
4. Encodes categorical data: Many machine learning algorithms cannot handle categorical
data directly. Therefore, data preprocessing techniques such as one-hot encoding or label
encoding can be used to convert categorical data into numerical data that can be used in
machine learning models.
5. Reduces dimensionality: Data preprocessing techniques such as principal component
analysis (PCA) can be used to reduce the dimensionality of data, making it easier to
analyze or model.
Overall, data preprocessing is critical to ensure that data is suitable for analysis and to obtain
reliable and accurate results. It helps to eliminate errors and inaccuracies, improve the
performance of machine learning models, and support sound decisions based on the data.
Jignesh Sanghvi
Python Data Processing (with code)
Data cleaning involves various techniques that can be used to identify and handle missing or
erroneous data. Some of the techniques used in data cleaning are:
1. Removing duplicates: Duplicates can skew the results of data analysis or machine learning
models. Removing duplicates can improve the accuracy of results and reduce the risk of
errors.
2. Handling missing data: Missing data can be handled using various techniques, such as
deleting the affected rows or columns, or imputing the missing values with a statistic such as
the mean or median.
3. Handling outliers: Outliers can be handled in a similar way to missing data. Various techniques,
such as winsorization (capping extreme values) or replacing outliers with missing values that are
then imputed, can be used to handle outliers.
4. Standardizing or normalizing data: Standardizing or normalizing data involves scaling the
data to a common scale. This is important for some machine learning algorithms that are
sensitive to the scale of features.
5. Encoding categorical data: Categorical data can be encoded into numerical data using
techniques such as one-hot encoding or label encoding. This is important for some machine
learning algorithms that cannot handle categorical data.
6. Feature selection: Feature selection involves selecting relevant features for analysis or
modeling. This is important for reducing dimensionality and improving the
performance of machine learning models.
7. Handling data errors: Data errors, such as data entry errors or formatting errors, can be
handled using various techniques, such as data validation or data profiling.
Overall, data cleaning is an important step in data preprocessing that ensures data accuracy,
consistency, and suitability for downstream analysis. It involves various techniques that can be
used to identify and handle missing or erroneous data and improve the performance of machine
learning models.
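The outlier and data-error techniques above can be sketched in a few lines. This is a minimal illustration, not a definitive recipe: the `ages` values, the plausible range of 0 to 120, and the 5th/95th percentile caps are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical ages with one data-entry error (-3) and one extreme value (400).
ages = pd.Series([23, 31, 35, 29, -3, 41, 400], name="age", dtype="float64")

# Data validation: values outside an assumed plausible range (0 to 120)
# are treated as errors and replaced with missing values (NaN).
valid = ages.where(ages.between(0, 120))

# Winsorization: cap values below the 5th or above the 95th percentile
# instead of discarding them.
lower, upper = ages.quantile([0.05, 0.95])
winsorized = ages.clip(lower=lower, upper=upper)

print(valid)
print(winsorized)
```

Validation turns implausible values into missing data that can be imputed later, while winsorization keeps every row but limits how far extreme values can pull summary statistics.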
Data transformation is the process of converting data from one format or structure to another. It
is an important step in data preprocessing that can help to improve the quality of data and make
it more suitable for analysis or modeling. Some of the techniques used in data transformation are:
1. Scaling: Scaling involves rescaling the data to a common scale, such as between 0 and 1 or -1
and 1. This is important for some machine learning algorithms that are sensitive to the scale of
features.
2. Normalization: Here, normalization means transforming the data so that its distribution is
closer to normal, for example with a log or power transform. This is important for some
statistical analyses and machine learning algorithms that assume a normal distribution.
3. Aggregation: Aggregation involves combining multiple data points into a single data point.
This can be useful for summarizing data and reducing dimensionality.
4. Discretization: Discretization involves converting continuous data into categorical data. This
can be useful for some machine learning algorithms that cannot handle continuous data.
5. Encoding: Encoding involves converting categorical data into numerical data. This is
important for some machine learning algorithms that cannot handle categorical data.
6. Feature engineering: Feature engineering involves creating new features from existing
features. This can be useful for improving the performance of machine learning models.
Overall, data transformation is an important step in data preprocessing that can help to
improve the quality of data and make it more suitable for downstream analysis or modeling.
It involves various techniques that can be used to rescale, normalize, aggregate, discretize,
encode, or engineer features.
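As a rough sketch of the scaling, discretization, and aggregation techniques listed above, the snippet below uses plain pandas arithmetic. The `region` and `sales` columns and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})
s = df["sales"]

# Scaling to the range [0, 1] (min-max scaling).
df["sales_minmax"] = (s - s.min()) / (s.max() - s.min())

# Standardization: zero mean and unit variance (z-score).
df["sales_z"] = (s - s.mean()) / s.std(ddof=0)

# Discretization: two equal-width bins with readable labels.
df["sales_band"] = pd.cut(s, bins=2, labels=["low", "high"])

# Aggregation: one total per region.
totals = df.groupby("region")["sales"].sum()

print(df)
print(totals)
```

Scikit-learn offers equivalent scalers; the arithmetic is written out here only to make each transformation explicit.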
Data selection is the process of selecting a subset of data from a larger dataset based on certain
criteria. It is an important step in data preprocessing that can help to reduce the size of the dataset
and focus on relevant data for analysis or modeling. Common techniques include:
1. Random sampling: Random sampling involves selecting a random subset of data from
the larger dataset. This is useful when the dataset is too large to process as a whole and
a representative sample is needed.
2. Stratified sampling: Stratified sampling involves dividing the dataset into subgroups based
on a specific variable and then selecting a random sample from each subgroup. This is
useful when the variable is important for analysis or modeling.
3. Feature selection: Feature selection involves selecting a subset of features from the dataset
based on their relevance to the analysis or modeling task. This is useful for reducing the
dimensionality of the dataset and improving the performance of the model.
4. Instance selection: Instance selection involves selecting a subset of instances from the
dataset based on their relevance to the analysis or modeling task. This is useful for reducing
the size of the dataset and focusing on relevant data.
Overall, data selection is an important step in data preprocessing that can help to reduce the size
of the dataset and focus on relevant data for analysis or modeling. There are several techniques
that can be used for data selection, including random sampling, stratified sampling, feature
selection, and instance selection. The choice of technique will depend on the specific needs of the
analysis or modeling task.
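The four selection techniques can be sketched with pandas alone. The column names, the 50% sampling fraction, and the filter threshold are assumptions for this example; `groupby(...).sample()` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical dataset: group "a" has 6 rows, group "b" has 4.
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 4,
    "value": range(10),
    "noise": [0] * 10,
})

# Random sampling: half of all rows, reproducible via random_state.
random_subset = df.sample(frac=0.5, random_state=0)

# Stratified sampling: half of the rows from each group.
stratified = df.groupby("group").sample(frac=0.5, random_state=0)

# Feature selection: keep only the columns judged relevant.
features = df[["group", "value"]]

# Instance selection: keep only the rows that meet a criterion.
selected_rows = df[df["value"] >= 5]

print(stratified["group"].value_counts())
```

Stratified sampling keeps the 6:4 ratio between the groups, which a simple random sample of the same size is not guaranteed to do.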
Pandas provides two primary data structures for storing and manipulating data: Series and DataFrame. A Series
is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional
table-like object consisting of rows and columns. Its main capabilities include:
1. Data cleaning and transformation: Pandas provides tools for cleaning and transforming data,
including methods for handling missing data, removing duplicates, and replacing values.
2. Data aggregation: Pandas can group data based on one or more variables and perform aggregate
operations on each group, such as sum, mean, and count.
3. Data merging and joining: Pandas can merge multiple datasets based on common columns or
indices, or join two datasets based on a common key.
4. Time series analysis: Pandas provides functionality for working with time series data, including resampling,
moving window statistics, and time zone handling.
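To make the two data structures and the time series features concrete, here is a small sketch. The dates, the `units` values, and the constant `price` are invented for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (here indexed by date).
idx = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx, name="units")

# A DataFrame: a table whose columns are Series sharing one index.
df = pd.DataFrame({"units": daily, "price": 2.0})

# Resampling: total units per 2-day period.
two_day = daily.resample("2D").sum()

# Moving window statistic: 3-day rolling mean
# (the first two entries are NaN because the window is not yet full).
rolling = daily.rolling(window=3).mean()

print(df.shape)
print(two_day)
print(rolling)
```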
One common task in data preprocessing is cleaning the data, which involves handling missing values,
removing duplicates, and correcting errors. Pandas provides several methods for cleaning data,
including:
1. Handling missing data: Pandas provides methods for filling in missing data or dropping missing data points.
For example, the dropna() method drops any rows or columns that contain missing data, while the fillna()
method fills in missing data with a specified value.
2. Removing duplicates: Pandas provides a drop_duplicates() method that removes duplicate rows
from a DataFrame.
3. Correcting errors: Pandas provides methods for replacing or removing incorrect values. For example,
the replace() method can be used to replace specific values with new values.
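The cleaning methods above can be combined as in the sketch below. The toy `city` and `temp` data, the "N/A" placeholder string, and the fill values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a duplicated row, a placeholder string, and a missing value.
df = pd.DataFrame({
    "city": ["Pune", "Pune", "N/A", "Goa"],
    "temp": [31.0, 31.0, 28.0, np.nan],
})

# Remove exact duplicate rows.
deduped = df.drop_duplicates()

# Correct errors: treat the "N/A" placeholder as a real missing value.
fixed = deduped.replace({"city": {"N/A": np.nan}})

# Option 1: drop every row that has a missing value.
dropped = fixed.dropna()

# Option 2: fill missing values column by column.
filled = fixed.fillna({"city": "unknown", "temp": fixed["temp"].mean()})

print(dropped)
print(filled)
```

Each call returns a new DataFrame rather than modifying `df`, so the intermediate results stay available for comparison.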
Another important task in data preprocessing is transforming the data to make it more suitable for analysis
or modeling. Pandas provides several methods for transforming data, including:
1. Filtering data: Pandas provides ways to select specific rows or columns based on criteria such as a
specific value, a range of values, or a boolean expression. For example, the loc[] indexer selects
rows and columns by label (or by boolean mask), while the iloc[] indexer selects rows and columns by
integer position.
2. Sorting data: Pandas provides a sort_values() method for sorting a DataFrame by one or more columns,
and a sort_index() method for sorting by the index.
3. Grouping data: Pandas provides a groupby() method for grouping a DataFrame by one or more variables
and performing aggregate operations on each group, such as sum, mean, and count.
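A short sketch of filtering, sorting, and grouping follows. The `dept` and `salary` columns and the row labels are hypothetical.

```python
import pandas as pd

# Hypothetical staff table with string row labels.
df = pd.DataFrame(
    {"dept": ["ops", "dev", "dev", "ops"], "salary": [50, 80, 70, 60]},
    index=["w", "x", "y", "z"],
)

# Filtering with loc[]: a boolean mask for rows, labels for columns.
high = df.loc[df["salary"] > 55, ["dept", "salary"]]

# Filtering with iloc[]: the first two rows by integer position.
first_two = df.iloc[:2]

# Sorting by a column, highest salary first.
ranked = df.sort_values("salary", ascending=False)

# Grouping: several aggregates per department.
summary = df.groupby("dept")["salary"].agg(["sum", "mean", "count"])

print(high)
print(ranked)
print(summary)
```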
When working with multiple datasets, it is often necessary to merge or join them together based on a common
column or key. Pandas provides several methods for merging and joining data, including merge(), join(),
and concat().
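As an illustration of combining datasets, the sketch below uses `merge()`, `join()`, and `pd.concat()`. The `orders` and `customers` tables and the `customer_id` key are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a "customer_id" key.
orders = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [20.0, 35.0, 15.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})

# Inner merge: keep only keys present in both tables.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left merge: keep every order; unmatched customers get a missing name.
left = orders.merge(customers, on="customer_id", how="left")

# join() aligns on the index, so the key is moved into the index first.
joined = orders.set_index("customer_id").join(customers.set_index("customer_id"))

# concat() stacks tables that have the same columns.
stacked = pd.concat([orders, orders], ignore_index=True)

print(inner)
print(left)
```

The `how` argument controls which keys survive the merge; a left merge is a common choice when no rows from the primary table should be lost.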
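import pandas as pd

# Load the dataset ('filename.csv' is a placeholder path)
df = pd.read_csv('filename.csv')

# Inspect the size, the column types, and the missing values per column
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())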
1. Dropping columns
2. Renaming columns
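No code accompanies these two steps in the text, so the following is only a plausible sketch using `DataFrame.drop()` and `DataFrame.rename()`. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical table with an unneeded column and inconsistent names.
df = pd.DataFrame({"ID": [1, 2], "Full Name": ["Asha", "Ben"], "tmp": [0, 0]})

# 1. Dropping columns that are not needed.
df = df.drop(columns=["tmp"])

# 2. Renaming columns to consistent, lowercase names.
df = df.rename(columns={"ID": "id", "Full Name": "full_name"})

print(df.columns.tolist())
```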
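# Convert a column to float
df['column'] = df['column'].astype('float')
# Drop rows that contain any missing value
df.dropna(inplace=True)
# Or fill missing numeric values with each column's median
df.fillna(df.median(numeric_only=True), inplace=True)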
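# Fill missing numeric values with each column's mean
df.fillna(df.mean(numeric_only=True), inplace=True)
# Or fill all missing values with a constant
df.fillna(0, inplace=True)
# One-hot encode a categorical column (this replaces the original column)
df = pd.get_dummies(df, columns=['categorical_column'])
# Alternative to one-hot encoding: label-encode the original column instead
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['encoded_column'] = encoder.fit_transform(df['categorical_column'])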
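# Bin a numeric column into five equal-width, labeled intervals
df['binned_column'] = pd.cut(
    df['numerical_column'],
    bins=5,
    labels=['very_low', 'low', 'medium', 'high', 'very_high'])
# Parse a column of date strings into datetime values
df['datetime_column'] = pd.to_datetime(df['datetime_column'])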
# Extract date parts into separate columns
df['year'] = df['datetime_column'].dt.year
df['month'] = df['datetime_column'].dt.month
df['day'] = df['datetime_column'].dt.day
# Convert a text column to lowercase
df['text_column'] = df['text_column'].str.lower()
import string
# Remove punctuation characters from a text column
df['text_column'] = df['text_column'].str.translate(
    str.maketrans('', '', string.punctuation))