Pandas 1
Data Cleaning
What is Pandas?
linkedin.com/in/ileonjose
What is Data Cleaning?
Why bother with data cleaning? Well, imagine trying to analyze sales
trends when some entries are missing, or working with a dataset that
has duplicate records throwing off your calculations.
What is Data Preprocessing?
You may be wondering, "Do data cleaning and data preprocessing
mean the same thing?" The answer is no, they do not.
The goal is to turn your dataset into a refined masterpiece, ready for
analysis or modeling.
How to Import the Necessary Libraries
Before we embark on data cleaning and preprocessing, let's import
the Pandas library.
To save time and typing, we usually import Pandas as pd. This lets us
write the shorter pd.read_csv() instead of pandas.read_csv() when
reading CSV files, which keeps our code concise and readable.
import pandas as pd
df = pd.read_csv('your_dataset.csv')
Exploratory Data Analysis (EDA)
For example:
df.head() returns the first 5 rows of the dataset. You can specify
the number of rows to display in the parentheses, for example df.head(10).
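# Display the first few rows of the dataset
print(df.head())
# Summary statistics for the numeric columns
print(df.describe())
# Column names, dtypes, and non-null counts (info() prints directly and returns None)
df.info()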
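As a minimal sketch of these exploration calls, the example below builds a small DataFrame in memory so that it runs without a CSV file. The column names and values are hypothetical sample data, not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample data, used here in place of your_dataset.csv
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E", "F"],
    "units": [10, 3, 7, 12, 5, 8],
    "price": [2.5, 4.0, 3.25, 1.75, 6.0, 2.0],
})

first_three = df.head(3)   # first 3 rows instead of the default 5
shape = df.shape           # (number of rows, number of columns)
summary = df.describe()    # count, mean, std, min, quartiles, max of numeric columns

print(first_three)
print(shape)
print(summary)
df.info()                  # prints dtypes and non-null counts directly
```

Note that describe() only summarizes the numeric columns by default, so the text column "product" does not appear in its output.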
One way to handle missing values is to remove the rows that contain them altogether.
Code snippet below:
# Count the missing values in each column
print(df.isnull().sum())
# Drop rows with missing values and store the result in a new variable, df_cleaned
df_cleaned = df.dropna()
# Fill missing values in numeric columns with the column mean and store the result in df_filled
df_filled = df.fillna(df.mean(numeric_only=True))
But if a large share of the rows have missing values, dropping them
discards too much data, and this method will be inadequate.
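One way to make that decision concrete is to measure how many rows would be lost before dropping anything. The sketch below is only an illustration: the sample data is hypothetical, and the 5% cutoff (MAX_DROP_FRACTION) is an arbitrary assumption, not a rule from pandas or from this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "region": ["north", "south", None, "east", "west"],
    "sales": [120.0, np.nan, 95.0, 130.0, np.nan],
})

# Count rows that have at least one missing value
rows_with_missing = df.isnull().any(axis=1).sum()
fraction_missing = rows_with_missing / len(df)
print(f"{fraction_missing:.0%} of rows contain a missing value")

MAX_DROP_FRACTION = 0.05  # illustrative threshold, adjust for your data

if fraction_missing <= MAX_DROP_FRACTION:
    # Few rows affected: dropping them loses little information
    df_result = df.dropna()
else:
    # Many rows affected: fill numeric columns with their mean instead
    df_result = df.fillna(df.mean(numeric_only=True))
```

Here the mean only fills the numeric "sales" column; the missing "region" value is left as-is and needs a separate decision.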
For numerical data, you can compute the column mean and use it to fill in
(impute) the rows that have missing values. Code snippet below:
# Replace missing values in numeric columns with the mean of each column
df.fillna(df.mean(numeric_only=True), inplace=True)
# To replace missing values in a specific column, assign the filled column back:
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# Now the numeric columns of df contain no missing values; NaNs have been replaced with the column mean
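Mean imputation only applies to numeric columns, so text columns need their own rule. The sketch below uses hypothetical sample data; filling the text column with its most frequent value (the mode) is one common choice and an assumption of this example, not something prescribed above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with numeric and text columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": ["Lagos", None, "Abuja", "Lagos"],
})

# Fill every numeric column with its own mean
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill the text column with its most frequent value (assumed strategy)
most_common_city = df["city"].mode().iloc[0]
df["city"] = df["city"].fillna(most_common_city)

print(df.isnull().sum())
```

Assigning the filled columns back, rather than calling fillna with inplace=True on a single column, avoids the chained-assignment pitfalls in recent pandas versions.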
Duplicate rows can be identified and removed. Code snippet below:
#Identify duplicates
print(df.duplicated().sum())
#Remove duplicates
df_no_duplicates = df.drop_duplicates()
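By default, a row counts as a duplicate only when every column matches an earlier row. The sketch below, using hypothetical order data, contrasts that with treating rows as duplicates when a key column repeats, via the subset and keep parameters of drop_duplicates().

```python
import pandas as pd

# Hypothetical sample data: order 2 is repeated exactly,
# order 3 appears twice with different amounts
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3],
    "customer": ["Ann", "Ben", "Ben", "Cara", "Cara"],
    "amount": [20.0, 35.0, 35.0, 15.0, 18.0],
})

# Rows that match an earlier row in every column
exact_dupes = df.duplicated().sum()
df_no_exact = df.drop_duplicates()

# Treat rows with the same order_id as duplicates and keep the last one seen
df_unique_orders = df.drop_duplicates(subset=["order_id"], keep="last")

print(exact_dupes)
print(df_unique_orders)
```

Whether to keep the first or last occurrence depends on which record you trust; keep="last" is used here only for illustration.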
Data from various sources is usually messy, and some values may be
stored with the wrong data type. For example, numerical values may
arrive as 'float' or 'string' instead of 'integer'. Mixing these
types leads to errors and wrong results.
You can convert a column of type int to float with the following code:
df['Column1'] = df['Column1'].astype(float)
print(df.dtypes)
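When numbers arrive as strings, astype(float) raises an error if any value cannot be parsed. A minimal sketch of a more forgiving conversion, assuming a hypothetical column that contains one unparseable entry, uses pd.to_numeric with errors="coerce":

```python
import pandas as pd

# Hypothetical sample data: numbers stored as strings, with one bad entry
df = pd.DataFrame({"Column1": ["10", "20", "n/a", "40"]})

# Convert to numbers; values that cannot be parsed become NaN
df["Column1"] = pd.to_numeric(df["Column1"], errors="coerce")

# Count how many values failed to convert
n_bad = df["Column1"].isnull().sum()

print(df.dtypes)
print(n_bad)
```

The coerced NaN values can then be handled like any other missing values.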
There are several methods to identify and remove outliers. One common approach uses the interquartile range (IQR):
# Use the quartiles and the IQR to identify outliers, then keep only the rows inside the bounds
Q1 = df["column_name"].quantile(0.25)
Q3 = df["column_name"].quantile(0.75)
IQR = Q3 - Q1
# Conventional fences: 1.5 * IQR below Q1 and above Q3
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[df["column_name"].between(lower_bound, upper_bound)]
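The IQR filter can be wrapped in a small reusable function. This is a sketch under stated assumptions: remove_outliers_iqr is a hypothetical helper name, the multiplier k=1.5 is the conventional default rather than a fixed rule, and the sample values are made up.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Return the rows whose value in `column` lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower_bound, upper_bound)]

# Hypothetical sample data with one obvious outlier (100)
df = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 11, 100]})
df_trimmed = remove_outliers_iqr(df, "column_name")

print(len(df), len(df_trimmed))
```

A larger k keeps more extreme values; it is worth inspecting the removed rows before discarding them, since an outlier is not always an error.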