Data Cleaning and Preparation

Uploaded by

Rane
Copyright
© All Rights Reserved

Data Cleaning and Preparation using Python

swatikulkarni24/
Data cleaning and preparation are crucial
steps in the data analysis process. They
involve transforming raw data into a clean,
structured format that is suitable for
analysis.

Python provides several libraries and tools
that can be used for data cleaning and
preparation tasks.

Let's explore some commonly used
techniques and libraries.
1. Importing Libraries:
Start by importing the necessary libraries,
such as pandas and NumPy, which are
widely used for data manipulation and
analysis in Python.
import pandas as pd
import numpy as np

2. Loading Data:
Load your data into a pandas DataFrame,
which provides a powerful data structure
for working with structured data.

# Replace 'data.csv' with your file path or URL
df = pd.read_csv('data.csv')
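A quick look at the result right after loading can catch problems early. The sketch below is only an illustration: it reads a small in-memory CSV (a stand-in for 'data.csv', with made-up column names) and prints the shape, the inferred data types, and the number of missing values per column.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'data.csv'
csv_text = "name,age,city\nAsha,31,Pune\nRavi,,Mumbai\nAsha,31,Pune\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type pandas inferred for each column
print(df.isna().sum())  # count of missing values per column
```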

3. Removing Duplicates:
Duplicates can skew analysis results, so it's
important to identify and remove them if
necessary.

df.drop_duplicates(inplace=True)
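Before dropping anything, it can help to count the duplicates and decide which columns define one. The following is a minimal sketch on a made-up DataFrame (the column names are assumptions), using the subset and keep parameters of drop_duplicates.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Pune", "Mumbai", "Mumbai", "Delhi"],
})

n_dupes = df.duplicated().sum()  # number of fully duplicated rows
deduped = df.drop_duplicates()   # keeps the first occurrence by default

# Treat rows as duplicates when only the selected columns match,
# and keep the last occurrence instead of the first
deduped_by_id = df.drop_duplicates(subset=["customer_id"], keep="last")
```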

4. Handling Missing Values:
Missing values are common in datasets and
can cause issues during analysis. You can
handle missing values in various ways, such
as dropping rows or columns with missing
values, imputing missing values with mean
or median, or using more sophisticated
techniques.
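The options below are alternatives. Pick the
one that suits each column rather than
running them all in sequence.

# Option 1: drop rows that contain any missing value
df = df.dropna()

# Option 2: drop columns that contain any missing value
df = df.dropna(axis=1)

# Option 3: impute missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Option 4: interpolate missing values in a numeric column (linear by default)
df['column'] = df['column'].interpolate()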
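Imputation often differs by column type. As a rough sketch on made-up data (the column names and the choice of median and mode are assumptions, not a rule), a numeric column can be filled with its median and a text column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", None, "Delhi", "Pune"],
})

missing_per_column = df.isna().sum()  # inspect before imputing

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
df["age"] = df["age"].fillna(df["age"].median())
# Text column: fill with the most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])
```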
5. Handling Outliers:
Outliers are extreme values that deviate
significantly from the majority of the data.
Depending on your analysis, you may
choose to remove or transform outliers.
# First quartile
Q1 = df['column'].quantile(0.25)
# Third quartile
Q3 = df['column'].quantile(0.75)
# Interquartile range
IQR = Q3 - Q1
# Remove outliers: keep rows within 1.5 * IQR of the quartiles
df = df[~((df['column'] < (Q1 - 1.5 * IQR)) |
          (df['column'] > (Q3 + 1.5 * IQR)))]

Or

# Remove outliers using z-score
# (assumes the column has no missing values)
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
threshold = 3
df = df[z_scores < threshold]
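If you would rather transform outliers than remove them, one common option is to cap values at the IQR fences. This is a minimal sketch on made-up data (the "price" column is an assumption); it uses Series.clip so that no rows are dropped.

```python
import pandas as pd

# Hypothetical example data with one extreme value
df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 500]})

q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the IQR fences instead of dropping the rows
df["price_capped"] = df["price"].clip(lower=lower, upper=upper)
```

Capping keeps the row count unchanged, which matters when the other columns in those rows are still useful.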

6. Handling Inconsistent Data:
Deal with inconsistencies in your data, such as
inconsistent capitalization or spelling errors.
# Convert text to lowercase (assign the result back, since str.lower() returns a new Series)
df['column'] = df['column'].str.lower()
# Replace specific values
df['column'] = df['column'].replace({'old_value': 'new_value'})
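These steps usually work best in combination: normalize whitespace and case first, then map the remaining variants to one spelling. A small sketch on made-up city names (the values and the mapping are assumptions):

```python
import pandas as pd

# Hypothetical example data with inconsistent spacing, case, and naming
df = pd.DataFrame({"city": [" Pune", "pune ", "PUNE", "Bombay", "mumbai"]})

# Normalize whitespace and case first
df["city"] = df["city"].str.strip().str.lower()
# Then map known variants to one canonical spelling
df["city"] = df["city"].replace({"bombay": "mumbai"})
```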
7. Text Cleaning and Regular Expressions:
Clean text data using regular expressions
(regex) to remove special characters and
unwanted symbols, or to extract specific
patterns.
import re
# Replace non-alphabetic characters with a space (non-string values such as NaN are left unchanged)
df['text_column'] = df['text_column'].apply(
    lambda x: re.sub(r'[^a-zA-Z]', ' ', x) if isinstance(x, str) else x)
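For extracting a specific pattern, pandas' vectorized string methods are an alternative to apply with re.sub, and they skip missing values on their own. The sketch below uses made-up text (the "#digits" order-number format is an assumption) with str.extract and str.replace.

```python
import pandas as pd

# Hypothetical example data, including a missing value
df = pd.DataFrame({"text_column": ["Order #123 shipped!", "Refund for order #45", None]})

# Pull out the digits that follow a '#'; rows without a match stay missing
df["order_id"] = df["text_column"].str.extract(r"#(\d+)", expand=False)

# Replace each run of non-letters with a single space, then trim the ends
df["clean_text"] = (
    df["text_column"]
    .str.replace(r"[^a-zA-Z]+", " ", regex=True)
    .str.strip()
)
```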

8. Correcting Data Types:
Ensure that columns have the correct data
types for analysis.
# Convert a column to an integer type
# (raises an error if the column contains missing or non-numeric values)
df['column'] = df['column'].astype('int')
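When a column mixes numbers with missing or unparseable entries, a plain integer conversion fails. One possible approach, sketched here on made-up values, is pd.to_numeric with errors="coerce" followed by pandas' nullable "Int64" type.

```python
import pandas as pd

# Hypothetical example data: numbers stored as text, plus bad entries
df = pd.DataFrame({"column": ["10", "20", "n/a", None]})

# Values that cannot be parsed become NaN instead of raising an error
numeric = pd.to_numeric(df["column"], errors="coerce")

# 'Int64' (capital I) is the nullable integer type, which can hold missing values
df["column"] = numeric.astype("Int64")
```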

9. Handling Date and Time:
If your data includes date or time
information, convert them to the appropriate
data types and extract useful features.

# Convert to DateTime format
df['date_column'] = pd.to_datetime(df['date_column'])
# Extract year
df['year'] = df['date_column'].dt.year
# Extract month
df['month'] = df['date_column'].dt.month
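Real date columns often contain invalid entries. As a sketch on made-up dates (the format string and the derived feature names are assumptions), errors="coerce" turns unparseable values into NaT, and the .dt accessor provides further features such as the weekday.

```python
import pandas as pd

# Hypothetical example data; February 30 is not a valid date
df = pd.DataFrame({"date_column": ["2023-01-15", "2023-02-30", "2023-03-06"]})

# Invalid dates become NaT instead of raising an error
df["date_column"] = pd.to_datetime(
    df["date_column"], format="%Y-%m-%d", errors="coerce"
)

df["day_of_week"] = df["date_column"].dt.day_name()
df["is_weekend"] = df["date_column"].dt.dayofweek >= 5  # Monday=0, Sunday=6
```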

10. Feature Engineering:
Feature engineering involves creating new
features or modifying existing ones to
improve the predictive power of the
dataset.
# Create a new feature
df['new_feature'] = df['feature1'] + df['feature2']

# Bin numerical values into three equal-width categories
df['category'] = pd.cut(df['numerical_feature'],
                        bins=3, labels=['low', 'medium', 'high'])
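Equal-width bins can end up very uneven when the data is skewed. The sketch below, on made-up numbers, compares pd.cut with pd.qcut, which splits by quantiles so each bin gets roughly the same number of rows. Which one fits depends on your data.

```python
import pandas as pd

# Hypothetical skewed data: five small values and one large one
df = pd.DataFrame({"numerical_feature": [1, 2, 3, 4, 5, 100]})
labels = ["low", "medium", "high"]

# Equal-width bins: the range 1 to 100 is split into three equal intervals
df["equal_width"] = pd.cut(df["numerical_feature"], bins=3, labels=labels)

# Quantile bins: each bin receives roughly the same number of rows
df["equal_count"] = pd.qcut(df["numerical_feature"], q=3, labels=labels)
```

Here equal-width binning puts every value except 100 into 'low' and leaves 'medium' empty, while quantile binning spreads the rows evenly.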

These are just some common techniques
for data cleaning in Python. The specific
steps you need to perform may vary
depending on your dataset and the cleaning
requirements.

Follow me for more such content:
https://www.linkedin.com/in/swatikulkarni24/