Data Cleaning & Preparation

This document discusses data cleaning and preparation in pandas. It explains that data cleaning involves dealing with incorrect, missing, or duplicate values in a dataset to improve data quality. It then provides examples of different pandas functions for loading and exploring a dataset, finding and handling missing values, dropping columns, cleaning strings, converting data types, renaming columns, and more. The goal is to clean the data and make it suitable for building machine learning models.

Uploaded by

Nisha R S
Copyright
© All Rights Reserved

Ex.No. : Data Cleaning and Preparation
Date :
Aim :

To implement data cleaning and data preparation operations on a dataset.

Description:

Data preparation involves data collection and data cleaning. When data is collected from
multiple sources, some of it may be incorrect, mislabeled, or even duplicated. Such data
leads to unreliable machine learning models and wrong outcomes. Hence, it is important to
clean the data and get it into a usable form beforehand. In this article, we cover the
concept of data cleaning using pandas.

What is Data Cleaning?

Data cleaning is the process of dealing with messy, disordered data and eliminating incorrect,
missing, or duplicated values in your dataset. It improves the quality and accuracy of the data
being fed to the algorithms that will solve your data science problem.

Dataset:
https://www.kaggle.com/datasets/andrewmvd/occupation-salary-and-likelihood-of-automation

• Load the dataset and display the first 5 rows [head() function]
• Get information about the dataset [info() function]
• Find duplicated rows in a DataFrame [duplicated() function]
• Drop duplicate rows in a DataFrame [drop_duplicates() function]
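The four steps above can be sketched as follows. Because the Kaggle file is not bundled here, the example reads a small in-memory CSV instead; the column names "Occupation" and "Salary" and all values are hypothetical placeholders, not the real dataset's schema.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file; the real
# dataset's columns and values will differ.
csv_text = """Occupation,Salary
Clerk,32000
Engineer,85000
Clerk,32000
Nurse,61000
Analyst,70000
Driver,38000
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first 5 rows by default
df.info()          # column names, dtypes and non-null counts

# duplicated() marks every row that repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask.sum())  # number of duplicate rows

# drop_duplicates() keeps the first occurrence and returns a new DataFrame.
deduped = df.drop_duplicates()
print(deduped.shape)
```

Both duplicated() and drop_duplicates() compare whole rows by default; passing subset=[...] restricts the comparison to chosen columns.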

Finding missing elements in a DataFrame


• Display the dataset as boolean values, with True for null values [isnull() function]
• Display the dataset as boolean values, with True for N/A values [isna() function, an alias of isnull()]
• Show, for each column, whether it contains any missing value [isna().any() function]
• Give the column-wise sum of the null values present in the dataset [isna().sum() function]
• Count missing values in a pandas DataFrame [isnull().sum() function]
• Count non-missing values in a pandas DataFrame [notnull().sum() function]
• Replace missing values with the mean, median, or mode [mean(), median(), mode()
functions]
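A minimal sketch of these checks on a toy DataFrame is given below. The columns and values are invented for illustration. Note that non-missing values are counted with notnull() (equivalently notna()), and that mode() returns a Series because several values can tie, so the first entry is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "Occupation": ["Clerk", "Engineer", None, "Clerk"],
    "Salary": [32000.0, np.nan, 61000.0, 45000.0],
})

null_mask = df.isnull()               # True where a value is missing (isna() is an alias)
cols_with_na = df.isna().any()        # one boolean per column
missing_per_col = df.isna().sum()     # count of missing values per column
present_per_col = df.notnull().sum()  # count of non-missing values per column

print(null_mask)
print(cols_with_na)
print(missing_per_col)
print(present_per_col)

# Numeric column: fill with the mean or the median of the observed values.
mean_filled = df["Salary"].fillna(df["Salary"].mean())
median_filled = df["Salary"].fillna(df["Salary"].median())

# Categorical column: fill with the most frequent value.
most_common = df["Occupation"].mode().iloc[0]
mode_filled = df["Occupation"].fillna(most_common)

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Whether the mean, median, or mode is the better replacement depends on the column: the median is less sensitive to outliers, and the mode is the usual choice for categorical data.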

Dropping columns in a DataFrame

• Drop a specified column in a DataFrame [drop() function]
• Drop missing data in a pandas DataFrame [dropna() function]
• Fill missing data in a pandas DataFrame [fillna() function]
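The three operations above might look like this on a small example. The "Notes" column and the fill values are assumptions made up for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Notes" is mostly empty.
df = pd.DataFrame({
    "Occupation": ["Clerk", "Engineer", "Nurse"],
    "Salary": [32000.0, np.nan, 61000.0],
    "Notes": [np.nan, np.nan, "verified"],
})

# drop() removes a column by name and returns a new DataFrame.
without_notes = df.drop(columns=["Notes"])

# dropna() removes every row that has at least one missing value.
complete_rows = df.dropna()

# subset= limits the check to the listed columns.
rows_with_salary = df.dropna(subset=["Salary"])

# fillna() accepts a dict mapping column name to replacement value.
filled = df.fillna({"Salary": 0, "Notes": "unknown"})

print(without_notes)
print(complete_rows)
print(rows_with_salary)
print(filled)
```

None of these calls modify df itself; each returns a new object, so the result has to be assigned to a name to be kept.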

Cleaning Strings in Pandas


• Remove leading and trailing white space [str.strip() function]
• Replace missing values with the preceding value (forward fill) [ffill() function; fillna(method='pad') in older pandas versions]
• Replace specified values, such as placeholders for missing data [replace({}) function]
• Extract records that match a given set of values [isin() function]
• Split strings into columns [str.split() function]
• Change string case [str.upper(), str.lower(), str.title() functions]
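The string operations listed above can be sketched on a toy DataFrame as follows. The names, cities, and the "N/A" placeholder are hypothetical. The forward fill uses ffill(), which has the same effect as fillna(method='pad'); the method= argument is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data.
df = pd.DataFrame({
    "Name": ["  alice smith ", "BOB JONES", "carol white  ", "dan brown"],
    "City": ["Chennai", np.nan, "N/A", "Madurai"],
})

# Remove leading and trailing white space, then normalise the case.
df["Name"] = df["Name"].str.strip()
df["Name"] = df["Name"].str.title()

# Turn the "N/A" placeholder into a real missing value.
df["City"] = df["City"].replace({"N/A": np.nan})

# Forward fill: each missing value takes the previous non-missing value.
df["City"] = df["City"].ffill()

# isin() builds a boolean mask used to select matching rows.
madurai = df[df["City"].isin(["Madurai"])]

# Split "First Last" into two new columns (n=1 splits at the first space only).
df[["First", "Last"]] = df["Name"].str.split(" ", n=1, expand=True)

# Case conversion returns new Series; the originals are unchanged.
upper_last = df["Last"].str.upper()
lower_first = df["First"].str.lower()

print(df)
print(madurai)
print(upper_last.tolist(), lower_first.tolist())
```

Forward filling only makes sense when row order carries meaning, for example in time-ordered records; here it is used purely to demonstrate the call.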

Convert data to different formats

• Convert all cells in the 'Date' column into dates [to_datetime() function]
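As an illustration, the conversion could be written as below. The date strings and the ISO format are assumptions; errors="coerce" is one possible choice that turns unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical 'Date' column stored as text, with one bad entry.
df = pd.DataFrame({"Date": ["2023-01-15", "2023-02-01", "not a date"]})

# Parse the strings; values that do not match the format become NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

print(df["Date"].dtype)
print(df)

# Once converted, date parts are available through the .dt accessor.
years = df["Date"].dt.year
print(years.tolist())
```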

Renaming Columns of a DataFrame

• Rename columns where clearer names make the data easier to interpret [rename() function]
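A short sketch of renaming, using made-up abbreviated column names:

```python
import pandas as pd

# Hypothetical DataFrame with terse column names.
df = pd.DataFrame({"occ": ["Clerk", "Nurse"], "sal": [32000, 61000]})

# rename() takes a mapping from old name to new name and returns a new DataFrame.
renamed = df.rename(columns={"occ": "Occupation", "sal": "Salary"})

print(renamed.columns.tolist())
```

Names that are not in the mapping are left unchanged, and the original df keeps its old column names unless the result is assigned back to it.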
