
Cleaning Data in Python: Putting It All Together

This document discusses techniques for cleaning data in Python. It introduces loading data using Pandas, performing quality checks by viewing the data, combining multiple datasets, and cleaning data through functions and expressions. The goal is to tidy the data into a clean, unified format ready for analysis. Examples demonstrate loading, viewing, checking, and transforming Gapminder health and demographic data into a tidy format.

Uploaded by

NourheneMbarek

CLEANING DATA IN PYTHON

Putting it all together


● Use the techniques you’ve learned on Gapminder data
● Clean and tidy data saved to a file
● Ready to be loaded for analysis!
● Dataset consists of life expectancy by country and year
● Data will come in multiple parts
● Load
● Preliminary quality diagnosis
● Combine into single dataset

Useful methods
In [1]: import pandas as pd

In [2]: df = pd.read_csv('my_data.csv')

In [3]: df.head()

In [4]: df.info()

In [5]: df.columns

In [6]: df.describe()

In [7]: df.column.value_counts()

In [8]: df.column.plot(kind='hist')
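
The inspection calls above can be tried on a small in-memory frame. This is a minimal sketch: the column names and values are made up for illustration and are not real Gapminder figures, and the plot call is left out so the example needs only pandas.

```python
import pandas as pd

# Hypothetical stand-in for the contents of 'my_data.csv' (made-up values)
df = pd.DataFrame({
    "country": ["Chad", "Peru", "Chad", "Fiji"],
    "year": [1800, 1800, 1801, 1800],
    "life_expectancy": [30.0, 35.0, 31.0, 26.0],
})

first_rows = df.head(2)                         # first two rows
column_names = list(df.columns)                 # column labels
summary = df.describe()                         # statistics for numeric columns
country_counts = df["country"].value_counts()   # frequency of each value
df.info()                                       # prints dtypes and non-null counts
```

Here `value_counts()` is a quick way to spot duplicated or misspelled category values, and `describe()` helps flag implausible minimum or maximum values.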

Data quality
In [9]: def cleaning_function(row_data):
   ...:     # data cleaning steps
   ...:     return ...

In [10]: df.apply(cleaning_function, axis=1)

In [11]: assert (df.column_data > 0).all()
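
One possible shape for the row-wise cleaning function and the assert check is sketched below. The rule used here (replace non-positive life expectancy with NaN) and the column names are assumptions chosen for illustration, not steps taken from the slides.

```python
import numpy as np
import pandas as pd

# Made-up values; -1.0 stands in for a bad entry
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "life_expectancy": [45.2, -1.0, 61.8],
})

def cleaning_function(row_data):
    """Return the row's life expectancy, or NaN when it is not positive."""
    value = row_data["life_expectancy"]
    return value if value > 0 else np.nan

# axis=1 passes each row to the function as a Series
df["life_expectancy"] = df.apply(cleaning_function, axis=1)

# After cleaning, every remaining (non-missing) value should be positive
valid = df["life_expectancy"].dropna()
assert (valid > 0).all()
```

The assert raises an `AssertionError` if any value violates the condition, so a bad record stops the script instead of passing silently into the analysis.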



Combining data
● pd.merge(df1, df2, …)
● pd.concat([df1, df2, df3, …])
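
As a rough illustration of the difference between the two calls: `pd.concat` stacks frames that share the same columns, while `pd.merge` joins frames on a key column. The frames and the `region` lookup table below are hypothetical.

```python
import pandas as pd

# Two parts of the same dataset with identical columns (made-up values)
part_1 = pd.DataFrame({"country": ["A", "B"], "year": [1800, 1800],
                       "life_expectancy": [30.0, 35.0]})
part_2 = pd.DataFrame({"country": ["A", "B"], "year": [1900, 1900],
                       "life_expectancy": [40.0, 45.0]})

# Stack the parts row-wise and rebuild a clean 0..n-1 index
stacked = pd.concat([part_1, part_2], ignore_index=True)

# Hypothetical lookup table keyed on country
regions = pd.DataFrame({"country": ["A", "B"], "region": ["North", "South"]})

# Left join keeps every row of `stacked` and adds the matching region
merged = pd.merge(stacked, regions, on="country", how="left")
```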

Let’s practice!

Initial impressions of the data

Principles of tidy data


● Rows form observations
● Columns form variables
● Tidying data will make data cleaning easier
● Melting turns columns into rows
● Pivot will take unique values from a column and create new columns
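
A small sketch of melting and pivoting, assuming a wide table with one column per year (the layout and values are made up for illustration):

```python
import pandas as pd

# Wide layout: each year is its own column (made-up values)
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1800": [30.0, 35.0],
    "1801": [31.0, 36.0],
})

# Melt: the year columns become rows, giving one observation per row
tidy = pd.melt(wide, id_vars="country",
               var_name="year", value_name="life_expectancy")

# Pivot: unique values of 'year' become columns again
back_to_wide = tidy.pivot(index="country", columns="year",
                          values="life_expectancy")
```

After melting, `year` holds strings such as `'1800'` because the values came from column labels, so a type conversion is usually the next step.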

Checking data types


In [1]: df.dtypes

In [2]: df['column'] = pd.to_numeric(df['column'])

In [3]: df['column'] = df['column'].astype(str)
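
The conversions above might look like this on a column containing a non-numeric entry. The `errors='coerce'` option is an addition not shown on the slide: it turns unparseable values into NaN instead of raising. Column names and values are made up.

```python
import pandas as pd

# 'year' was read as text and contains one unparseable entry
df = pd.DataFrame({"year": ["1800", "1801", "unknown"], "code": [1, 2, 3]})

# pd.to_numeric is a top-level pandas function, not a Series method
df["year"] = pd.to_numeric(df["year"], errors="coerce")

# astype(str) converts each value to its string form
df["code"] = df["code"].astype(str)

print(df.dtypes)
```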



Additional calculations and saving your data


In [4]: df['new_column'] = df['column_1'] + df['column_2']

In [5]: df['new_column'] = df.apply(my_function, axis=1)

In [6]: df.to_csv('my_data.csv')
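
A sketch of the same steps end to end. Here `my_function` is a hypothetical row-wise function, and the CSV is written to an in-memory buffer rather than to `my_data.csv` so the example leaves no file behind.

```python
import io

import pandas as pd

df = pd.DataFrame({"column_1": [1, 2], "column_2": [10, 20]})

# Vectorized arithmetic on whole columns
df["new_column"] = df["column_1"] + df["column_2"]

def my_function(row):
    """Hypothetical row-wise calculation: product of the two columns."""
    return row["column_1"] * row["column_2"]

df["product"] = df.apply(my_function, axis=1)

# Write without the index, then read back to check the round trip
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
```

Passing `index=False` avoids writing the row index as an extra unnamed column, which is usually what is wanted when the index carries no information.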

Let’s practice!

Final thoughts

You’ve learned how to…


● Load and view data in pandas
● Visually inspect data for errors and potential problems
● Tidy data for analysis and reshape it
● Combine datasets
● Clean data by using regular expressions and functions
● Test your data and be proactive in finding potential errors
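
As one illustration of the last two points, a regular expression can flag values that do not look like valid country names. The pattern below (letters, periods, and whitespace only) is an assumed rule for this sketch, not one given in the slides.

```python
import pandas as pd

countries = pd.Series(["Chad", "St. Lucia", "Peru2", "Fiji "])

# Remove stray leading/trailing whitespace before validating
countries = countries.str.strip()

# Assumed rule: a valid name contains only letters, periods and spaces
is_valid = countries.str.match(r"^[A-Za-z.\s]+$")

# Rows that fail the check are candidates for manual review
invalid = countries[~is_valid]
print(invalid.tolist())
```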

Congratulations!
