Chapter 4

The document provides an introduction to using pandas for data analysis, focusing on efficient iteration methods over DataFrames. It covers various techniques such as using .iloc, .iterrows(), and .itertuples() for iterating, as well as the advantages of vectorization and the .apply() method for performance improvement. The document emphasizes best practices for writing efficient Python code when working with pandas.

Uploaded by

gmhuahinfc

Intro to pandas DataFrame iteration
WRITING EFFICIENT PYTHON CODE

Logan Thomas
Scientific Software Technical Trainer, Enthought

pandas recap
See pandas overview in Intermediate Python

Library used for data analysis

Main data structure is the DataFrame

Tabular data with labeled rows and columns

Built on top of the NumPy array structure

Chapter Objective:
Best practice for iterating over a pandas DataFrame

Baseball stats
import pandas as pd

baseball_df = pd.read_csv('baseball_stats.csv')
print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
3  BOS     AL  2012  734  806  69  162         0
4  CHC     NL  2012  613  759  61  162         0
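
The CSV file itself is not included with these slides. As a minimal sketch, assuming only the five rows printed above, a stand-in DataFrame can be built by hand so the later examples can be tried without baseball_stats.csv. The real dataset has more rows, so any timings measured on this stand-in will not match the ones quoted on the slides.

```python
import pandas as pd

# Stand-in for baseball_stats.csv: only the five rows shown by head() above.
baseball_df = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL", "BOS", "CHC"],
    "League": ["NL", "NL", "AL", "AL", "NL"],
    "Year": [2012] * 5,
    "RS": [734, 700, 712, 734, 613],   # runs scored
    "RA": [688, 600, 705, 806, 759],   # runs allowed
    "W": [81, 94, 93, 69, 61],         # wins
    "G": [162] * 5,                    # games played
    "Playoffs": [0, 1, 1, 0, 0],
})

print(baseball_df.head())
```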

Calculating win percentage
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

win_perc = calc_win_perc(50, 100)

print(win_perc)

0.5

Adding win percentage to DataFrame
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

Adding win percentage to DataFrame
print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs    WP
0  ARI     NL  2012  734  688  81  162         0  0.50
1  ATL     NL  2012  700  600  94  162         1  0.58
2  BAL     AL  2012  712  705  93  162         1  0.57
3  BOS     AL  2012  734  806  69  162         0  0.43
4  CHC     NL  2012  613  759  61  162         0  0.38

Iterating with .iloc
%%timeit
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

183 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
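
%%timeit is an IPython cell magic, so it only works in IPython or a Jupyter notebook. As a rough sketch of the same measurement in a plain script, the standard-library timeit module can time the .iloc loop. The helper name win_perc_with_iloc and the five-row stand-in DataFrame are assumptions made for illustration, so the number printed will be much smaller than the 183 ms reported above for the full dataset.

```python
import timeit

import numpy as np
import pandas as pd

# Five-row stand-in using the W and G values shown on the slides.
df = pd.DataFrame({"W": [81, 94, 93, 69, 61], "G": [162] * 5})


def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)


def win_perc_with_iloc(frame):
    """Hypothetical helper: the .iloc loop from the slide, wrapped in a function."""
    win_perc_list = []
    for i in range(len(frame)):
        row = frame.iloc[i]
        win_perc_list.append(calc_win_perc(row["W"], row["G"]))
    return win_perc_list


# Run the loop 10 times and report the average time per run.
n_runs = 10
elapsed = timeit.timeit(lambda: win_perc_with_iloc(df), number=n_runs)
print(f"{elapsed / n_runs * 1e3:.3f} ms per loop")

win_percs = win_perc_with_iloc(df)
```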

Iterating with .iterrows()
win_perc_list = []

for i, row in baseball_df.iterrows():

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

Iterating with .iterrows()
%%timeit
win_perc_list = []

for i, row in baseball_df.iterrows():

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

95.3 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Practice DataFrame iterating with .iterrows()

Another iterator method: .itertuples()

Team wins data
print(team_wins_df)

  Team  Year   W
0  ARI  2012  81
1  ATL  2012  94
2  BAL  2012  93
3  BOS  2012  69
4  CHC  2012  61
...

for row_tuple in team_wins_df.iterrows():
    print(row_tuple)
    print(type(row_tuple[1]))

(0, Team     ARI
Year    2012
W         81
Name: 0, dtype: object)
<class 'pandas.core.series.Series'>

(1, Team     ATL
Year    2012
W         94
Name: 1, dtype: object)
<class 'pandas.core.series.Series'>
...
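
Because .iterrows() hands back each row as a Series, pandas has to pick a single dtype that fits every value in that row (object in the output above, since Team is a string). A small sketch of the consequence, assuming a hypothetical two-column frame named wp_df built from values on the slides: an integer column comes back as floats when the row also holds a float.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column.
wp_df = pd.DataFrame({"W": [81, 94], "WP": [0.50, 0.58]})
print(wp_df.dtypes)  # W is an integer dtype, WP is a float dtype

for i, row in wp_df.iterrows():
    # The row Series is upcast to float so that it can hold both values.
    print(i, row.dtype, type(row["W"]))

first_index, first_row = next(wp_df.iterrows())
```

If the original dtypes matter, .itertuples() is usually the safer choice, since it returns the values column by column rather than packing each row into a Series.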

Iterating with .itertuples()
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

Pandas(Index=0, Team='ARI', Year=2012, W=81)
Pandas(Index=1, Team='ATL', Year=2012, W=94)
...

print(row_namedtuple.Index)

1

print(row_namedtuple.Team)

ATL
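
The namedtuple that .itertuples() returns can be adjusted through its index and name parameters. The sketch below assumes a two-row stand-in for team_wins_df, and "TeamRow" is an arbitrary label chosen for illustration.

```python
import pandas as pd

# Two-row stand-in for team_wins_df, using values from the slides.
team_wins_df = pd.DataFrame(
    {"Team": ["ARI", "ATL"], "Year": [2012, 2012], "W": [81, 94]}
)

# index=False drops the Index field; name sets the namedtuple's type name.
named_rows = list(team_wins_df.itertuples(index=False, name="TeamRow"))
print(named_rows[0])
print(named_rows[0]._fields)

# name=None returns plain tuples instead of namedtuples.
plain_rows = list(team_wins_df.itertuples(index=False, name=None))
print(plain_rows[0])
```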

Comparing methods
%%timeit
for row_tuple in team_wins_df.iterrows():
    print(row_tuple)

527 ms ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

7.48 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

for row_tuple in team_wins_df.iterrows():
    print(row_tuple[1]['Team'])

ARI
ATL
...

for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple['Team'])

TypeError: tuple indices must be integers or slices, not str

for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple.Team)

ARI
ATL
...
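
Since a namedtuple is still a tuple, square brackets only accept integer positions. As a small sketch (again assuming a stand-in for team_wins_df), the three lookups below all return the Team value: attribute access, positional access, and getattr() for the case where the column name is held in a variable.

```python
import pandas as pd

# Three-row stand-in for team_wins_df, using values from the slides.
team_wins_df = pd.DataFrame(
    {"Team": ["ARI", "ATL", "BAL"], "Year": [2012] * 3, "W": [81, 94, 93]}
)

# Attribute access, as on the slide.
teams_by_attr = [row.Team for row in team_wins_df.itertuples()]

# Positional access: position 0 is Index, so Team sits at position 1.
teams_by_pos = [row[1] for row in team_wins_df.itertuples()]

# getattr() helps when the column name is only known at run time.
column = "Team"
teams_by_getattr = [getattr(row, column) for row in team_wins_df.itertuples()]

print(teams_by_attr, teams_by_pos, teams_by_getattr)
```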

Let's keep iterating!

pandas alternative to looping

print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
3  BOS     AL  2012  734  806  69  162         0
4  CHC     NL  2012  613  759  61  162         0

def calc_run_diff(runs_scored, runs_allowed):
    run_diff = runs_scored - runs_allowed
    return run_diff

Run differentials with a loop
run_diffs_iterrows = []

for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)

baseball_df['RD'] = run_diffs_iterrows
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs   RD
0  ARI     NL  2012  734  688  81  162         0   46
1  ATL     NL  2012  700  600  94  162         1  100
2  BAL     AL  2012  712  705  93  162         1    7
...

pandas .apply() method
Takes a function and applies it to a DataFrame

Specify the axis to apply along (0, the default, passes each column to the function; 1 passes each row)

Can be used with anonymous functions (lambda functions)

Example:

baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
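
To make the axis argument concrete, here is a minimal sketch on a three-row stand-in containing only the RS and RA columns (the frame name runs_df is an assumption). With axis=0 the function receives one column at a time; with axis=1 it receives one row at a time.

```python
import pandas as pd

# Stand-in with the RS and RA values of the first three rows on the slides.
runs_df = pd.DataFrame({"RS": [734, 700, 712], "RA": [688, 600, 705]})

# axis=0: the lambda is called once per column and sees the whole column.
column_totals = runs_df.apply(lambda col: col.sum(), axis=0)

# axis=1: the lambda is called once per row and sees that row's values.
row_diffs = runs_df.apply(lambda row: row["RS"] - row["RA"], axis=1)

print(column_totals)
print(row_diffs)
```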

Run differentials with .apply()
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1)
baseball_df['RD'] = run_diffs_apply
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs   RD
0  ARI     NL  2012  734  688  81  162         0   46
1  ATL     NL  2012  700  600  94  162         1  100
2  BAL     AL  2012  712  705  93  162         1    7
...

Comparing approaches
%%timeit
run_diffs_iterrows = []

for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)

baseball_df['RD'] = run_diffs_iterrows

86.8 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing approaches
%%timeit
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1)

baseball_df['RD'] = run_diffs_apply

30.1 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let's practice using pandas .apply() method!

Optimal pandas iterating

pandas internals
Eliminating loops applies to using pandas as well
pandas is built on NumPy
Take advantage of NumPy array efficiencies

print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
...

wins_np = baseball_df['W'].values
print(type(wins_np))

<class 'numpy.ndarray'>

print(wins_np)

[ 81 94 93 ...]
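
The slides use the .values attribute. The pandas documentation recommends the .to_numpy() method for the same purpose, and for a plain integer column like W the two give the same array. A minimal sketch, assuming a three-element stand-in Series:

```python
import numpy as np
import pandas as pd

# Stand-in for baseball_df['W'], using the first three values on the slides.
wins = pd.Series([81, 94, 93], name="W")

wins_values = wins.values        # attribute used on the slides
wins_to_numpy = wins.to_numpy()  # method recommended by the pandas docs

print(type(wins_values), type(wins_to_numpy))
print(np.array_equal(wins_values, wins_to_numpy))
```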

Power of vectorization
Vectorizing (operating on whole arrays at once, with NumPy broadcasting) is extremely efficient!

baseball_df['RS'].values - baseball_df['RA'].values

array([ 46, 100, 7, ..., 188, 110, -117])

Run differentials with arrays
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values
baseball_df['RD'] = run_diffs_np
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs    RD
0  ARI     NL  2012  734  688  81  162         0    46
1  ATL     NL  2012  700  600  94  162         1   100
2  BAL     AL  2012  712  705  93  162         1     7
3  BOS     AL  2012  734  806  69  162         0   -72
4  CHC     NL  2012  613  759  61  162         0  -146
...

Comparing approaches
%%timeit
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values

baseball_df['RD'] = run_diffs_np

124 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
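
The same array-based approach carries over to the win percentage from the start of the chapter. A sketch, assuming a five-row stand-in frame named stats_df built from the slide values: the division and np.round run once over whole arrays, with no Python-level loop. Subtracting the Series directly, without .values, is also vectorized and keeps the index.

```python
import numpy as np
import pandas as pd

# Five-row stand-in with the columns needed here, using values from the slides.
stats_df = pd.DataFrame({
    "RS": [734, 700, 712, 734, 613],
    "RA": [688, 600, 705, 806, 759],
    "W": [81, 94, 93, 69, 61],
    "G": [162] * 5,
})

# Win percentage for every row in one array operation.
stats_df["WP"] = np.round(stats_df["W"].to_numpy() / stats_df["G"].to_numpy(), 2)

# Run differential computed on the Series themselves.
stats_df["RD"] = stats_df["RS"] - stats_df["RA"]

print(stats_df)
```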

Let's put our skills into practice!

Congratulations!

What you have learned
The definition of efficient and Pythonic code

How to use Python's powerful built-in library

The advantages of NumPy arrays

Some handy magic commands to profile code

How to deploy efficient solutions with zip(), itertools, collections, and set theory

The cost of looping and how to eliminate loops

Best practices for iterating with pandas DataFrames

Well done!