Chapter 4
DataFrame iteration
WRITING EFFICIENT PYTHON CODE
Logan Thomas
Scientific Software Technical Trainer,
Enthought
pandas recap
See pandas overview in Intermediate Python
Library used for data analysis
Chapter Objective:
Best practice for iterating over a pandas DataFrame
import pandas as pd
baseball_df = pd.read_csv('baseball_stats.csv')
print(baseball_df.head())
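import numpy as np
def calc_win_perc(wins, games_played):  # helper name assumed; the def line is missing from the slide text
    win_perc = wins / games_played
    return np.round(win_perc, 2)
print(calc_win_perc(50, 100))  # example inputs assumed
0.5
# Loop over row positions with .iloc
win_perc_list = []
for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
baseball_df['WP'] = win_perc_list
183 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Same calculation with .iterrows()
win_perc_list = []
for i, row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
baseball_df['WP'] = win_perc_list
95.3 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)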
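The two looping styles can be checked side by side on a small stand-in table. This is a minimal sketch, not the course's code: the three-row `sample_df`, its values, and the helper name `calc_win_perc` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for baseball_df; the values are made up for illustration.
sample_df = pd.DataFrame({"W": [81, 94, 93], "G": [162, 162, 162]})


def calc_win_perc(wins, games_played):
    """Assumed helper: win percentage rounded to two decimals."""
    return np.round(wins / games_played, 2)


# Style 1: look up each row by position with .iloc
wp_iloc = []
for i in range(len(sample_df)):
    row = sample_df.iloc[i]
    wp_iloc.append(calc_win_perc(row["W"], row["G"]))

# Style 2: .iterrows() yields (index, row Series) pairs directly
wp_iterrows = []
for i, row in sample_df.iterrows():
    wp_iterrows.append(calc_win_perc(row["W"], row["G"]))

# Both styles should produce the same values
print(wp_iloc == wp_iterrows)
```

Both loops compute the same values. The difference is only in how each row is fetched, which is where the timings above diverge.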
Team wins data
print(team_wins_df)
  Team  Year   W
0  ARI  2012  81
1  ATL  2012  94
2  BAL  2012  93
3  BOS  2012  69
4  CHC  2012  61
...
print(row_namedtuple.Index)
print(row_namedtuple.Team)
ATL
527 ms ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)
7.48 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ARI
ATL
...
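The difference between the two row iterators is easiest to see in how fields are read. The sketch below is illustrative only: `sample_wins_df` is a hypothetical three-row stand-in built from the first rows of the table shown earlier, and the list names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for team_wins_df, using the first rows shown above.
sample_wins_df = pd.DataFrame(
    {"Team": ["ARI", "ATL", "BAL"], "Year": [2012, 2012, 2012], "W": [81, 94, 93]}
)

teams_iterrows = []
for idx, row in sample_wins_df.iterrows():
    # Each row is a pandas Series; fields are read with square brackets
    teams_iterrows.append(row["Team"])

teams_itertuples = []
first_index = None
for row_namedtuple in sample_wins_df.itertuples():
    # Each row is a namedtuple; fields are attributes and the index is .Index
    if first_index is None:
        first_index = row_namedtuple.Index
    teams_itertuples.append(row_namedtuple.Team)

# Both iterators visit the same teams in the same order
print(teams_iterrows == teams_itertuples)
```

Because a namedtuple has no square-bracket lookup by column name, `row_namedtuple['Team']` would raise an error; attribute access is the way to read its fields.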
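print(baseball_df.head())
def calc_run_diff(runs_scored, runs_allowed):  # parameter names assumed
    run_diff = runs_scored - runs_allowed
    return run_diff
run_diffs_iterrows = [calc_run_diff(row['RS'], row['RA']) for i, row in baseball_df.iterrows()]
baseball_df['RD'] = run_diffs_iterrows
print(baseball_df)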
Example:
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
baseball_df['RD'] = run_diffs_apply
30.1 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
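As a quick check that `.apply()` with `axis=1` gives the same result as a row loop, here is a minimal sketch. The `sample_df` values are made up, and the body of `calc_run_diff` is an assumption based on the 'RS' and 'RA' columns it is called with.

```python
import pandas as pd

# Hypothetical runs scored (RS) and runs allowed (RA) values.
sample_df = pd.DataFrame({"RS": [734, 700, 712], "RA": [688, 600, 705]})


def calc_run_diff(runs_scored, runs_allowed):
    """Assumed helper: run differential for one team."""
    return runs_scored - runs_allowed


# Row loop with .iterrows()
run_diffs_iterrows = [
    calc_run_diff(row["RS"], row["RA"]) for _, row in sample_df.iterrows()
]

# Same calculation with .apply(); axis=1 passes one row at a time to the lambda
run_diffs_apply = sample_df.apply(
    lambda row: calc_run_diff(row["RS"], row["RA"]), axis=1
)

# The Series from .apply() holds the same values as the loop's list
print(run_diffs_apply.tolist() == run_diffs_iterrows)
```

With `axis=1` the function receives each row; the default `axis=0` would pass each column instead.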
pandas internals
Eliminating loops applies to using pandas as well
pandas is built on NumPy
Take advantage of NumPy array efficiencies
wins_np = baseball_df['W'].values
print(type(wins_np))
<class 'numpy.ndarray'>
print(wins_np)
[ 81 94 93 ...]
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values
baseball_df['RD'] = run_diffs_np
124 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
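The vectorized approach can be sketched on a small stand-in table. The `sample_df` values are made up for illustration. The sketch also uses `.to_numpy()`, which the pandas documentation recommends over `.values`; both return the underlying NumPy array here.

```python
import numpy as np
import pandas as pd

# Hypothetical runs scored (RS) and runs allowed (RA) values.
sample_df = pd.DataFrame({"RS": [734, 700, 712], "RA": [688, 600, 705]})

# Subtract whole columns at once: no Python-level loop over rows
run_diffs_np = sample_df["RS"].values - sample_df["RA"].values
sample_df["RD"] = run_diffs_np

# .to_numpy() yields the same array as .values for these columns
run_diffs_to_numpy = sample_df["RS"].to_numpy() - sample_df["RA"].to_numpy()

print(type(run_diffs_np))
print(np.array_equal(run_diffs_np, run_diffs_to_numpy))
```

The subtraction runs once over the whole array inside NumPy, which is why the timing above is in microseconds rather than milliseconds.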
What you have learned
The definition of efficient and Pythonic code
How to deploy efficient solutions with zip(), itertools, collections, and set theory