
pandas: powerful Python data analysis toolkit
Release 1.4.4

Wes McKinney and the Pandas Development Team

Aug 31, 2022


CONTENTS

1 Getting started 3
1.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Intro to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Coming from... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 Package overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Getting started tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.4 Comparison with other tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.4.5 Community tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

2 User Guide 149


2.1 10 minutes to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
2.1.1 Object creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
2.1.2 Viewing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2.1.3 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
2.1.4 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.1.5 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
2.1.6 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
2.1.7 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
2.1.8 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
2.1.9 Time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
2.1.10 Categoricals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
2.1.11 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
2.1.12 Getting data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
2.1.13 Gotchas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
2.2 Intro to data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
2.2.1 Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
2.2.2 DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
2.3 Essential basic functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
2.3.1 Head and tail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
2.3.2 Attributes and underlying data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
2.3.3 Accelerated operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
2.3.4 Flexible binary operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
2.3.5 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
2.3.6 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
2.3.7 Reindexing and altering labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
2.3.8 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
2.3.9 .dt accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
2.3.10 Vectorized string methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

2.3.11 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
2.3.12 Copying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
2.3.13 dtypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
2.3.14 Selecting columns based on dtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
2.4 IO tools (text, CSV, HDF5, . . . ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
2.4.1 CSV & text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
2.4.2 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
2.4.3 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
2.4.4 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
2.4.5 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
2.4.6 Excel files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
2.4.7 OpenDocument Spreadsheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
2.4.8 Binary Excel (.xlsb) files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
2.4.9 Clipboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
2.4.10 Pickling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
2.4.11 msgpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
2.4.12 HDF5 (PyTables) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
2.4.13 Feather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
2.4.14 Parquet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
2.4.15 ORC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
2.4.16 SQL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
2.4.17 Google BigQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
2.4.18 Stata format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
2.4.19 SAS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
2.4.20 SPSS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.4.21 Other file formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.4.22 Performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.5 Indexing and selecting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
2.5.1 Different choices for indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
2.5.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
2.5.3 Attribute access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
2.5.4 Slicing ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
2.5.5 Selection by label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
2.5.6 Selection by position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
2.5.7 Selection by callable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
2.5.8 Combining positional and label-based indexing . . . . . . . . . . . . . . . . . . . . . . . . 428
2.5.9 Indexing with list with missing labels is deprecated . . . . . . . . . . . . . . . . . . . . . . 429
2.5.10 Selecting random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
2.5.11 Setting with enlargement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
2.5.12 Fast scalar value getting and setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
2.5.13 Boolean indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
2.5.14 Indexing with isin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
2.5.15 The where() Method and Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
2.5.16 Setting with enlargement conditionally using numpy() . . . . . . . . . . . . . . . . . . . . 444
2.5.17 The query() Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
2.5.18 Duplicate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
2.5.19 Dictionary-like get() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.20 Looking up values by index/column labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.21 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.22 Set / reset index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
2.5.23 Returning a view versus a copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
2.6 MultiIndex / advanced indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
2.6.1 Hierarchical indexing (MultiIndex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
2.6.2 Advanced indexing with hierarchical index . . . . . . . . . . . . . . . . . . . . . . . . . . . 477

2.6.3 Sorting a MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
2.6.4 Take methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
2.6.5 Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
2.6.6 Miscellaneous indexing FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
2.7 Merge, join, concatenate and compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
2.7.1 Concatenating objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
2.7.2 Database-style DataFrame or named Series joining/merging . . . . . . . . . . . . . . . . . 518
2.7.3 Timeseries friendly merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
2.7.4 Comparing objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
2.8 Reshaping and pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
2.8.1 Reshaping by pivoting DataFrame objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
2.8.2 Reshaping by stacking and unstacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
2.8.3 Reshaping by melt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
2.8.4 Combining with stats and GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
2.8.5 Pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
2.8.6 Cross tabulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
2.8.7 Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
2.8.8 Computing indicator / dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
2.8.9 Factorizing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
2.8.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
2.8.11 Exploding a list-like column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
2.9 Working with text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
2.9.1 Text data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
2.9.2 String methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
2.9.3 Splitting and replacing strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
2.9.4 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
2.9.5 Indexing with .str . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
2.9.6 Extracting substrings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
2.9.7 Testing for strings that match or contain a pattern . . . . . . . . . . . . . . . . . . . . . . . 595
2.9.8 Creating indicator variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
2.9.9 Method summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
2.10 Working with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
2.10.1 Values considered “missing” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
2.10.2 Inserting missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.10.3 Calculations with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
2.10.4 Sum/prod of empties/nans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
2.10.5 NA values in GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
2.10.6 Filling missing values: fillna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
2.10.7 Filling with a PandasObject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
2.10.8 Dropping axis labels with missing data: dropna . . . . . . . . . . . . . . . . . . . . . . . . 609
2.10.9 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
2.10.10 Replacing generic values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
2.10.11 String/regular expression replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
2.10.12 Numeric replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
2.10.13 Experimental NA scalar to denote missing values . . . . . . . . . . . . . . . . . . . . . . . . 625
2.11 Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
2.11.1 Consequences of Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
2.11.2 Duplicate Label Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
2.11.3 Disallowing Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
2.12 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
2.12.1 Object creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
2.12.2 CategoricalDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
2.12.3 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
2.12.4 Working with categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

2.12.5 Sorting and order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
2.12.6 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
2.12.7 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
2.12.8 Data munging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
2.12.9 Getting data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
2.12.10 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
2.12.11 Differences to R’s factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
2.12.12 Gotchas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
2.13 Nullable integer data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
2.13.1 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
2.13.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
2.13.3 Scalar NA Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14 Nullable Boolean data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14.1 Indexing with NA values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14.2 Kleene logical operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
2.15 Chart Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
2.15.1 Basic plotting: plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
2.15.2 Other plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
2.15.3 Plotting with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
2.15.4 Plotting tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
2.15.5 Plot formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
2.15.6 Plotting directly with matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
2.15.7 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
2.16 Table Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
2.16.1 Styler Object and HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
2.16.2 Formatting the Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
2.16.3 Methods to Add Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
2.16.4 Table Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
2.16.5 Setting Classes and Linking to External CSS . . . . . . . . . . . . . . . . . . . . . . . . . 755
2.16.6 Styler Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
2.16.7 Tooltips and Captions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
2.16.8 Finer Control with Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
2.16.9 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
2.16.10 Builtin Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
2.16.11 Sharing styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
2.16.12 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
2.16.13 Other Fun and Useful Stuff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
2.16.14 Export to Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
2.16.15 Export to LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
2.16.16 More About CSS and HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
2.16.17 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
2.17 Computational tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
2.17.1 Statistical functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
2.18 Group by: split-apply-combine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
2.18.1 Splitting an object into groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
2.18.2 Iterating through groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
2.18.3 Selecting a group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
2.18.4 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
2.18.5 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
2.18.6 Filtration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
2.18.7 Dispatching to instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
2.18.8 Flexible apply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
2.18.9 Numba Accelerated Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
2.18.10 Other useful features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808

2.18.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
2.19 Windowing Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
2.19.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
2.19.2 Rolling window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
2.19.3 Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
2.19.4 Expanding window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
2.19.5 Exponentially Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837
2.20 Time series / date functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
2.20.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
2.20.2 Timestamps vs. time spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 842
2.20.3 Converting to timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
2.20.4 Generating ranges of timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
2.20.5 Timestamp limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
2.20.6 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 852
2.20.7 Time/date components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
2.20.8 DateOffset objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
2.20.9 Time series-related instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
2.20.10 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
2.20.11 Time span representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
2.20.12 Converting between representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
2.20.13 Representing out-of-bounds spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
2.20.14 Time zone handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
2.21 Time deltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
2.21.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
2.21.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
2.21.3 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
2.21.4 Frequency conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
2.21.5 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 918
2.21.6 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
2.21.7 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22 Options and settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22.2 Available options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926
2.22.3 Getting and setting options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933
2.22.4 Setting startup options in Python/IPython environment . . . . . . . . . . . . . . . . . . . . 934
2.22.5 Frequently used options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934
2.22.6 Number formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 941
2.22.7 Unicode formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 942
2.22.8 Table schema display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
2.23 Enhancing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
2.23.1 Cython (writing C extensions for pandas) . . . . . . . . . . . . . . . . . . . . . . . . . . . 944
2.23.2 Numba (JIT compilation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
2.23.3 Expression evaluation via eval() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
2.24 Scaling to large datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
2.24.1 Load less data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
2.24.2 Use efficient datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
2.24.3 Use chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
2.24.4 Use other libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
2.25 Sparse data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
2.25.1 SparseArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
2.25.2 SparseDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
2.25.3 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
2.25.4 Sparse calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
2.25.5 Migrating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973

2.25.6 Interaction with scipy.sparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975
2.26 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
2.26.1 DataFrame memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
2.26.2 Using if/truth statements with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
2.26.3 Mutating with User Defined Function (UDF) methods . . . . . . . . . . . . . . . . . . . . . 983
2.26.4 NaN, Integer NA values and NA type promotions . . . . . . . . . . . . . . . . . . . . . . . . 984
2.26.5 Differences with NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.26.6 Thread-safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.26.7 Byte-ordering issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.27 Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.27.1 Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 988
2.27.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992
2.27.3 Multiindexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
2.27.4 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
2.27.5 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1001
2.27.6 Timeseries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
2.27.7 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
2.27.8 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
2.27.9 Data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1017
2.27.10 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1023
2.27.11 Timedeltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024
2.27.12 Creating example data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1026

3 API reference 1027


3.1 Input/output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027
3.1.1 Pickling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027
3.1.2 Flat file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
3.1.3 Clipboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044
3.1.4 Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046
3.1.5 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
3.1.6 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1069
3.1.7 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
3.1.8 Latex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
3.1.9 HDFStore: PyTables (HDF5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1091
3.1.10 Feather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
3.1.11 Parquet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1097
3.1.12 ORC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099
3.1.13 SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
3.1.14 SPSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1101
3.1.15 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1101
3.1.16 Google BigQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1108
3.1.17 STATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110
3.2 General functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115
3.2.1 Data manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115
3.2.2 Top-level missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
3.2.3 Top-level dealing with numeric data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1157
3.2.4 Top-level dealing with datetimelike data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
3.2.5 Top-level dealing with Interval data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1172
3.2.6 Top-level evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1174
3.2.7 Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1176
3.2.8 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
3.3 Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
3.3.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
3.3.2 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1444

3.3.3 Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1446
3.3.4 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1447
3.3.5 Binary operator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1449
3.3.6 Function application, GroupBy & window . . . . . . . . . . . . . . . . . . . . . . . . . . . 1450
3.3.7 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1450
3.3.8 Reindexing / selection / label manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1452
3.3.9 Missing data handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.10 Reshaping, sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.11 Combining / comparing / joining / merging . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.12 Time Series-related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1454
3.3.13 Accessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1454
3.3.14 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1572
3.3.15 Serialization / IO / conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4 DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4.2 Attributes and underlying data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1964
3.4.3 Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1964
3.4.4 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1965
3.4.5 Binary operator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1966
3.4.6 Function application, GroupBy & window . . . . . . . . . . . . . . . . . . . . . . . . . . . 1967
3.4.7 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1967
3.4.8 Reindexing / selection / label manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1969
3.4.9 Missing data handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1970
3.4.10 Reshaping, sorting, transposing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1970
3.4.11 Combining / comparing / joining / merging . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.12 Time Series-related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.13 Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.14 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1972
3.4.15 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1972
3.4.16 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2026
3.4.17 Serialization / IO / conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2028
3.5 pandas arrays, scalars, and data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2028
3.5.1 pandas.array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2029
3.5.2 Datetime data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2032
3.5.3 Timedelta data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2065
3.5.4 Timespan data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.5.5 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.5.6 Interval data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2093
3.5.7 Nullable integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2107
3.5.8 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2112
3.5.9 Sparse data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2118
3.5.10 Text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2120
3.5.11 Boolean data with missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2123
3.6 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2125
3.6.1 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2125
3.6.2 Numeric Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2193
3.6.3 CategoricalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2197
3.6.4 IntervalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2207
3.6.5 MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2219
3.6.6 DatetimeIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2239
3.6.7 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2274
3.6.8 PeriodIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2284
3.7 Date offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2292
3.7.1 DateOffset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2292

3.7.2 BusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2298
3.7.3 BusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2305
3.7.4 CustomBusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2311
3.7.5 CustomBusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2318
3.7.6 MonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2324
3.7.7 MonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2329
3.7.8 BusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2334
3.7.9 BusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2340
3.7.10 CustomBusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2345
3.7.11 CustomBusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2352
3.7.12 SemiMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2359
3.7.13 SemiMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2364
3.7.14 Week . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2370
3.7.15 WeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2375
3.7.16 LastWeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2381
3.7.17 BQuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2387
3.7.18 BQuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2393
3.7.19 QuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2399
3.7.20 QuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2404
3.7.21 BYearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2410
3.7.22 BYearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2416
3.7.23 YearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2421
3.7.24 YearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2426
3.7.25 FY5253 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2431
3.7.26 FY5253Quarter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2438
3.7.27 Easter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2445
3.7.28 Tick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2451
3.7.29 Day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2456
3.7.30 Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2462
3.7.31 Minute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2467
3.7.32 Second . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2473
3.7.33 Milli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2478
3.7.34 Micro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2484
3.7.35 Nano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2489
3.8 Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2495
3.8.1 pandas.tseries.frequencies.to_offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2495
3.9 Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2496
3.9.1 Rolling window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2496
3.9.2 Weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2514
3.9.3 Expanding window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2516
3.9.4 Exponentially-weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2530
3.9.5 Window indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2534
3.10 GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2537
3.10.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2537
3.10.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2542
3.10.3 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2554
3.11 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2610
3.11.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2610
3.11.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2613
3.11.3 Upsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2618
3.11.4 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2630
3.12 Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2636
3.12.1 Styler constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2636
3.12.2 Styler properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2679

3.12.3 Style application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2680
3.12.4 Builtin styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.12.5 Style export and import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.13 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.13.1 pandas.plotting.andrews_curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2682
3.13.2 pandas.plotting.autocorrelation_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2682
3.13.3 pandas.plotting.bootstrap_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2684
3.13.4 pandas.plotting.boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2686
3.13.5 pandas.plotting.deregister_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . 2692
3.13.6 pandas.plotting.lag_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2692
3.13.7 pandas.plotting.parallel_coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2695
3.13.8 pandas.plotting.plot_params . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2695
3.13.9 pandas.plotting.radviz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2697
3.13.10 pandas.plotting.register_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . . . 2698
3.13.11 pandas.plotting.scatter_matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2699
3.13.12 pandas.plotting.table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14 General utility functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14.1 Working with options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14.2 Testing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2720
3.14.3 Exceptions and warnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2725
3.14.4 Data types related functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2730
3.14.5 Bug report function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2758
3.15 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.1 pandas.api.extensions.register_extension_dtype . . . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.2 pandas.api.extensions.register_dataframe_accessor . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.3 pandas.api.extensions.register_series_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2761
3.15.4 pandas.api.extensions.register_index_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2762
3.15.5 pandas.api.extensions.ExtensionDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2763
3.15.6 pandas.api.extensions.ExtensionArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2767
3.15.7 pandas.arrays.PandasArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2781
3.15.8 pandas.api.indexers.check_array_indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2781

4 Development 2785
4.1 Contributing to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2785
4.1.1 Where to start? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2786
4.1.2 Bug reports and enhancement requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2786
4.1.3 Working with the code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2787
4.1.4 Contributing your changes to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2788
4.1.5 Tips for a successful pull request . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2 Creating a development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2.1 Creating an environment using Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2.2 Creating an environment without Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2792
4.3 Contributing to the documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2795
4.3.1 About the pandas documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2796
4.3.2 Updating a pandas docstring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2813
4.3.3 How to build the pandas documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2813
4.3.4 Previewing changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2814
4.4 Contributing to the code base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2815
4.4.1 Code standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2815
4.4.2 Pre-commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2816
4.4.3 Optional dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2816
4.4.4 Type hints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2820
4.4.5 Testing with continuous integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2822
4.4.6 Test-driven development/code writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2823

4.4.7 Running the test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2827
4.4.8 Running the performance test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2828
4.4.9 Documenting your code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
4.5 pandas code style guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
4.5.1 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.5.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.5.3 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.6 pandas maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.1 Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.3 Issue triage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.4 Closing issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2832
4.6.5 Reviewing pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.6 Backporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.7 Cleaning up old issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.8 Cleaning up old pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.9 Becoming a pandas maintainer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.6.10 Merging pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7 Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7.2 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2836
4.8 Test organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2836
4.9 Debugging C extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2839
4.9.1 Using a debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2839
4.9.2 Checking memory leaks with valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10 Extending pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10.1 Registering custom accessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10.2 Extension types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2841
4.10.3 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2844
4.10.4 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2847
4.11 Developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2848
4.11.1 Storing pandas DataFrame objects in Apache Parquet format . . . . . . . . . . . . . . . . . 2848
4.12 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.12.1 Version policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.12.2 Python support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.13 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.13.1 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.2 String data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.3 Consistent missing value handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.4 Apache Arrow interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.5 Block manager rewrite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.6 Decoupling of indexing and internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.7 Numba-accelerated operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.8 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.9 Roadmap evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.13.10 Completed items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.14 Developer meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.14.1 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2855
4.14.2 Calendar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2855

5 Release notes 2857


5.1 Version 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2857
5.1.1 What’s new in 1.4.4 (August 31, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2857
5.1.2 What’s new in 1.4.3 (June 23, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2859

5.1.3 What’s new in 1.4.2 (April 2, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2861
5.1.4 What’s new in 1.4.1 (February 12, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2862
5.1.5 What’s new in 1.4.0 (January 22, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2864
5.2 Version 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2899
5.2.1 What’s new in 1.3.5 (December 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2899
5.2.2 What’s new in 1.3.4 (October 17, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2900
5.2.3 What’s new in 1.3.3 (September 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2901
5.2.4 What’s new in 1.3.2 (August 15, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2903
5.2.5 What’s new in 1.3.1 (July 25, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2904
5.2.6 What’s new in 1.3.0 (July 2, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2906
5.3 Version 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2945
5.3.1 What’s new in 1.2.5 (June 22, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2945
5.3.2 What’s new in 1.2.4 (April 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2946
5.3.3 What’s new in 1.2.3 (March 02, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2947
5.3.4 What’s new in 1.2.2 (February 09, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2948
5.3.5 What’s new in 1.2.1 (January 20, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2950
5.3.6 What’s new in 1.2.0 (December 26, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2953
5.4 Version 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2982
5.4.1 What’s new in 1.1.5 (December 07, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2982
5.4.2 What’s new in 1.1.4 (October 30, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2984
5.4.3 What’s new in 1.1.3 (October 5, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2986
5.4.4 What’s new in 1.1.2 (September 8, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2988
5.4.5 What’s new in 1.1.1 (August 20, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2990
5.4.6 What’s new in 1.1.0 (July 28, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2992
5.5 Version 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3034
5.5.1 What’s new in 1.0.5 (June 17, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3034
5.5.2 What’s new in 1.0.4 (May 28, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3035
5.5.3 What’s new in 1.0.3 (March 17, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3036
5.5.4 What’s new in 1.0.2 (March 12, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3037
5.5.5 What’s new in 1.0.1 (February 5, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3040
5.5.6 What’s new in 1.0.0 (January 29, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3042
5.6 Version 0.25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3081
5.6.1 What’s new in 0.25.3 (October 31, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3081
5.6.2 What’s new in 0.25.2 (October 15, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3082
5.6.3 What’s new in 0.25.1 (August 21, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3083
5.6.4 What’s new in 0.25.0 (July 18, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3086
5.7 Version 0.24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3124
5.7.1 What’s new in 0.24.2 (March 12, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3124
5.7.2 What’s new in 0.24.1 (February 3, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3127
5.7.3 What’s new in 0.24.0 (January 25, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3128
5.8 Version 0.23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3185
5.8.1 What’s new in 0.23.4 (August 3, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3185
5.8.2 What’s new in 0.23.3 (July 7, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3186
5.8.3 What’s new in 0.23.2 (July 5, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3187
5.8.4 What’s new in 0.23.1 (June 12, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3190
5.8.5 What’s new in 0.23.0 (May 15, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3194
5.9 Version 0.22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3245
5.9.1 Version 0.22.0 (December 29, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3245
5.10 Version 0.21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3249
5.10.1 Version 0.21.1 (December 12, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3249
5.10.2 Version 0.21.0 (October 27, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3254
5.11 Version 0.20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3287
5.11.1 Version 0.20.3 (July 7, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3287
5.11.2 Version 0.20.2 (June 4, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3290

5.11.3 Version 0.20.1 (May 5, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3294
5.12 Version 0.19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3343
5.12.1 Version 0.19.2 (December 24, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3343
5.12.2 Version 0.19.1 (November 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3346
5.12.3 Version 0.19.0 (October 2, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3349
5.13 Version 0.18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3395
5.13.1 Version 0.18.1 (May 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3395
5.13.2 Version 0.18.0 (March 13, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3415
5.14 Version 0.17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3451
5.14.1 Version 0.17.1 (November 21, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3451
5.14.2 Version 0.17.0 (October 9, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3459
5.15 Version 0.16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3489
5.15.1 Version 0.16.2 (June 12, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3489
5.15.2 Version 0.16.1 (May 11, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3495
5.15.3 Version 0.16.0 (March 22, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3508
5.16 Version 0.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3526
5.16.1 Version 0.15.2 (December 12, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3526
5.16.2 Version 0.15.1 (November 9, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3533
5.16.3 Version 0.15.0 (October 18, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3540
5.17 Version 0.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3573
5.17.1 Version 0.14.1 (July 11, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3573
5.17.2 Version 0.14.0 (May 31, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3580
5.18 Version 0.13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3611
5.18.1 Version 0.13.1 (February 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3611
5.18.2 Version 0.13.0 (January 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3623
5.19 Version 0.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3653
5.19.1 Version 0.12.0 (July 24, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3653
5.20 Version 0.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3666
5.20.1 Version 0.11.0 (April 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3666
5.21 Version 0.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3677
5.21.1 Version 0.10.1 (January 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3677
5.21.2 Version 0.10.0 (December 17, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3684
5.22 Version 0.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3696
5.22.1 Version 0.9.1 (November 14, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3696
5.22.2 Version 0.9.0 (October 7, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3700
5.23 Version 0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3703
5.23.1 Version 0.8.1 (July 22, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3703
5.23.2 Version 0.8.0 (June 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3704
5.24 Version 0.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3710
5.24.1 Version 0.7.3 (April 12, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3710
5.24.2 Version 0.7.2 (March 16, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3713
5.24.3 Version 0.7.1 (February 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3714
5.24.4 Version 0.7.0 (February 9, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3715
5.25 Version 0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3721
5.25.1 Version 0.6.1 (December 13, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3721
5.25.2 Version 0.6.0 (November 25, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3722
5.26 Version 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3724
5.26.1 Version 0.5.0 (October 24, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3724
5.27 Version 0.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3725
5.27.1 Versions 0.4.1 through 0.4.3 (September 25 - October 9, 2011) . . . . . . . . . . . . . . . . 3725

Bibliography 3727

Python Module Index 3729


Date: Aug 31, 2022 Version: 1.4.4


Download documentation: PDF Version | Zipped HTML
Previous versions: Documentation of previous pandas versions is available at pandas.pydata.org.
Useful links: Binary Installers | Source Repository | Issues & Ideas | Q&A Support | Mailing List
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data
analysis tools for the Python programming language.

Getting started
New to pandas? Check out the getting started guides. They contain an introduction to pandas’ main concepts and links
to additional tutorials.
To the getting started guides

User guide
The user guide provides in-depth information on the key concepts of pandas with useful background information and
explanation.
To the user guide

API reference
The reference guide contains a detailed description of the pandas API. The reference describes how the methods work
and which parameters can be used. It assumes that you have an understanding of the key concepts.
To the reference guide

Developer guide
Saw a typo in the documentation? Want to improve existing functionalities? The contributing guidelines will guide
you through the process of improving pandas.
To the development guide

CHAPTER ONE

GETTING STARTED

1.1 Installation

Working with conda?


pandas is part of the Anaconda distribution and can be installed with Anaconda or Miniconda:

conda install pandas

Prefer pip?
pandas can be installed via pip from PyPI.

pip install pandas

In-depth instructions?
Installing a specific version? Installing from source? Check the advanced installation page.
Learn more

1.2 Intro to pandas

Straight to tutorial...
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you.
pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

To introduction tutorial
To user guide
Straight to tutorial...
pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet, ...).
Importing data from each of these data sources is provided by functions with the prefix read_*. Similarly, the to_*
methods are used to store data.

To introduction tutorial
To user guide
Straight to tutorial...

3
pandas: powerful Python data analysis toolkit, Release 1.4.4

Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting,
and extracting the data you need are available in pandas.

To introduction tutorial
To user guide
Straight to tutorial...
pandas provides plotting of your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter,
bar, boxplot, ...) corresponding to your data.

To introduction tutorial
To user guide
Straight to tutorial...
There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work
elementwise. Adding a column to a DataFrame based on existing data in other columns is straightforward.

To introduction tutorial
To user guide
Straight to tutorial...
Basic statistics (mean, median, min, max, counts, ...) are easily calculable. These or custom aggregations can be
applied on the entire data set, a sliding window of the data, or grouped by categories. The latter is also known as the
split-apply-combine approach.

To introduction tutorial
To user guide
Straight to tutorial...
Change the structure of your data table in multiple ways. You can melt() your data table from wide to long/tidy form
or pivot() from long to wide format. With aggregations built-in, a pivot table is created with a single command.

To introduction tutorial
To user guide
Straight to tutorial...
Multiple tables can be concatenated both column-wise and row-wise, and database-like join/merge operations are provided
to combine multiple tables of data.

To introduction tutorial
To user guide
Straight to tutorial...
pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-indexed
data.


To introduction tutorial
To user guide
Straight to tutorial...
Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and extract
useful information from it.
To introduction tutorial
To user guide

1.3 Coming from...

Are you familiar with other software for manipulating tabular data? Learn the pandas-equivalent operations compared
to the software you already know:

The R programming language provides the data.frame data structure and multiple packages, such as the tidyverse, which
use and extend data.frame for convenient data handling functionalities similar to pandas.
Learn more

Already familiar with SELECT, GROUP BY, JOIN, etc.? Most of these SQL manipulations have equivalents in pandas.
Learn more

The data set included in the STATA statistical software suite corresponds to the pandas DataFrame. Many of the
operations known from STATA have an equivalent in pandas.
Learn more

Users of Excel or other spreadsheet programs will find that many of the concepts are transferable to pandas.
Learn more

The SAS statistical software suite also provides the data set corresponding to the pandas DataFrame. SAS vectorized
operations, filtering, string processing operations, and more also have similar functions in pandas.
Learn more

1.4 Tutorials

For a quick overview of pandas functionality, see 10 Minutes to pandas.


You can also reference the pandas cheat sheet for a succinct guide for manipulating data with pandas.
The community produces a wide variety of tutorials available online. Some of this material is listed in the community-
contributed Community tutorials.


1.4.1 Installation

The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross platform distribution for
data analysis and scientific computing. This is the recommended installation method for most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development version are
also provided.

Python version support

Officially Python 3.8, 3.9 and 3.10.

Installing pandas

Installing with Anaconda

Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up the SciPy stack
(IPython, NumPy, Matplotlib, . . . ) is with Anaconda, a cross-platform (Linux, macOS, Windows) Python distribution
for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without needing to install
anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.
A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage to installing Anaconda is that you don’t need admin rights to install it. Anaconda can install in the
user’s home directory, which makes it trivial to delete Anaconda if you decide to remove it later (just delete that folder).

Installing with Miniconda

The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach
means you will install well over one hundred packages and involves downloading the installer, which is a few hundred
megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then installing pandas
with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is both
cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal, self-contained Python installation, and then use the Conda command to
install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The
installer can be found here
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify
a specific version of Python and set of libraries. Run the following commands from a terminal window:

conda create -n name_of_my_env python

This will create a minimal environment with only Python installed in it. To put yourself inside this environment, run:


source activate name_of_my_env

On Windows the command is:

activate name_of_my_env

The final step required is to install pandas. This can be done with the following command:

conda install pandas

To install a specific pandas version:

conda install pandas=0.20.3

To install other packages, IPython for example:

conda install ipython

To install the full Anaconda distribution:

conda install anaconda

If you need packages that are available to pip but not conda, then install pip, and then use pip to install those packages:

conda install pip


pip install django

Installing from PyPI

pandas can be installed via pip from PyPI.

Note: You must have pip>=19.3 to install from PyPI.

pip install pandas

Installing with ActivePython

Installation instructions for ActivePython can be found here. Versions 2.7, 3.5 and 3.6 include pandas.

Installing using your Linux distribution’s package manager.

The commands in this table will install pandas for Python 3 from your distribution.


Distribution      Status                       Download / Repository Link    Install method
Debian            stable                       official Debian repository    sudo apt-get install python3-pandas
Debian & Ubuntu   unstable (latest packages)   NeuroDebian                   sudo apt-get install python3-pandas
Ubuntu            stable                       official Ubuntu repository    sudo apt-get install python3-pandas
OpenSuse          stable                       OpenSuse Repository           zypper in python3-pandas
Fedora            stable                       official Fedora repository    dnf install python3-pandas
Centos/RHEL       stable                       EPEL repository               yum install python3-pandas

However, the packages in the Linux package managers are often a few versions behind, so to get the newest version of
pandas, it’s recommended to install using the pip or conda methods described above.

Handling ImportErrors

If you encounter an ImportError, it usually means that Python couldn’t find pandas in the list of available libraries.
Python internally has a list of directories it searches through to find packages. You can obtain these directories with:

import sys
sys.path

One way you could be encountering this error is if you have multiple Python installations on your system and you don’t
have pandas installed in the Python installation you’re currently using. In Linux/Mac you can run which python on
your terminal and it will tell you which Python installation you’re using. If it’s something like “/usr/bin/python”, you’re
using the Python from the system, which is not recommended.
It is highly recommended to use conda, for quick installation and for package and dependency updates. You can find
simple installation instructions for pandas on the getting started page.
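
As a quick, hedged check (the paths printed will of course differ per machine), you can ask the interpreter itself which Python and which pandas installation are currently in use:

import sys
import pandas as pd

print(sys.executable)   # path of the Python interpreter currently running
print(pd.__version__)   # installed pandas version
print(pd.__file__)      # location the pandas package was imported from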

Installing from source

See the contributing guide for complete instructions on building from the git source tree. Further, see creating a
development environment if you wish to create a pandas development environment.

Running the test suite

pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing. To
run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard,
installed), make sure you have pytest >= 6.0 and Hypothesis >= 3.58, then run:

>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas

============================= test session starts =============================
platform win32 -- Python 3.6.2, pytest-3.6.0, py-1.4.34, pluggy-0.4.0
rootdir: C:\Users\TP\Documents\Python\pandasdev\pandas, inifile: setup.cfg
collected 12145 items / 3 skipped

..................................................................S......
........S................................................................
.........................................................................

==================== 12130 passed, 12 skipped in 368.339 seconds =====================

Dependencies

Package Minimum supported version


NumPy 1.18.5
python-dateutil 2.8.1
pytz 2020.1

Recommended dependencies

• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart chunking
and caching to achieve large speedups. If installed, must be Version 2.7.1 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython routines to
achieve large speedups. If installed, must be Version 1.3.1 or higher.

Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially when
working with large data sets.
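
As an illustrative sketch (not part of the official instructions), you can check whether these accelerators are importable in your environment without triggering an ImportError:

import importlib.util

for name in ("numexpr", "bottleneck"):
    found = importlib.util.find_spec(name) is not None
    print(name, "is installed:", found)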

Optional dependencies

pandas has many optional dependencies that are only used for specific methods. For example, pandas.read_hdf()
requires the pytables package, while DataFrame.to_markdown() requires the tabulate package. If the optional
dependency is not installed, pandas will raise an ImportError when the method requiring that dependency is called.
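
A minimal sketch of this behavior, assuming tabulate is not installed in the environment:

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
try:
    print(df.to_markdown())   # requires the optional tabulate package
except ImportError as err:
    print("Optional dependency missing:", err)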

Visualization

Dependency Minimum Version Notes


matplotlib 3.3.2 Plotting library
Jinja2 2.11 Conditional formatting with DataFrame.style
tabulate 0.8.7 Printing in Markdown-friendly format (see tabulate)


Computation

Dependency   Minimum Version   Notes
SciPy        1.4.1             Miscellaneous statistical functions
numba        0.50.1            Alternative execution engine for rolling operations (see Enhancing Performance)
xarray       0.15.1            pandas-like API for N-dimensional data

Excel files

Dependency Minimum Version Notes


xlrd 2.0.1 Reading Excel
xlwt 1.3.0 Writing Excel
xlsxwriter 1.2.2 Writing Excel
openpyxl 3.0.3 Reading / writing for xlsx files
pyxlsb 1.0.6 Reading for xlsb files

HTML

Dependency Minimum Version Notes


BeautifulSoup4 4.8.2 HTML parser for read_html
html5lib 1.1 HTML parser for read_html
lxml 4.5.0 HTML parser for read_html

One of the following combinations of libraries is needed to use the top-level read_html() function:
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this approach.

Warning:
• If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will not work
with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding the installation
and usage of the above three libraries.
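
As a small sketch of the resulting usage (the URL is a placeholder, not a real data source), read_html() returns a list of DataFrames, one per <table> element found in the page:

import pandas as pd

tables = pd.read_html("https://example.com/page.html")  # placeholder URL
first_table = tables[0]   # pick the first parsed table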


XML

Dependency Minimum Version Notes


lxml 4.5.0 XML parser for read_xml and tree builder for to_xml

SQL databases

Dependency Minimum Version Notes


SQLAlchemy 1.4.0 SQL support for databases other than sqlite
psycopg2 2.8.4 PostgreSQL engine for sqlalchemy
pymysql 0.10.1 MySQL engine for sqlalchemy

Other data sources

Dependency Minimum Version Notes


PyTables 3.6.1 HDF5-based reading / writing
blosc 1.20.1 Compression for HDF5
zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
pyarrow 1.0.1 Parquet, ORC, and feather reading / writing
pyreadstat 1.1.0 SPSS files (.sav) reading

Warning:
• If you want to use read_orc(), it is highly recommended to install pyarrow using conda. The following is
a summary of the environment in which read_orc() can work.

System Conda PyPI


Linux Successful Failed (pyarrow==3.0 Successful)
macOS Successful Failed
Windows Failed Failed

Access data in the cloud

Dependency Minimum Version Notes


fsspec 0.7.4 Handling files aside from simple local and HTTP
gcsfs 0.6.0 Google Cloud Storage access
pandas-gbq 0.14.0 Google Big Query access
s3fs 0.4.0 Amazon S3 access


Clipboard

Dependency Minimum Version Notes


PyQt4/PyQt5 Clipboard I/O
qtpy Clipboard I/O
xclip Clipboard I/O on linux
xsel Clipboard I/O on linux
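
With one of the backends above available, clipboard round-trips look like the following sketch (assuming a table has first been copied, e.g. from a spreadsheet):

import pandas as pd

df = pd.read_clipboard()        # parse the current clipboard contents
df.to_clipboard(index=False)    # write a DataFrame back to the clipboard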

Compression

Dependency Minimum Version Notes


brotli 0.7.0 Brotli compression
python-snappy 0.6.0 Snappy compression
Zstandard 0.15.2 Zstandard compression
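
Compression is typically inferred from the file extension, but it can also be requested explicitly. A minimal sketch using Zstandard (the file name is just an example):

import pandas as pd

df = pd.DataFrame({"a": range(5)})
df.to_csv("data.csv.zst", compression="zstd")              # requires Zstandard
roundtrip = pd.read_csv("data.csv.zst", compression="zstd")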

1.4.2 Package overview

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open source data analysis/manipulation tool available in any language. It is already well on its way
toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a
pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the
vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users,
DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and
is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply
ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both
aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into
DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets


• Intuitive merging and joining data sets


• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading
data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window statistics,
date shifting, and lagging.
Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific
research environments. For data scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or
tabular display. pandas is the ideal tool for all of these tasks.
Some other notes
• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However,
as with anything else generalization usually sacrifices performance. So if you focus on one feature for your
application you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in
Python.
• pandas has been used extensively in production in financial applications.

Data structures

Dimensions   Name        Description
1            Series      1D labeled homogeneously-typed array
2            DataFrame   General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Why more than one data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example,
DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove
objects from these containers in a dictionary-like fashion.
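
A minimal sketch of this dictionary-like behavior (the column names are arbitrary):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df["b"] = pd.Series([4, 5, 6])   # insert a Series as a new column, dict-style
del df["a"]                      # remove a column, dict-style
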
Also, we would like sensible default behaviors for the common API functions which take into account the typical
orientation of time series and cross-sectional data sets. When using the N-dimensional array (ndarrays) to store 2- and
3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions;
axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In
pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set, there is likely to
be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data
transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the
columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable
code:

for col in df.columns:
    series = df[col]
    # do something with series


Mutability and copying of data

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The
length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability
where sensible.
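
A small illustration of this, using replace() as an arbitrary example of a method that returns a new object:

import pandas as pd

s = pd.Series([1, 2, 3])
s2 = s.replace(2, 99)    # returns a new Series

print(s.tolist())        # [1, 2, 3] -- the input is untouched
print(s2.tolist())       # [1, 99, 3]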

Getting support

The first stop for pandas issues and ideas is the Github Issue Tracker. If you have a general question, pandas community
experts can answer through Stack Overflow.

Community

pandas is actively supported today by a community of like-minded individuals around the world who contribute their
valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of the development of pandas as a
world-class open-source project and makes it possible to donate to the project.

Project governance

The governance process that the pandas project has used informally since its inception in 2008 is formalized in Project
Governance documents. The documents clarify how decisions are made and how the various elements of our community
interact, including the relationship between open source collaborative development and work that may be funded
by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).

Development team

The list of the Core Team members and more detailed information can be found on the people’s page of the governance
repo.

Institutional partners

The information about current institutional partners can be found on pandas website page.

License

BSD 3-Clause License

Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team

All rights reserved.

Copyright (c) 2011-2021, Open source contributors.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

1.4.3 Getting started tutorials

What kind of data does pandas handle?

I want to start using pandas

In [1]: import pandas as pd

To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is
pd, so loading pandas as pd is assumed standard practice for all of the pandas documentation.

pandas data table representation

I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers)
and sex (male/female) data.

In [2]: df = pd.DataFrame(
   ...:     {
   ...:         "Name": [
   ...:             "Braund, Mr. Owen Harris",
   ...:             "Allen, Mr. William Henry",
   ...:             "Bonnell, Miss. Elizabeth",
   ...:         ],
   ...:         "Age": [22, 35, 58],
   ...:         "Sex": ["male", "male", "female"],
   ...:     }
   ...: )
   ...:

In [3]: df
Out[3]:
Name Age Sex
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female

To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys
will be used as column headers and the values in each list as columns of the DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers,
floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the
data.frame in R.
• The table has 3 columns, each of them with a column label. The column labels are respectively Name, Age and
Sex.
• The column Name consists of textual data with each value a string, the column Age consists of numbers and the
column Sex of textual data.
In spreadsheet software, the table representation of our data would look very similar:


Each column in a DataFrame is a Series

I’m just interested in working with the data in the column Age

In [4]: df["Age"]
Out[4]:
0 22
1 35
2 58
Name: Age, dtype: int64

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the
column label in between square brackets [].

Note: If you are familiar with Python dictionaries, the selection of a single column is very similar to the selection of
dictionary values based on the key.
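
A hedged sketch of the analogy, with a plain dictionary holding the same data as the df defined above:

data = {"Name": ["Braund", "Allen", "Bonnell"], "Age": [22, 35, 58]}
data["Age"]    # plain dictionary lookup by key
df["Age"]      # analogous column selection from the DataFrame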

You can create a Series from scratch as well:

In [5]: ages = pd.Series([22, 35, 58], name="Age")

In [6]: ages
Out[6]:
0 22
1 35
2 58
Name: Age, dtype: int64

A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.

Do something with a DataFrame or Series

I want to know the maximum Age of the passengers


We can do this on the DataFrame by selecting the Age column and applying max():

In [7]: df["Age"].max()
Out[7]: 58

Or to the Series:

In [8]: ages.max()
Out[8]: 58

As illustrated by the max() method, you can do things with a DataFrame or Series. pandas provides a lot of
functionalities, each of them a method you can apply to a DataFrame or Series. As methods are functions, do not
forget to use parentheses ().
I’m interested in some basic statistics of the numerical data of my data table


In [9]: df.describe()
Out[9]:
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000

The describe() method provides a quick overview of the numerical data in a DataFrame. As the Name and Sex
columns are textual data, these are by default not taken into account by the describe() method.
Many pandas operations return a DataFrame or a Series. The describe() method is an example of a pandas
operation returning a pandas Series or a pandas DataFrame.
Check more options on describe in the user guide section about aggregations with describe.

Note: This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns
and rows. Apart from the representation, the data manipulations and calculations you would do in spreadsheet
software are also supported by pandas. Continue reading the next tutorials to get started!

• Import the package, aka import pandas as pd


• A table of data is stored as a pandas DataFrame
• Each column in a DataFrame is a Series
• You can do things by applying a method to a DataFrame or Series
A more extended explanation to DataFrame and Series is provided in the introduction to data structures.

In [1]: import pandas as pd

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 or 1: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Indication that the passenger has siblings and/or a spouse aboard.
• Parch: Whether a passenger is alone or has family aboard.
• Ticket: Ticket number of the passenger.
• Fare: The fare paid.
• Cabin: The cabin of the passenger.
• Embarked: The port of embarkation.


How do I read and write tabular data?

I want to analyze the Titanic passenger data, available as a CSV file.


In [2]: titanic = pd.read_csv("data/titanic.csv")

pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports
many different file formats or data sources out of the box (csv, excel, sql, json, parquet, ...), each of them with the
prefix read_*.
Make sure to always have a check on the data after reading it in. When displaying a DataFrame, the first and
last 5 rows will be shown by default:
In [3]: titanic
Out[3]:
PassengerId Survived Pclass Name ..
˓→. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ..
˓→. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ..
˓→. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ..
˓→. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ..
˓→. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ..
˓→. 373450 8.0500 NaN S
.. ... ... ... ... ..
˓→. ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas ..
˓→. 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith ..
˓→. 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" ..
˓→. W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell ..
˓→. 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick ..
˓→. 370376 7.7500 NaN Q

[891 rows x 12 columns]

I want to see the first 8 rows of a pandas DataFrame.

In [4]: titanic.head(8)
Out[4]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ␣
˓→male ... 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J ␣
˓→male ... 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard ␣
˓→male ... 1 349909 21.0750 NaN S

[8 rows x 12 columns]

To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8) as
an argument.

Note: Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10)
will return the last 10 rows of the DataFrame.

A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes
attribute:

In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

For each of the columns, the used data type is enlisted. The data types in this DataFrame are integers (int64), floats
(float64) and strings (object).

Note: When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series.
Attributes of a DataFrame or Series do not need brackets. Attributes represent a characteristic of a
DataFrame/Series, whereas a method (which requires brackets) does something with the DataFrame/Series, as
introduced in the first tutorial.
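
A side-by-side illustration, assuming the titanic DataFrame from above:

titanic.dtypes    # attribute: a characteristic of the DataFrame, no brackets
titanic.head()    # method: does something with the DataFrame, brackets needed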

My colleague requested the Titanic data as a spreadsheet.

In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel()


method stores the data as an Excel file. In the example here, the sheet is named passengers, instead of the default
Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.
The equivalent read function read_excel() will reload the data to a DataFrame:

In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

In [8]: titanic.head()
Out[8]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S

[5 rows x 12 columns]

I’m interested in a technical summary of a DataFrame

In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:
• It is indeed a DataFrame.
• There are 891 entries, i.e. 891 rows.
• Each row has a row label (aka the index) with values ranging from 0 to 890.
• The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some
columns do have missing values and fewer than 891 non-null values.


• The columns Name, Sex, Cabin and Embarked consist of textual data (strings, aka object). The other columns
are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
• The kind of data (characters, integers, ...) in the different columns is summarized by listing the dtypes.
• The approximate amount of RAM used to hold the DataFrame is provided as well.
• Getting data in to pandas from many different file formats or data sources is supported by read_* functions.
• Exporting data out of pandas is provided by different to_* methods.
• The head/tail/info methods and the dtypes attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about
reader and writer functions.

In [1]: import pandas as pd

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 or 1: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Indication that the passenger has siblings and/or a spouse aboard.
• Parch: Whether a passenger is alone or has family aboard.
• Ticket: Ticket number of the passenger.
• Fare: The fare paid.
• Cabin: The cabin of the passenger.
• Embarked: The port of embarkation.

In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S

[5 rows x 12 columns]


How do I select a subset of a DataFrame?

How do I select specific columns from a DataFrame?

I’m interested in the age of the Titanic passengers.

In [4]: ages = titanic["Age"]

In [5]: ages.head()
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

To select a single column, use square brackets [] with the column name of the column of interest.
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series.
We can verify this by checking the type of the output:

In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series

And have a look at the shape of the output:

In [7]: titanic["Age"].shape
Out[7]: (891,)

DataFrame.shape is an attribute (remember the tutorial on reading and writing: do not use parentheses for attributes)
of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series
is 1-dimensional and only the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.

In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male

To select multiple columns, use a list of column names within the selection brackets [].

Note: The inner square brackets define a Python list with column names, whereas the outer brackets are used to select
the data from a pandas DataFrame as seen in the previous example.

The returned data type is a pandas DataFrame:


In [10]: type(titanic[["Age", "Sex"]])


Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape


Out[11]: (891, 2)

The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with
both a row and column dimension.
For basic information on indexing, see the user guide section on indexing and selecting data.

How do I filter specific rows from a DataFrame?

I’m interested in the passengers older than 35 years.

In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J ␣
˓→male ... 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth ␣
˓→female ... 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan ␣
˓→male ... 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) ␣
˓→female ... 0 248706 16.0000 NaN S

[5 rows x 12 columns]

To select rows based on a conditional expression, use a condition inside the selection brackets [].
The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value
larger than 35:

In [14]: titanic["Age"] > 35


Out[14]:
0 False
1 True
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool

The output of the conditional expression (>, but also ==, !=, <, <=, ... would work) is actually a pandas Series of
boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of
boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for
which the value is True will be selected.
We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of
rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:
In [15]: above_35.shape
Out[15]: (217, 12)

I’m interested in the Titanic passengers from cabin class 2 and 3.


In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp ␣
˓→Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 ␣
˓→ 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 ␣
˓→ 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 ␣
˓→ 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 ␣
˓→ 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 ␣
˓→ 1 349909 21.0750 NaN S

Similar to the conditional expression, the isin() conditional function returns a True for each row where the value is in
the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets
[]. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which
rows the Pclass column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with
an | (or) operator:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

In [19]: class_23.head()
Out[19]:
   PassengerId  Survived  Pclass                            Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3         Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
2            3         1       3          Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
4            5         0       3        Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6         0       3                Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
7            8         0       3  Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S

Note: When combining multiple conditional statements, each condition must be surrounded by parentheses ().
Moreover, you cannot use or/and but need to use the or operator | and the and operator &.
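
As a small sketch of that rule, combining two conditions with the & operator (using the Titanic columns from above; the variable name is just for illustration):

# Hypothetical example: female passengers older than 35.
# Each condition is wrapped in parentheses and combined with & instead of `and`.
female_above_35 = titanic[(titanic["Sex"] == "female") & (titanic["Age"] > 35)]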

See the dedicated section in the user guide about boolean indexing or about the isin function.
I want to work with passenger data for which the age is known.

In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S

[5 rows x 12 columns]

The notna() conditional function returns True for each row whose value is not a null value. As such, this can be
combined with the selection brackets [] to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if
the shape has changed:

In [22]: age_no_na.shape
Out[22]: (714, 12)

For more dedicated functions on missing values, see the user guide section about handling missing data.

How do I select specific rows and columns from a DataFrame?

I’m interested in the names of the passengers older than 35 years.

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object

In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part
before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc operator in front of the selection
brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels,
a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
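
A short sketch of those loc variants (column names as in the Titanic table above; the variable names are made up for illustration):

names_above_35 = titanic.loc[titanic["Age"] > 35, "Name"]  # condition + single label
name_and_age = titanic.loc[:, ["Name", "Age"]]             # colon selects all rows
label_slices = titanic.loc[0:4, "Name":"Age"]              # slices of row and column labels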
I'm interested in rows 10 to 25 and columns 3 to 5.

In [25]: titanic.iloc[9:25, 2:5]


Out[25]:
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
.. ... ... ...
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female

[16 rows x 3 columns]

Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc
operator in front of the selection brackets [].
When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For
example, to assign the name anonymous to the first 3 elements of the third column:

In [26]: titanic.iloc[0:3, 3] = "anonymous"

In [27]: titanic.head()
Out[27]:
   PassengerId  Survived  Pclass                                           Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                                      anonymous    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1                                      anonymous  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                                      anonymous  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                       Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S

[5 rows x 12 columns]

See the user guide section on different choices for indexing to get more insight in the usage of loc and iloc.
• When selecting subsets of data, square brackets [] are used.
• Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a
conditional expression or a colon.
• Select specific rows and/or columns using loc when using the row and column names
• Select specific rows and/or columns using iloc when using the positions in the table
• You can assign new values to a selection based on loc/iloc.
A full overview of indexing is provided in the user guide pages on indexing and selecting data.

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

For this tutorial, air quality data about NO2 is used, made available by openaq and using the py-openaq package. The
air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London
Westminster in respectively Paris, Antwerp and London.

In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

In [4]: air_quality.head()
Out[4]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN

Note: This uses the index_col and parse_dates parameters of the read_csv function to define the first (0th)
column as the index of the resulting DataFrame and to convert the dates in that column to Timestamp objects, respectively.


How to create plots in pandas?

I want a quick visual check of the data.

In [5]: air_quality.plot()
Out[5]: <AxesSubplot:xlabel='datetime'>

With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.
I want to plot only the columns of the data table with the data from Paris.

In [6]: air_quality["station_paris"].plot()
Out[6]: <AxesSubplot:xlabel='datetime'>


To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method.
Hence, the plot() method works on both Series and DataFrame.
I want to visually compare the NO2 values measured in London versus Paris.

In [7]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)


Out[7]: <AxesSubplot:xlabel='station_london', ylabel='station_paris'>


Apart from the default line plot when using the plot function, a number of alternatives are available to plot data.
Let’s use some standard Python to get an overview of the available plot methods:

In [8]: [
...: method_name
...: for method_name in dir(air_quality.plot)
...: if not method_name.startswith("_")
...: ]
...:
Out[8]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']

Note: In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an
overview of the available methods, for example air_quality.plot. + TAB.

One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable on the air
quality example data:

In [9]: air_quality.plot.box()
Out[9]: <AxesSubplot:>

For an introduction to plots other than the default line plot, see the user guide section about supported plot styles.
I want each of the columns in a separate subplot.

In [10]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)


Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The
builtin options available in each of the pandas plot functions are worth reviewing.
Some more formatting options are explained in the user guide section on plot formatting.
I want to further customize, extend or save the resulting plot.

In [11]: fig, axs = plt.subplots(figsize=(12, 4))

In [12]: air_quality.plot.area(ax=axs)
Out[12]: <AxesSubplot:xlabel='datetime'>

In [13]: axs.set_ylabel("NO$_2$ concentration")


Out[13]: Text(0, 0.5, 'NO$_2$ concentration')

In [14]: fig.savefig("no2_concentrations.png")

Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize
plots, making the link between pandas and Matplotlib explicit makes the full power of Matplotlib available for the plot.
This strategy is applied in the previous example:

fig, axs = plt.subplots(figsize=(12, 4))  # Create an empty matplotlib Figure and Axes
air_quality.plot.area(ax=axs)             # Use pandas to put the area plot on the prepared Figure/Axes
axs.set_ylabel("NO$_2$ concentration")    # Do any matplotlib customization you like
fig.savefig("no2_concentrations.png")     # Save the Figure/Axes using the existing matplotlib method.

• The .plot.* methods are applicable on both Series and DataFrames


• By default, each of the columns is plotted as a different element (line, boxplot, ...)
• Any plot created by pandas is a Matplotlib object.
A full overview of plotting in pandas is provided in the visualization pages.

In [1]: import pandas as pd

For this tutorial, air quality data about NO2 is used, made available by openaq and using the py-openaq package. The
air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London
Westminster in respectively Paris, Antwerp and London.

In [2]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

In [3]: air_quality.head()
Out[3]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN

How to create new columns derived from existing columns?

I want to express the NO2 concentration of the station in London in mg/m3.

(If we assume temperature of 25 degrees Celsius and pressure of 1013 hPa, the conversion factor is 1.882.)

In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

In [5]: air_quality.head()
Out[5]:
station_antwerp station_paris station_london london_mg_per_cubic
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286
2019-05-07 03:00:00 50.5 25.0 19.0 35.758
2019-05-07 04:00:00 45.0 27.7 19.0 35.758
2019-05-07 05:00:00 NaN 50.4 16.0 30.112
2019-05-07 06:00:00 NaN 61.9 NaN NaN

To create a new column, use the [] brackets with the new column name at the left side of the assignment.

Note: The calculation of the values is done element-wise. This means all values in the given column are multiplied
by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!
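
To make the element-wise idea concrete, a rough sketch of what the single expression above replaces (illustration only, not recommended in practice):

# Equivalent explicit loop — slower and more verbose than the vectorized version:
converted = [value * 1.882 for value in air_quality["station_london"]]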

I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column

In [6]: air_quality["ratio_paris_antwerp"] = (
...: air_quality["station_paris"] / air_quality["station_antwerp"]
...: )
...:

In [7]: air_quality.head()
Out[7]:
                     station_antwerp  station_paris  station_london  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00              NaN            NaN            23.0               43.286                  NaN
2019-05-07 03:00:00             50.5           25.0            19.0               35.758             0.495050
2019-05-07 04:00:00             45.0           27.7            19.0               35.758             0.615556
2019-05-07 05:00:00              NaN           50.4            16.0               30.112                  NaN
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN                  NaN

The calculation is again element-wise, so the / is applied for the values in each row.
Other mathematical operators (+, -, *, /) and logical operators (<, >, ==, ...) also work element-wise. The latter was
already used in the subset data tutorial to filter rows of a table using a conditional expression.
If you need more advanced logic, you can use arbitrary Python code via apply().
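A minimal sketch of such an apply() call; the classify function and its threshold are made up for illustration and are not part of the data set:

# Hypothetical row-wise classification, applied with axis=1 (one call per row):
def classify(row):
    if row["station_paris"] > 2 * row["station_antwerp"]:
        return "paris much higher"
    return "comparable"

air_quality["comparison"] = air_quality.apply(classify, axis=1)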
I want to rename the data columns to the corresponding station identifiers used by openAQ
In [8]: air_quality_renamed = air_quality.rename(
...: columns={
...: "station_antwerp": "BETR801",
...: "station_paris": "FR04014",
...: "station_london": "London Westminster",
...: }
...: )
...:

In [9]: air_quality_renamed.head()
Out[9]:
                     BETR801  FR04014  London Westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN

The rename() function can be used for both row labels and column labels. Provide a dictionary with the keys the
current names and the values the new names to update the corresponding names.
The mapping should not be restricted to fixed names only, but can be a mapping function as well. For example,
converting the column names to lowercase letters can be done using a function as well:

In [10]: air_quality_renamed = air_quality_renamed.rename(columns=str.lower)

In [11]: air_quality_renamed.head()
Out[11]:
                     betr801  fr04014  london westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN

Details about column or row label renaming are provided in the user guide section on renaming labels.
• Create a new column by assigning the output to the DataFrame with a new column name in between the [].
• Operations are element-wise, no need to loop over rows.
• Use rename with a dictionary or function to rename row labels or column names.
The user guide contains a separate section on column addition and deletion.

In [1]: import pandas as pd

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Indicates whether the passenger survived: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Ticket: Ticket number of the passenger.
• Fare: The ticket fare.
• Cabin: The cabin number of the passenger.
• Embarked: The port of embarkation.

In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S

[5 rows x 12 columns]

How to calculate summary statistics?

Aggregating statistics

What is the average age of the Titanic passengers?

In [4]: titanic["Age"].mean()
Out[4]: 29.69911764705882

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude
missing data and operate across rows by default.
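As a small aside, most of these reductions accept a skipna parameter; a sketch assuming the Titanic table from above:

# With skipna=False missing values propagate, so this returns NaN
# because the Age column contains missing values:
titanic["Age"].mean(skipna=False)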

What is the median age and ticket fare price of the Titanic passengers?

In [5]: titanic[["Age", "Fare"]].median()


Out[5]:
Age 28.0000
Fare 14.4542
dtype: float64

The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame, see the
subset data tutorial) is calculated for each numeric column.


The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function
from the first tutorial?

In [6]: titanic[["Age", "Fare"]].describe()


Out[6]:
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined
using the DataFrame.agg() method:

In [7]: titanic.agg(
...: {
...: "Age": ["min", "max", "median", "skew"],
...: "Fare": ["min", "max", "median", "mean"],
...: }
...: )
...:
Out[7]:
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208

Details about descriptive statistics are provided in the user guide section on descriptive statistics.

Aggregating statistics grouped by category

What is the average age for male versus female Titanic passengers?

In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean()


Out[8]:
Age
Sex
female 27.915709
male 30.726645

As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[["Sex
", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category. The average
age for each gender is calculated and returned.
Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a
common pattern. The groupby method is used to support this type of operations. More generally, this fits the
split-apply-combine pattern:


• Split the data into groups


• Apply a function to each group independently
• Combine the results into a data structure
The apply and combine steps are typically done together in pandas.
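To make the pattern tangible, a rough sketch of the three steps written out by hand (illustration only; the single groupby call above does all of this internally):

# Iterating over a GroupBy object yields (group name, group DataFrame) pairs:
means = {}
for sex, group in titanic.groupby("Sex"):  # split
    means[sex] = group["Age"].mean()       # apply
result = pd.Series(means, name="Age")      # combine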
In the previous example, we explicitly selected the 2 columns first. If not, the mean method is applied to each column
containing numerical data:

In [9]: titanic.groupby("Sex").mean()
Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893

It does not make much sense to get the average value of Pclass. If we are only interested in the average age for
each gender, the selection of columns (square brackets [] as usual) is supported on the grouped data as well:

In [10]: titanic.groupby("Sex")["Age"].mean()
Out[10]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64

Note: The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively
the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a
Categorical data type to handle this type of data. More information is provided in the user guide Categorical data
section.

What is the mean ticket fare price for each of the sex and cabin class combinations?

In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean()


Out[11]:
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
Name: Fare, dtype: float64

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby()
method.
A full description on the split-apply-combine approach is provided in the user guide section on groupby operations.


Count number of records by category

What is the number of passengers in each of the cabin classes?

In [12]: titanic["Pclass"].value_counts()
Out[12]:
3 491
1 216
2 184
Name: Pclass, dtype: int64

The value_counts() method counts the number of records for each category in a column.
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
within each group:

In [13]: titanic.groupby("Pclass")["Pclass"].count()
Out[13]:
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64

Note: Both size and count can be used in combination with groupby. Whereas size includes NaN values and just
provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use
the dropna argument to include or exclude the NaN values.
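
A short sketch contrasting the three, assuming the Titanic table from above:

titanic.groupby("Pclass").size()              # all rows per class, NaN included
titanic.groupby("Pclass")["Age"].count()      # non-missing Age values per class
titanic["Pclass"].value_counts(dropna=False)  # counts per category, keeping NaN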

The user guide has a dedicated section on value_counts; see the page on discretization.
• Aggregation statistics can be calculated on entire columns or rows
• groupby provides the power of the split-apply-combine pattern
• value_counts is a convenient shortcut to count the number of entries in each category of a variable
A full description on the split-apply-combine approach is provided in the user guide pages about groupby operations.

In [1]: import pandas as pd

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Indicates whether the passenger survived: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Ticket: Ticket number of the passenger.
• Fare: The ticket fare.
• Cabin: The cabin number of the passenger.
• Embarked: The port of embarkation.

In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S
[5 rows x 12 columns]

This tutorial uses air quality data about NO2 and Particulate matter less than 2.5 micrometers, made available by
openaq and using the py-openaq package. The air_quality_long.csv data set provides NO2 and PM25 values for
the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
The air-quality data set has the following columns:
• city: city where the sensor is used, either Paris, Antwerp or London
• country: country where the sensor is used, either FR, BE or GB
• location: the id of the sensor, either FR04014, BETR801 or London Westminster
• parameter: the parameter measured by the sensor, either NO2 or Particulate matter
• value: the measured value
• unit: the unit of the measured parameter, in this case 'µg/m3'
and the index of the DataFrame is datetime, the datetime of the measurement.

Note: The air-quality data is provided in a so-called long format data representation with each observation on a
separate row and each variable a separate column of the data table. The long/narrow format is also known as the tidy
data format.

In [4]: air_quality = pd.read_csv(
   ...:     "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
   ...: )
   ...:

In [5]: air_quality.head()
Out[5]:
city country location parameter value unit
date.utc
2019-06-18 06:00:00+00:00 Antwerpen BE BETR801 pm25 18.0 µg/m3
2019-06-17 08:00:00+00:00 Antwerpen BE BETR801 pm25 6.5 µg/m3
2019-06-17 07:00:00+00:00 Antwerpen BE BETR801 pm25 18.5 µg/m3
2019-06-17 06:00:00+00:00 Antwerpen BE BETR801 pm25 16.0 µg/m3
2019-06-17 05:00:00+00:00 Antwerpen BE BETR801 pm25 7.5 µg/m3

How to reshape the layout of tables?

Sort table rows

I want to sort the Titanic data according to the age of the passengers.

In [6]: titanic.sort_values(by="Age").head()
Out[6]:
     PassengerId  Survived  Pclass                             Name     Sex   Age  SibSp  Parch  Ticket     Fare Cabin Embarked
803          804         1       3  Thomas, Master. Assad Alexander    male  0.42      0      1    2625   8.5167   NaN        C
755          756         1       2        Hamalainen, Master. Viljo    male  0.67      1      1  250649  14.5000   NaN        S
644          645         1       3           Baclini, Miss. Eugenie  female  0.75      2      1    2666  19.2583   NaN        C
469          470         1       3    Baclini, Miss. Helene Barbara  female  0.75      2      1    2666  19.2583   NaN        C
78            79         1       2    Caldwell, Master. Alden Gates    male  0.83      0      2  248738  29.0000   NaN        S

I want to sort the Titanic data according to the cabin class and age in descending order.

In [7]: titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()


Out[7]:
     PassengerId  Survived  Pclass                       Name     Sex   Age  SibSp  Parch  Ticket    Fare Cabin Embarked
851          852         0       3        Svensson, Mr. Johan    male  74.0      0      0  347060  7.7750   NaN        S
116          117         0       3       Connors, Mr. Patrick    male  70.5      0      0  370369  7.7500   NaN        Q
280          281         0       3           Duane, Mr. Frank    male  65.0      0      0  336439  7.7500   NaN        Q
483          484         1       3     Turkula, Mrs. (Hedwig)  female  63.0      0      0    4134  9.5875   NaN        S
326          327         0       3  Nysveen, Mr. Johan Hansen    male  61.0      0      0  345364  6.2375   NaN        S

With DataFrame.sort_values(), the rows in the table are sorted according to the defined column(s). The index will
follow the row order.
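If a fresh 0..n-1 index is preferred over keeping the original row labels, a small sketch using the optional ignore_index argument:

# Sort by age and discard the original index (illustration only):
titanic.sort_values(by="Age", ignore_index=True).head()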
More details about sorting of tables are provided in the user guide section on sorting data.


Long to wide table format

Let's use a small subset of the air quality data set. We focus on NO2 data and only use the first two measurements of
each location (i.e. the head of each group). The subset of data will be called no2_subset.

# filter for no2 data only


In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]

# use 2 measurements (head) for each location (groupby)


In [9]: no2_subset = no2.sort_index().groupby(["location"]).head(2)

In [10]: no2_subset
Out[10]:
city country location parameter value unit
date.utc
2019-04-09 01:00:00+00:00 Antwerpen BE BETR801 no2 22.5 µg/m3
2019-04-09 01:00:00+00:00 Paris FR FR04014 no2 24.4 µg/m3
2019-04-09 02:00:00+00:00 London GB London Westminster no2 67.0 µg/m3
2019-04-09 02:00:00+00:00 Antwerpen BE BETR801 no2 53.5 µg/m3
2019-04-09 02:00:00+00:00 Paris FR FR04014 no2 27.4 µg/m3
2019-04-09 03:00:00+00:00 London GB London Westminster no2 67.0 µg/m3

I want the values for the three stations as separate columns next to each other

In [11]: no2_subset.pivot(columns="location", values="value")


Out[11]:
location BETR801 FR04014 London Westminster
date.utc
2019-04-09 01:00:00+00:00 22.5 24.4 NaN
2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2019-04-09 03:00:00+00:00 NaN NaN 67.0

The pivot() function is purely reshaping of the data: a single value for each index/column combination is required.
As pandas supports plotting of multiple columns (see plotting tutorial) out of the box, the conversion from long to wide
table format enables the plotting of the different time series at the same time:

In [12]: no2.head()
Out[12]:
city country location parameter value unit
date.utc
2019-06-21 00:00:00+00:00 Paris FR FR04014 no2 20.0 µg/m3
2019-06-20 23:00:00+00:00 Paris FR FR04014 no2 21.8 µg/m3
2019-06-20 22:00:00+00:00 Paris FR FR04014 no2 26.5 µg/m3
2019-06-20 21:00:00+00:00 Paris FR FR04014 no2 24.9 µg/m3
2019-06-20 20:00:00+00:00 Paris FR FR04014 no2 21.4 µg/m3

In [13]: no2.pivot(columns="location", values="value").plot()


Out[13]: <AxesSubplot:xlabel='date.utc'>


Note: When the index parameter is not defined, the existing index (row labels) is used.
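As a hedged sketch of the index parameter, the same reshape with the index made explicit after converting it to a regular column:

# reset_index turns date.utc into a column so it can be named explicitly:
no2_subset.reset_index().pivot(index="date.utc", columns="location", values="value")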

For more information about pivot(), see the user guide section on pivoting DataFrame objects.

Pivot table

I want the mean concentrations for NO2 and PM2.5 in each of the stations in table form

In [14]: air_quality.pivot_table(
....: values="value", index="location", columns="parameter", aggfunc="mean"
....: )
....:
Out[14]:
parameter no2 pm25
location
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case,
the values on different time steps), pivot_table() can be used, providing an aggregation function (e.g. mean) on how
to combine these values.
Pivot table is a well known concept in spreadsheet software. When interested in summary columns for each variable
separately as well, set the margins parameter to True:

In [15]: air_quality.pivot_table(
....: values="value",
....: index="location",
....: columns="parameter",
....: aggfunc="mean",
....: margins=True,
....: )
....:
Out[15]:
parameter no2 pm25 All
location
BETR801 26.950920 23.169492 24.982353
FR04014 29.374284 NaN 29.374284
London Westminster 29.740050 13.443568 21.491708
All 29.430316 14.386849 24.222743

For more information about pivot_table(), see the user guide section on pivot tables.

Note: In case you are wondering, pivot_table() is indeed directly linked to groupby(). The same result can be
derived by grouping on both parameter and location:

air_quality.groupby(["parameter", "location"]).mean()

Have a look at groupby() in combination with unstack() at the user guide section on combining stats and groupby.
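
A minimal sketch of that equivalence, reshaping the grouped result back into the pivot_table layout with unstack():

# Same numbers as the pivot_table above, derived via groupby + unstack:
air_quality.groupby(["parameter", "location"])["value"].mean().unstack("parameter")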

Wide to long format

Starting again from the wide format table created in the previous section:

In [16]: no2_pivoted = no2.pivot(columns="location", values="value").reset_index()

In [17]: no2_pivoted.head()
Out[17]:
location date.utc BETR801 FR04014 London Westminster
0 2019-04-09 01:00:00+00:00 22.5 24.4 NaN
1 2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2 2019-04-09 03:00:00+00:00 54.5 34.2 67.0
3 2019-04-09 04:00:00+00:00 34.5 48.5 41.0
4 2019-04-09 05:00:00+00:00 46.5 59.5 41.0

I want to collect all air quality NO2 measurements in a single column (long format)

In [18]: no_2 = no2_pivoted.melt(id_vars="date.utc")

In [19]: no_2.head()
Out[19]:
date.utc location value
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5

The pandas.melt() method on a DataFrame converts the data table from wide format to long format. The column
headers become the variable names in a newly created column.
The solution is the short version on how to apply pandas.melt(). The method will melt all columns NOT mentioned
in id_vars together into two columns: a column with the column header names and a column with the values
themselves. The latter column gets the name value by default.
The pandas.melt() method can be defined in more detail:

In [20]: no_2 = no2_pivoted.melt(


....: id_vars="date.utc",
....: value_vars=["BETR801", "FR04014", "London Westminster"],
....: value_name="NO_2",
....: var_name="id_location",
....: )
....:

In [21]: no_2.head()
Out[21]:
date.utc id_location NO_2
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5

The result is the same, but defined in more detail:

• value_vars defines explicitly which columns to melt together
• value_name provides a custom column name for the values column instead of the default column name value
• var_name provides a custom column name for the column collecting the column header names. Otherwise it
takes the index name or the default variable
Hence, the arguments value_name and var_name are just user-defined names for the two generated columns. The
columns to melt are defined by id_vars and value_vars.
Conversion from wide to long format with pandas.melt() is explained in the user guide section on reshaping by melt.
• Sorting by one or more columns is supported by sort_values
• The pivot function is purely restructuring of the data, pivot_table supports aggregations
• The reverse of pivot (long to wide format) is melt (wide to long format)
A full overview is available in the user guide on the pages about reshaping and pivoting.

46 Chapter 1. Getting started


pandas: powerful Python data analysis toolkit, Release 1.4.4

In [1]: import pandas as pd

For this tutorial, air quality data about NO2 is used, made available by openaq and downloaded using the py-openaq
package.
The air_quality_no2_long.csv data set provides NO2 values for the measurement stations FR04014, BETR801
and London Westminster in respectively Paris, Antwerp and London.

In [2]: air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
   ...:                               parse_dates=True)
   ...:

In [3]: air_quality_no2 = air_quality_no2[["date.utc", "location",
   ...:                                    "parameter", "value"]]
   ...:

In [4]: air_quality_no2.head()
Out[4]:
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4

For this tutorial, air quality data about Particulate matter less than 2.5 micrometers is used, made available by openaq
and downloaded using the py-openaq package.
The air_quality_pm25_long.csv data set provides PM25 values for the measurement stations FR04014, BETR801
and London Westminster in respectively Paris, Antwerp and London.

In [5]: air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
   ...:                                parse_dates=True)
   ...:

In [6]: air_quality_pm25 = air_quality_pm25[["date.utc", "location",
   ...:                                      "parameter", "value"]]
   ...:

In [7]: air_quality_pm25.head()
Out[7]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5


How to combine data from multiple tables?

Concatenating objects

I want to combine the measurements of NO2 and PM25, two tables with a similar structure, in a single table

In [8]: air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

In [9]: air_quality.head()
Out[9]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5

The concat() function performs concatenation operations of multiple tables along one of the axes (row-wise or
column-wise).
By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let’s check the
shape of the original and the concatenated tables to verify the operation:

In [10]: print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)


Shape of the ``air_quality_pm25`` table: (1110, 4)

In [11]: print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)


Shape of the ``air_quality_no2`` table: (2068, 4)

In [12]: print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)


Shape of the resulting ``air_quality`` table: (3178, 4)

Hence, the resulting table has 3178 = 1110 + 2068 rows.

Note: The axis argument occurs in a number of pandas methods that can be applied along an axis. A DataFrame
has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running
horizontally across columns (axis 1). Most operations like concatenation or summary statistics are by default across
rows (axis 0), but can be applied across columns as well.
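A quick sketch of the same concat along the other axis (the shape is indicative only, assuming the tables loaded above):

# axis=1 places the tables side by side, aligning on the row index:
wide = pd.concat([air_quality_pm25, air_quality_no2], axis=1)
wide.shape  # roughly (2068, 8): the union of row indexes, all columns kept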

Sorting the table on the datetime information illustrates also the combination of both tables, with the parameter
column defining the origin of the table (either no2 from table air_quality_no2 or pm25 from table
air_quality_pm25):

In [13]: air_quality = air_quality.sort_values("date.utc")

In [14]: air_quality.head()
Out[14]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0

In this specific example, the parameter column provided by the data ensures that each of the original tables can be
identified. This is not always the case. The concat function provides a convenient solution with the keys argument,
adding an additional (hierarchical) row index. For example:

In [15]: air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])

In [16]: air_quality_.head()
Out[16]:
date.utc location parameter value
PM25 0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5

Note: The existence of multiple row/column indices at the same time has not been mentioned within these tutorials.
Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.
Multi-indexing is out of scope for this pandas introduction. For the moment, remember that the function reset_index
can be used to convert any level of an index to a column, e.g. air_quality.reset_index(level=0)
Feel free to dive into the world of multi-indexing at the user guide section on advanced indexing.
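A small sketch of that reset_index usage on the keyed table from above:

# Move the outer key level (PM25/NO2) back into a regular column:
air_quality_.reset_index(level=0).head()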

More options on table concatenation (row and column wise) and how concat can be used to define the logic (union or
intersection) of the indexes on the other axes is provided at the section on object concatenation.

Join tables using a common identifier

Add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements
table.

Warning: The air quality measurement station coordinates are stored in a data file air_quality_stations.csv,
downloaded using the py-openaq package.

In [17]: stations_coord = pd.read_csv("data/air_quality_stations.csv")

In [18]: stations_coord.head()
Out[18]:
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226


Note: The stations used in this example (FR04014, BETR801 and London Westminster) are just three entries listed
in the metadata table. We only want to add the coordinates of these three to the measurements table, each on the
corresponding rows of the air_quality table.

In [19]: air_quality.head()
Out[19]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0

In [20]: air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

In [21]: air_quality.head()
Out[21]:
                    date.utc            location parameter  value  coordinates.latitude  coordinates.longitude
0  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0              51.49467               -0.13193
1  2019-05-07 01:00:00+00:00             FR04014       no2   25.0              48.83724                2.39390
2  2019-05-07 01:00:00+00:00             FR04014       no2   25.0              48.83722                2.39390
3  2019-05-07 01:00:00+00:00             BETR801      pm25   12.5              51.20966                4.43182
4  2019-05-07 01:00:00+00:00             BETR801       no2   50.5              51.20966                4.43182

Using the merge() function, for each of the rows in the air_quality table, the corresponding coordinates are added
from the air_quality_stations_coord table. Both tables have the column location in common which is used as
a key to combine the information. By choosing the left join, only the locations available in the air_quality (left)
table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The merge function supports
multiple join options similar to database-style operations.
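For comparison, a hedged sketch of an inner join on the same two input tables (as they were loaded above, before the left join), which would keep only locations present in both:

# how="inner" drops measurement rows whose location is missing from the metadata:
pd.merge(air_quality, stations_coord, how="inner", on="location").head()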
Add the parameter full description and name, provided by the parameters metadata table, to the measurements table

Warning: The air quality parameters metadata are stored in a data file air_quality_parameters.csv, down-
loaded using the py-openaq package.

In [22]: air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")

In [23]: air_quality_parameters.head()
Out[23]:
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10

In [24]: air_quality = pd.merge(air_quality, air_quality_parameters,
   ....:                        how='left', left_on='parameter', right_on='id')
   ....:

In [25]: air_quality.head()
Out[25]:
                    date.utc            location parameter  ...    id                                        description   name
0  2019-05-07 01:00:00+00:00  London Westminster       no2  ...   no2                                   Nitrogen Dioxide    NO2
1  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide    NO2
2  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide    NO2
3  2019-05-07 01:00:00+00:00             BETR801      pm25  ...  pm25  Particulate matter less than 2.5 micrometers i...  PM2.5
4  2019-05-07 01:00:00+00:00             BETR801       no2  ...   no2                                   Nitrogen Dioxide    NO2

[5 rows x 9 columns]

Compared to the previous example, there is no common column name. However, the parameter column in the
air_quality table and the id column in the air_quality_parameters table both provide the measured variable
in a common format. The left_on and right_on arguments are used here (instead of just on) to make the link
between the two tables.
pandas also supports inner, outer, and right joins. More information on join/merge of tables is provided in the user
guide section on database style merging of tables. Or have a look at the comparison with SQL page.
• Multiple tables can be concatenated both column-wise and row-wise using the concat function.
• For database-like merging/joining of tables, use the merge function.
See the user guide for a full description of the various facilities to combine data tables.

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

For this tutorial, air quality data about NO2 and Particulate matter less than 2.5 micrometers is used, made available
by openaq and downloaded using the py-openaq package. The air_quality_no2_long.csv data set provides NO2
values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and
London.

In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")

In [4]: air_quality = air_quality.rename(columns={"date.utc": "datetime"})

In [5]: air_quality.head()
Out[5]:
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m3
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m3
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m3
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m3

In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)

How to handle time series data with ease?

Using pandas datetime properties

I want to work with the dates in the column datetime as datetime objects instead of plain text

In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

In [8]: air_quality["datetime"]
Out[8]:
0 2019-06-21 00:00:00+00:00
1 2019-06-20 23:00:00+00:00
2 2019-06-20 22:00:00+00:00
3 2019-06-20 21:00:00+00:00
4 2019-06-20 20:00:00+00:00
...
2063 2019-05-07 06:00:00+00:00
2064 2019-05-07 04:00:00+00:00
2065 2019-05-07 03:00:00+00:00
2066 2019-05-07 02:00:00+00:00
2067 2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]

Initially, the values in datetime are character strings and do not provide any datetime operations (e.g. extracting the
year, day of the week, ...). By applying the to_datetime function, pandas interprets the strings and converts these to
datetime (i.e. datetime64[ns, UTC]) objects. In pandas we call these datetime objects, similar to datetime.datetime
from the standard library, pandas.Timestamp.

Note: As many data sets do contain datetime information in one of the columns, pandas input functions like
pandas.read_csv() and pandas.read_json() can do the transformation to dates when reading the data using
the parse_dates parameter with a list of the columns to read as Timestamp:

pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

Why are these pandas.Timestamp objects useful? Let’s illustrate the added value with some example cases.
What is the start and end date of the time series data set we are working with?

In [9]: air_quality["datetime"].min(), air_quality["datetime"].max()


Out[9]:
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))


Using pandas.Timestamp for datetimes enables us to calculate with date information and make them comparable.
Hence, we can use this to get the length of our time series:

In [10]: air_quality["datetime"].max() - air_quality["datetime"].min()


Out[10]: Timedelta('44 days 23:00:00')

The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library and
defining a time duration.
The various time concepts supported by pandas are explained in the user guide section on time related concepts.
I want to add a new column to the DataFrame containing only the month of the measurement

In [11]: air_quality["month"] = air_quality["datetime"].dt.month

In [12]: air_quality.head()
Out[12]:
city country datetime location parameter value unit month
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3 6
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m3 6
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m3 6
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m3 6
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m3 6

By using Timestamp objects for dates, a lot of time-related properties are provided by pandas. For example the month,
but also year, weekofyear, quarter, ... All of these properties are accessible by the dt accessor.
An overview of the existing date properties is given in the time and date components overview table. More details
about the dt accessor to return datetime-like properties are explained in a dedicated section on the dt accessor.
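A couple more dt-accessor properties, as a quick sketch on the same column:

air_quality["datetime"].dt.year        # the year of each timestamp
air_quality["datetime"].dt.day_name()  # weekday names such as 'Tuesday'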
What is the average NO2 concentration for each day of the week for each of the measurement locations?

In [13]: air_quality.groupby(
....: [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
....:
Out[13]:
datetime location
0 BETR801 27.875000
FR04014 24.856250
London Westminster 23.969697
1 BETR801 22.214286
FR04014 30.999359
...
5 FR04014 25.266154
London Westminster 24.977612
6 BETR801 21.896552
FR04014 23.274306
London Westminster 24.859155
Name: value, Length: 21, dtype: float64

Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here, we
want to calculate a given statistic (e.g. mean NO2) for each weekday and for each measurement location. To group
on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp, which is
also accessible by the dt accessor. The grouping on both locations and weekdays can be done to split the calculation
of the mean on each of these combinations.


Danger: As we are working with a very short time series in these examples, the analysis does not provide a
long-term representative result!

Plot the typical NO2 pattern during the day of our time series of all stations together. In other words, what is the
average value for each hour of the day?

In [14]: fig, axs = plt.subplots(figsize=(12, 4))

In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
....: kind='bar', rot=0, ax=axs
....: )
....:
Out[15]: <AxesSubplot:xlabel='datetime'>

In [16]: plt.xlabel("Hour of the day"); # custom x label using matplotlib

In [17]: plt.ylabel("$NO_2 (µg/m^3)$");

Similar to the previous case, we want to calculate a given statistic (e.g. mean NO2) for each hour of the day and we can
use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas Timestamp,
which is also accessible by the dt accessor.

Datetime as index

In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurement locations
as a separate column:

In [18]: no_2 = air_quality.pivot(index="datetime", columns="location", values="value")

In [19]: no_2.head()
Out[19]:
location BETR801 FR04014 London Westminster
datetime
2019-05-07 01:00:00+00:00 50.5 25.0 23.0
2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2019-05-07 03:00:00+00:00 NaN 50.4 19.0
2019-05-07 04:00:00+00:00 NaN 61.9 16.0
2019-05-07 05:00:00+00:00 NaN 72.4 NaN


Note: By pivoting the data, the datetime information became the index of the table. In general, setting a column as an
index can be achieved by the set_index function.

Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not need
the dt accessor to get the time series properties, but have these properties available on the index directly:

In [20]: no_2.index.year, no_2.index.weekday


Out[20]:
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
...
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
dtype='int64', name='datetime', length=1033),
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
dtype='int64', name='datetime', length=1033))

Some other advantages are the convenient subsetting of time periods or the adapted time scale on plots. Let's apply this
on our data.
Create a plot of the NO2 values in the different stations from the 20th of May till the end of the 21st of May.

In [21]: no_2["2019-05-20":"2019-05-21"].plot();


By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.
More information on the DatetimeIndex and the slicing by using strings is provided in the section on time series
indexing.

Resample a time series to another frequency

Aggregate the current hourly time series values to the monthly maximum value in each of the stations.

In [22]: monthly_max = no_2.resample("M").max()

In [23]: monthly_max
Out[23]:
location BETR801 FR04014 London Westminster
datetime
2019-05-31 00:00:00+00:00 74.5 97.0 97.0
2019-06-30 00:00:00+00:00 52.5 84.7 52.0

A very powerful method on time series data with a datetime index is the ability to resample() time series to another
frequency (e.g., converting secondly data into 5-minutely data).
The resample() method is similar to a groupby operation:
• it provides a time-based grouping, by using a string (e.g. M, 5H, ...) that defines the target frequency
• it requires an aggregation function such as mean, max, ...
An overview of the aliases used to define time series frequencies is given in the offset aliases overview table.
When defined, the frequency of the time series is provided by the freq attribute:

In [24]: monthly_max.index.freq
Out[24]: <MonthEnd>
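
Other target frequencies work the same way; a short sketch on the same table (aliases as in the offset aliases table):

no_2.resample("W").mean()   # weekly means
no_2.resample("6H").max()   # 6-hourly maxima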

Make a plot of the daily mean NO2 value in each of the stations.

In [25]: no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));


More details on the power of time series resampling are provided in the user guide section on resampling.
• Valid date strings can be converted to datetime objects using the to_datetime function or as part of read functions.
• Datetime objects in pandas support calculations, logical operations and convenient date-related properties using
the dt accessor.
• A DatetimeIndex contains these date-related properties and supports convenient slicing.
• Resample is a powerful method to change the frequency of a time series.
A full overview on time series is given on the pages on time series and date functionality.

In [1]: import pandas as pd

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Indicates whether the passenger survived: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Ticket: Ticket number of the passenger.
• Fare: The ticket fare.
• Cabin: The cabin number of the passenger.
• Embarked: The port of embarkation.


In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S

[5 rows x 12 columns]

How to manipulate textual data?

Make all name characters lowercase.

In [4]: titanic["Name"].str.lower()
Out[4]:
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss. laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss. margaret edith
888 johnston, miss. catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object

To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of
data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.
Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are
available when using the str accessor. These methods generally have names matching the equivalent built-in
string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of
the values of the columns.
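A minimal sketch of this correspondence (the Series content is illustrative):

name = "Braund"
name.lower()  # built-in string method on a single element: 'braund'

s = pd.Series(["Braund", "Cumings"])
s.str.lower()  # the same method via the str accessor, applied to every value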
Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.

In [5]: titanic["Name"].str.split(",")
Out[5]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object

Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is
the part before the comma and the second element is the part after the comma.

In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

In [7]: titanic["Surname"]
Out[7]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object

As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and
apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to combine
multiple operations at once!
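As an alternative sketch (for illustration only, not part of the exercise), passing expand=True to Series.str.split() returns the split parts as columns of a DataFrame, so the surname is simply column 0:

# each split part becomes its own column; column 0 holds the surnames
titanic["Name"].str.split(",", expand=True)[0]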
More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.
Extract the passenger data about the countesses aboard the Titanic.

In [8]: titanic["Name"].str.contains("Countess")
Out[8]:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Name, Length: 891, dtype: bool


In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]:
     PassengerId  Survived  Pclass                                               Name     Sex  ...  Ticket  Fare Cabin Embarked Surname
759          760         1       1  Rothes, the Countess. of (Lucy Noel Martha Dye...  female  ...  110152  86.5   B77        S  Rothes

[1 rows x 13 columns]

(Interested in her story? See Wikipedia!)


The string method Series.str.contains() checks for each of the values in the column Name whether the string contains
the word Countess, and returns True (Countess is part of the name) or False (Countess is not part of the name) for
each value. This output can be used to subselect the data using conditional (boolean) indexing introduced in
the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.

Note: More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.
extract() methods accept regular expressions, but these are out of scope of this tutorial.

More information on extracting parts of strings is available in the user guide section on string matching and extracting.
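As an illustrative sketch of such an extraction (the regular expression here is an assumption, not part of the tutorial), the title can be pulled out of each name:

# "Braund, Mr. Owen Harris" -> "Mr"
titanic["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)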
Which passenger of the Titanic has the longest name?

In [10]: titanic["Name"].str.len()
Out[10]:
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64

To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas
string methods, the Series.str.len() function is applied to each of the names individually (element-wise).

In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307

Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is
the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is
used.

In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]


Out[12]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

Based on the index label of the row (307) and the column name (Name), we can make a selection using the loc operator,
introduced in the tutorial on subsetting.


In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.

In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

In [14]: titanic["Sex_short"]
Out[14]:
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex_short, Length: 891, dtype: object

Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate
certain values. It requires a dictionary to define the mapping {from : to}.

Warning: There is also a replace() method available on the str accessor to replace a specific set of characters. However,
with a mapping of multiple values, this would become:
titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

This quickly becomes cumbersome and easily leads to mistakes. Just think (or try out yourself) what would happen if
those two statements were applied in the opposite order...
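A minimal sketch of that pitfall (values illustrative): replacing "male" first also hits the substring inside "female", so the second replacement never matches:

s = pd.Series(["male", "female"])
s.str.replace("male", "M").str.replace("female", "F")
# 0      M
# 1    feM   <- "female" already became "feM" before "female" could be replaced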

• String methods are available using the str accessor.


• String methods work element-wise and can be used for conditional indexing.
• The replace method is a convenient method to convert values according to a given dictionary.
A full overview is provided in the user guide pages on working with text data.

1.4.4 Comparison with other tools

Comparison with R / R libraries

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page
was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas.
In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code
comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.


For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External compatibility for
an example.

Quick reference

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, filtering, sampling

R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:9]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)

Sorting

R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)

Transforming

R pandas
select(df, col_one = col1) df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1) df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b) df.assign(c=df['a']-df['b'])
1 R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas, if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.


Grouping and summarizing

R pandas
summary(df) df.describe()
gdf <- group_by(df, col1) gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE)) df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1)) df.groupby('col1').sum()

Base R

Slicing with R’s c

R makes it easy to access data.frame columns by name

df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))


df[, c("a", "c", "e")]

or by integer location

df <- data.frame(matrix(rnorm(1000), ncol=100))


df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list("abc"))

In [2]: df[["a", "c"]]


Out[2]:
a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
4 0.721555 -1.039575
5 0.271860 0.567020
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892

In [3]: df.loc[:, ["a", "c"]]


Out[3]:
a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
4 0.721555 -1.039575
5 0.271860 0.567020
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer
attribute and numpy.r_.

In [4]: named = list("abcdefg")

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]


Out[8]:
           a         b         c         d         e         f  ...        24        25        26        27        28        29
0  -1.344312  0.844885  1.075770 -0.109050  1.643563 -1.469388  ... -1.170299 -0.226169  0.410835  0.813850  0.132003 -0.827317
1  -0.076467 -1.187678  1.130127 -1.436737 -1.413681  1.607920  ...  0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
2   0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849  ...  0.084844  0.432390  1.519970 -0.493662  0.600178  0.274230
3   0.132885 -0.023688  2.410179  1.450520  0.206053 -0.251905  ... -2.484478 -0.281461  0.030711  0.109121  1.126203 -0.977349
4   1.474071 -0.064034 -1.282782  0.781836 -1.071357  0.441153  ... -1.197071 -1.066969 -0.303421 -0.858447  0.306996 -0.028665
..       ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...
25  1.492125 -0.068190  0.681456  1.221829 -0.434352  1.204815  ...  1.944517  0.042344 -0.307904  0.428572  0.880609  0.487645
26  0.725238  0.624607 -0.141185 -0.143948 -0.328162  2.095086  ... -0.846188  1.190624  0.778507  1.008500  1.424017  0.717110
27  1.262419  1.950057  0.301038 -0.933858  0.814946  0.181439  ... -1.341814  0.334281 -0.162227  1.007824  2.826008  1.458383
28 -1.585746 -0.899734  0.921494 -0.211762 -0.059182  0.058308  ...  0.403620 -0.026602 -0.240481  0.577223 -1.088417  0.326687
29 -0.986248  0.169729 -1.158091  1.019673  0.646039  0.917399  ... -1.209247 -0.671466  0.332872 -2.013086 -1.602549  0.333109

[30 rows x 16 columns]


aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting
it into groups by1 and by2:

df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

In [9]: df = pd.DataFrame(
...: {
...: "v1": [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: "v2": [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: "by1": ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: "by2": [
...: "wet",
...: "dry",
...: 99,
...: 95,
...: np.nan,
...: "damp",
...: 95,
...: 99,
...: "red",
...: 99,
...: np.nan,
...: np.nan,
...: ],
...: }
...: )
...:

In [10]: g = df.groupby(["by1", "by2"])

In [11]: g[["v1", "v2"]].mean()


Out[11]:
v1 v2
by1 by2
1 95 5.0 55.0
99 5.0 55.0
2 95 7.0 77.0
99 NaN NaN
big damp 3.0 33.0
blue dry 3.0 33.0
red red 4.0 44.0
wet 1.0 11.0

For more details and examples see the groupby documentation.


match / %in%

A common way to select data in R is using %in% which is defined using the function match. The operator %in% is used
to return a logical vector indicating if there is a match or not:

s <- 0:4
s %in% c(2,4)

The isin() method is similar to the R %in% operator:

In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])


Out[13]:
0 False
1 False
2 True
3 False
4 True
dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

s <- 0:4
match(s, c(2,4))
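
As a sketch of a pandas analogue (an assumption, not part of the original comparison), Index.get_indexer() plays the role of match, returning 0-based positions and -1 (rather than NA) where there is no match:

s = pd.Series([0, 1, 2, 3, 4])
pd.Index([2, 4]).get_indexer(s)
# array([-1, -1,  0, -1,  1])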

For more details and examples see the reshaping documentation.

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using
a data.frame called baseball, and retrieving information based on the array team:

baseball <-
  data.frame(team = gl(5, 5,
                       labels = paste("Team", LETTERS[1:5])),
             player = sample(letters, 25),
             batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)

In pandas we may use the pivot_table() method to handle this:

In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
....: {
....: "team": ["team %d" % (x + 1) for x in range(5)] * 5,
....: "player": random.sample(list(string.ascii_lowercase), 25),
....: "batting avg": np.random.uniform(0.200, 0.400, 25),
....: }
....: )
....:

In [17]: baseball.pivot_table(values="batting avg", columns="team", aggfunc=np.max)


Out[17]:
team team 1 team 2 team 3 team 4 team 5
batting avg 0.352134 0.295327 0.397191 0.394457 0.396194

For more details and examples see the reshaping documentation.

subset

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame
where one column’s values are less than another column’s values:

df <- data.frame(a=rnorm(10), b=rnorm(10))


subset(df, a <= b)
df[df$a <= df$b,] # note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an
index/slice as well as standard boolean indexing:

In [18]: df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

In [19]: df.query("a <= b")


Out[19]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

In [20]: df[df["a"] <= df["b"]]


Out[20]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

In [21]: df.loc[df["a"] <= df["b"]]


Out[21]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

For more details and examples see the query documentation.

with

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))


with(df, a + b)
df$a + df$b # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [22]: df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

In [23]: df.eval("a + b")


Out[23]:
0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
4 -0.261740
5 2.149503
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64

In [24]: df["a"] + df["b"] # same as the previous expression


Out[24]:
0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
4 -0.261740
5 2.149503
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval
documentation.


plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data
structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data structures
could be mapped in Python.

R Python
array list
lists dictionary or list of objects
data.frame dataframe

ddply

An expression using a data.frame called df in R where you want to summarize x by month:

require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:
In [25]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 120),
....: "y": np.random.uniform(7.0, 334.0, 120),
....: "z": np.random.uniform(1.7, 20.7, 120),
....: "month": [5, 6, 7, 8] * 30,
....: "week": np.random.randint(1, 4, 120),
....: }
....: )
....:

In [26]: grouped = df.groupby(["month", "week"])

In [27]: grouped["x"].agg([np.mean, np.std])


Out[27]:
mean std
month week
5 1 63.653367 40.601965
2 78.126605 53.342400
3 92.091886 57.630110
6 1 81.747070 54.339218
2 70.971205 54.687287
3 100.968344 54.010081
7 1 61.576332 38.844274
2 61.733510 48.209013
3 71.688795 37.595638
8 1 62.741922 34.618153
2 91.774627 49.790202
3 73.936856 60.773900

For more details and examples see the groupby documentation.

reshape / reshape2

meltarray

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

a <- array(c(1:23, NA), c(2,3,4))


data.frame(melt(a))

In Python, a is a NumPy array, so you can melt it by enumerating its elements with np.ndenumerate inside a list comprehension.

In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])


Out[29]:
0 1 2 3
0 0 0 0 1.0
1 0 0 1 2.0
2 0 0 2 3.0
3 0 0 3 4.0
4 0 1 0 5.0
.. .. .. .. ...
19 1 1 3 20.0
20 1 2 0 21.0
21 1 2 1 22.0
22 1 2 2 23.0
23 1 2 3 NaN

[24 rows x 4 columns]

meltlist

An expression using a list called a in R where you want to melt it into a data.frame:

a <- as.list(c(1:4, NA))


data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.


In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN

For more details and examples see the Intro to Data Structures documentation.

meltdf

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))

In Python, the melt() method is the R equivalent:

In [32]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: }
....: )
....:

In [33]: pd.melt(cheese, id_vars=["first", "last"])


Out[33]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

In [34]: cheese.set_index(["first", "last"]).stack() # alternative way


Out[34]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64

For more details and examples see the reshaping documentation.

cast

In R, acast casts a data.frame called df into a higher-dimensional array:

df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))


acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():

In [35]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 12),
....: "y": np.random.uniform(7.0, 334.0, 12),
....: "z": np.random.uniform(1.7, 20.7, 12),
....: "month": [5, 6, 7] * 4,
....: "week": [1, 2] * 6,
....: }
....: )
....:

In [36]: mdf = pd.melt(df, id_vars=["month", "week"])

In [37]: pd.pivot_table(
....: mdf,
....: values="value",
....: index=["variable", "week"],
....: columns=["month"],
....: aggfunc=np.mean,
....: )
....:
Out[37]:
month 5 6 7
variable week
x 1 93.888747 98.762034 55.219673
2 94.391427 38.112932 83.942781
y 1 94.306912 279.454811 227.840449
2 87.392662 193.028166 173.899260
z 1 11.016009 10.079307 16.170549
2 8.476111 17.638509 19.003494


Similarly, dcast uses a data.frame called df in R to aggregate information based on Animal and FeedType:

df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)

dcast(df, Animal ~ FeedType, sum, fill=NaN)


# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

In [38]: df = pd.DataFrame(
....: {
....: "Animal": [
....: "Animal1",
....: "Animal2",
....: "Animal3",
....: "Animal2",
....: "Animal1",
....: "Animal2",
....: "Animal3",
....: ],
....: "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
....: "Amount": [10, 7, 4, 2, 5, 6, 2],
....: }
....: )
....:

In [39]: df.pivot_table(values="Amount", index="Animal", columns="FeedType", aggfunc="sum")

Out[39]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN

The second approach is to use the groupby() method:

In [40]: df.groupby(["Animal", "FeedType"])["Amount"].sum()


Out[40]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.


factor

pandas has a data type for categorical data.

cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)


Out[41]:
0 (0.995, 2.667]
1 (0.995, 2.667]
2 (2.667, 4.333]
3 (2.667, 4.333]
4 (4.333, 6.0]
5 (4.333, 6.0]
dtype: category
Categories (3, interval[float64, right]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")


Out[42]:
0 1
1 2
2 3
3 2
4 2
5 3
dtype: category
Categories (3, int64): [1, 2, 3]

For more details and examples see categorical introduction and the API documentation. There is also documentation
on the differences relative to R’s factor.

Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of
how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame
called tips and assume we have a database table of the same name and structure.

In [3]: url = (
   ...:     "https://raw.github.com/pandas-dev"
   ...:     "/pandas/main/pandas/tests/io/data/csv/tips.csv"
   ...: )
   ...:

In [4]: tips = pd.read_csv(url)

In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[244 rows x 7 columns]

Copies vs. in place operations

Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either
assign to a new variable:

sorted_df = df.sort_values("col1")

or overwrite the original one:

df = df.sort_values("col1")

Note: You will see an inplace=True keyword argument available for some methods:

df.sort_values("col1", inplace=True)

Its use is discouraged. More information.


SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):

SELECT total_bill, tip, smoker, time
FROM tips;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]: tips[["total_bill", "tip", "smoker", "time"]]


Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
.. ... ... ... ...
239 29.03 5.92 No Dinner
240 27.18 2.00 Yes Dinner
241 22.67 2.00 Yes Dinner
242 17.82 1.75 No Dinner
243 18.78 3.00 No Dinner

[244 rows x 4 columns]

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
In SQL, you can add a calculated column:

SELECT *, tip/total_bill as tip_rate
FROM tips;

With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column:

In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])


Out[7]:
total_bill tip sex smoker day time size tip_rate
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
.. ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744

[244 rows x 8 columns]


WHERE

Filtering in SQL is done via a WHERE clause.

SELECT *
FROM tips
WHERE time = 'Dinner';

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

In [8]: tips[tips["total_bill"] > 10]


Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[227 rows x 7 columns]

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.

In [9]: is_dinner = tips["time"] == "Dinner"

In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool

In [11]: is_dinner.value_counts()
Out[11]:
True 176
False 68
Name: time, dtype: int64

In [12]: tips[is_dinner]
Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[176 rows x 7 columns]

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
Tips of more than $5 at Dinner meals:

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]


Out[13]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3

Tips by parties of at least 5 diners OR bill total was more than $45:

SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;

In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]


Out[14]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5

NULL checking is done using the notna() and isna() methods.

In [15]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.NaN, "C", "D"], "col2": ["F", np.NaN, "G", "H", "I"]}
....: )
....:

In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I

Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS
NULL with the following query:

SELECT *
FROM frame
WHERE col2 IS NULL;

In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN

Getting items where col1 IS NOT NULL can be done with notna().

SELECT *
FROM frame
WHERE col1 IS NOT NULL;

In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I

GROUP BY

In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby()
typically refers to a process where we’d like to split a dataset into groups, apply some function (typically an
aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a
query getting us the number of tips left by sex:

SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/

The pandas equivalent would be:

In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64

Notice that in the pandas code we used size() and not count(). This is because count() applies the function to
each column, returning the number of NOT NULL records within each.

In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157

Alternatively, we could have applied the count() method to an individual column:

In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of
the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to
specific columns.


SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/

In [22]: tips.groupby("day").agg({"tip": np.mean, "day": np.size})


Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thu 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thu 17 3.030000
*/

In [23]: tips.groupby(["smoker", "day"]).agg({"tip": [np.size, np.mean]})


Out[23]:
tip
size mean
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000


JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each
method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns
to join on (column names or indices).

Warning: If both key columns contain rows where the key is a null value, those rows will be matched against each
other. This is different from usual SQL join behaviour and can lead to unexpected results.
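A minimal sketch of that behaviour (the frames below are illustrative):

left = pd.DataFrame({"key": ["A", np.nan], "lval": [1, 2]})
right = pd.DataFrame({"key": ["B", np.nan], "rval": [3, 4]})
pd.merge(left, right, on="key")
#   key  lval  rval
# 0 NaN     2     4   <- the NaN keys matched each other, unlike NULLs in SQL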

In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.

INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;

# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s
index.

In [27]: indexed_df2 = df2.set_index("key")

In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)


Out[28]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
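
The same column-to-index match can also be sketched with DataFrame.join() (the suffixes are needed here because both frames carry a value column):

# joins df1["key"] against the index of indexed_df2
df1.join(indexed_df2, on="key", how="inner", lsuffix="_x", rsuffix="_y")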


LEFT OUTER JOIN

Show all records from df1.

SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;

In [29]: pd.merge(df1, df2, on="key", how="left")


Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

RIGHT JOIN

Show all records from df2.

SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;

In [30]: pd.merge(df1, df2, on="key", how="right")


Out[30]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a
match. As of writing, FULL JOINs are not supported in all RDBMS (e.g. MySQL).
Show all records from both tables.

SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;

In [31]: pd.merge(df1, df2, on="key", how="outer")


Out[31]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

UNION

UNION ALL can be performed using concat().

In [32]: df1 = pd.DataFrame(
....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
....: )
....:

In [33]: df2 = pd.DataFrame(
....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
....: )
....:

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/

In [34]: pd.concat([df1, df2])


Out[34]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5

SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/

In pandas, you can use concat() in conjunction with drop_duplicates().

In [35]: pd.concat([df1, df2]).drop_duplicates()


Out[35]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5

LIMIT

SELECT * FROM tips
LIMIT 10;

In [36]: tips.head(10)
Out[36]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2


pandas equivalents for some SQL analytic and aggregate functions

Top n rows with offset

-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;

In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)


Out[37]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
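
Up to tie-breaking, the same result can be sketched with an explicit sort followed by a positional slice:

# rows 5 through 14 after sorting by tip in descending order
tips.sort_values("tip", ascending=False).iloc[5:15]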

Top n rows per group

-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

In [38]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2