pandas: powerful Python data analysis toolkit
Release 1.4.4
1 Getting started 3
1.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Intro to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Coming from… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 Package overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Getting started tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.4 Comparison with other tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.4.5 Community tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
2.3.11 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
2.3.12 Copying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
2.3.13 dtypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
2.3.14 Selecting columns based on dtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
2.4 IO tools (text, CSV, HDF5, . . . ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
2.4.1 CSV & text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
2.4.2 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
2.4.3 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
2.4.4 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
2.4.5 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
2.4.6 Excel files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
2.4.7 OpenDocument Spreadsheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
2.4.8 Binary Excel (.xlsb) files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
2.4.9 Clipboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
2.4.10 Pickling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
2.4.11 msgpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
2.4.12 HDF5 (PyTables) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
2.4.13 Feather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
2.4.14 Parquet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
2.4.15 ORC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
2.4.16 SQL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
2.4.17 Google BigQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
2.4.18 Stata format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
2.4.19 SAS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
2.4.20 SPSS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.4.21 Other file formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.4.22 Performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
2.5 Indexing and selecting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
2.5.1 Different choices for indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
2.5.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
2.5.3 Attribute access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
2.5.4 Slicing ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
2.5.5 Selection by label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
2.5.6 Selection by position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
2.5.7 Selection by callable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
2.5.8 Combining positional and label-based indexing . . . . . . . . . . . . . . . . . . . . . . . . 428
2.5.9 Indexing with list with missing labels is deprecated . . . . . . . . . . . . . . . . . . . . . . 429
2.5.10 Selecting random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
2.5.11 Setting with enlargement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
2.5.12 Fast scalar value getting and setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
2.5.13 Boolean indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
2.5.14 Indexing with isin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
2.5.15 The where() Method and Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
2.5.16 Setting with enlargement conditionally using numpy() . . . . . . . . . . . . . . . . . . . . 444
2.5.17 The query() Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
2.5.18 Duplicate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
2.5.19 Dictionary-like get() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.20 Looking up values by index/column labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.21 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
2.5.22 Set / reset index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
2.5.23 Returning a view versus a copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
2.6 MultiIndex / advanced indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
2.6.1 Hierarchical indexing (MultiIndex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
2.6.2 Advanced indexing with hierarchical index . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
2.6.3 Sorting a MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
2.6.4 Take methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
2.6.5 Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
2.6.6 Miscellaneous indexing FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
2.7 Merge, join, concatenate and compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
2.7.1 Concatenating objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
2.7.2 Database-style DataFrame or named Series joining/merging . . . . . . . . . . . . . . . . . 518
2.7.3 Timeseries friendly merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
2.7.4 Comparing objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
2.8 Reshaping and pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
2.8.1 Reshaping by pivoting DataFrame objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
2.8.2 Reshaping by stacking and unstacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
2.8.3 Reshaping by melt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
2.8.4 Combining with stats and GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
2.8.5 Pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
2.8.6 Cross tabulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
2.8.7 Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
2.8.8 Computing indicator / dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
2.8.9 Factorizing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
2.8.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
2.8.11 Exploding a list-like column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
2.9 Working with text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
2.9.1 Text data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
2.9.2 String methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
2.9.3 Splitting and replacing strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
2.9.4 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
2.9.5 Indexing with .str . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
2.9.6 Extracting substrings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
2.9.7 Testing for strings that match or contain a pattern . . . . . . . . . . . . . . . . . . . . . . . 595
2.9.8 Creating indicator variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
2.9.9 Method summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
2.10 Working with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
2.10.1 Values considered “missing” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
2.10.2 Inserting missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.10.3 Calculations with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
2.10.4 Sum/prod of empties/nans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
2.10.5 NA values in GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
2.10.6 Filling missing values: fillna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
2.10.7 Filling with a PandasObject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
2.10.8 Dropping axis labels with missing data: dropna . . . . . . . . . . . . . . . . . . . . . . . . 609
2.10.9 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
2.10.10 Replacing generic values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
2.10.11 String/regular expression replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
2.10.12 Numeric replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
2.10.13 Experimental NA scalar to denote missing values . . . . . . . . . . . . . . . . . . . . . . . . 625
2.11 Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
2.11.1 Consequences of Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
2.11.2 Duplicate Label Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
2.11.3 Disallowing Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
2.12 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
2.12.1 Object creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
2.12.2 CategoricalDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
2.12.3 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
2.12.4 Working with categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
2.12.5 Sorting and order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
2.12.6 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
2.12.7 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
2.12.8 Data munging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
2.12.9 Getting data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
2.12.10 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
2.12.11 Differences to R’s factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
2.12.12 Gotchas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
2.13 Nullable integer data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
2.13.1 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
2.13.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
2.13.3 Scalar NA Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14 Nullable Boolean data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14.1 Indexing with NA values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
2.14.2 Kleene logical operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
2.15 Chart Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
2.15.1 Basic plotting: plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
2.15.2 Other plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
2.15.3 Plotting with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
2.15.4 Plotting tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
2.15.5 Plot formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
2.15.6 Plotting directly with matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
2.15.7 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
2.16 Table Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
2.16.1 Styler Object and HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
2.16.2 Formatting the Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
2.16.3 Methods to Add Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
2.16.4 Table Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
2.16.5 Setting Classes and Linking to External CSS . . . . . . . . . . . . . . . . . . . . . . . . . 755
2.16.6 Styler Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
2.16.7 Tooltips and Captions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
2.16.8 Finer Control with Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
2.16.9 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
2.16.10 Builtin Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
2.16.11 Sharing styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
2.16.12 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
2.16.13 Other Fun and Useful Stuff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
2.16.14 Export to Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
2.16.15 Export to LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
2.16.16 More About CSS and HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
2.16.17 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
2.17 Computational tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
2.17.1 Statistical functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
2.18 Group by: split-apply-combine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
2.18.1 Splitting an object into groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
2.18.2 Iterating through groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
2.18.3 Selecting a group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
2.18.4 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
2.18.5 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
2.18.6 Filtration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
2.18.7 Dispatching to instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
2.18.8 Flexible apply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
2.18.9 Numba Accelerated Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
2.18.10 Other useful features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
2.18.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
2.19 Windowing Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
2.19.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
2.19.2 Rolling window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
2.19.3 Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
2.19.4 Expanding window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
2.19.5 Exponentially Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837
2.20 Time series / date functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
2.20.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
2.20.2 Timestamps vs. time spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 842
2.20.3 Converting to timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
2.20.4 Generating ranges of timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
2.20.5 Timestamp limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
2.20.6 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 852
2.20.7 Time/date components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
2.20.8 DateOffset objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
2.20.9 Time series-related instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
2.20.10 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
2.20.11 Time span representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
2.20.12 Converting between representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
2.20.13 Representing out-of-bounds spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
2.20.14 Time zone handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
2.21 Time deltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
2.21.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
2.21.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
2.21.3 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
2.21.4 Frequency conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
2.21.5 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 918
2.21.6 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
2.21.7 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22 Options and settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
2.22.2 Available options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926
2.22.3 Getting and setting options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933
2.22.4 Setting startup options in Python/IPython environment . . . . . . . . . . . . . . . . . . . . 934
2.22.5 Frequently used options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934
2.22.6 Number formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 941
2.22.7 Unicode formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 942
2.22.8 Table schema display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
2.23 Enhancing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
2.23.1 Cython (writing C extensions for pandas) . . . . . . . . . . . . . . . . . . . . . . . . . . . 944
2.23.2 Numba (JIT compilation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
2.23.3 Expression evaluation via eval() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
2.24 Scaling to large datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
2.24.1 Load less data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
2.24.2 Use efficient datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
2.24.3 Use chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
2.24.4 Use other libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
2.25 Sparse data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
2.25.1 SparseArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
2.25.2 SparseDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
2.25.3 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
2.25.4 Sparse calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 972
2.25.5 Migrating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
2.25.6 Interaction with scipy.sparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975
2.26 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
2.26.1 DataFrame memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
2.26.2 Using if/truth statements with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
2.26.3 Mutating with User Defined Function (UDF) methods . . . . . . . . . . . . . . . . . . . . . 983
2.26.4 NaN, Integer NA values and NA type promotions . . . . . . . . . . . . . . . . . . . . . . . . 984
2.26.5 Differences with NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.26.6 Thread-safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.26.7 Byte-ordering issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.27 Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
2.27.1 Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 988
2.27.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992
2.27.3 Multiindexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
2.27.4 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
2.27.5 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1001
2.27.6 Timeseries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
2.27.7 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
2.27.8 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
2.27.9 Data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1017
2.27.10 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1023
2.27.11 Timedeltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024
2.27.12 Creating example data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1026
3.3.3 Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1446
3.3.4 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1447
3.3.5 Binary operator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1449
3.3.6 Function application, GroupBy & window . . . . . . . . . . . . . . . . . . . . . . . . . . . 1450
3.3.7 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1450
3.3.8 Reindexing / selection / label manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1452
3.3.9 Missing data handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.10 Reshaping, sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.11 Combining / comparing / joining / merging . . . . . . . . . . . . . . . . . . . . . . . . . . 1453
3.3.12 Time Series-related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1454
3.3.13 Accessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1454
3.3.14 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1572
3.3.15 Serialization / IO / conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4 DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1621
3.4.2 Attributes and underlying data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1964
3.4.3 Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1964
3.4.4 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1965
3.4.5 Binary operator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1966
3.4.6 Function application, GroupBy & window . . . . . . . . . . . . . . . . . . . . . . . . . . . 1967
3.4.7 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1967
3.4.8 Reindexing / selection / label manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1969
3.4.9 Missing data handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1970
3.4.10 Reshaping, sorting, transposing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1970
3.4.11 Combining / comparing / joining / merging . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.12 Time Series-related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.13 Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1971
3.4.14 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1972
3.4.15 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1972
3.4.16 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2026
3.4.17 Serialization / IO / conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2028
3.5 pandas arrays, scalars, and data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2028
3.5.1 pandas.array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2029
3.5.2 Datetime data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2032
3.5.3 Timedelta data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2065
3.5.4 Timespan data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.5.5 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.5.6 Interval data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2093
3.5.7 Nullable integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2107
3.5.8 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2112
3.5.9 Sparse data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2118
3.5.10 Text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2120
3.5.11 Boolean data with missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2123
3.6 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2125
3.6.1 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2125
3.6.2 Numeric Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2193
3.6.3 CategoricalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2197
3.6.4 IntervalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2207
3.6.5 MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2219
3.6.6 DatetimeIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2239
3.6.7 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2274
3.6.8 PeriodIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2284
3.7 Date offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2292
3.7.1 DateOffset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2292
3.7.2 BusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2298
3.7.3 BusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2305
3.7.4 CustomBusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2311
3.7.5 CustomBusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2318
3.7.6 MonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2324
3.7.7 MonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2329
3.7.8 BusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2334
3.7.9 BusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2340
3.7.10 CustomBusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2345
3.7.11 CustomBusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2352
3.7.12 SemiMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2359
3.7.13 SemiMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2364
3.7.14 Week . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2370
3.7.15 WeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2375
3.7.16 LastWeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2381
3.7.17 BQuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2387
3.7.18 BQuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2393
3.7.19 QuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2399
3.7.20 QuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2404
3.7.21 BYearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2410
3.7.22 BYearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2416
3.7.23 YearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2421
3.7.24 YearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2426
3.7.25 FY5253 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2431
3.7.26 FY5253Quarter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2438
3.7.27 Easter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2445
3.7.28 Tick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2451
3.7.29 Day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2456
3.7.30 Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2462
3.7.31 Minute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2467
3.7.32 Second . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2473
3.7.33 Milli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2478
3.7.34 Micro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2484
3.7.35 Nano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2489
3.8 Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2495
3.8.1 pandas.tseries.frequencies.to_offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2495
3.9 Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2496
3.9.1 Rolling window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2496
3.9.2 Weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2514
3.9.3 Expanding window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2516
3.9.4 Exponentially-weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2530
3.9.5 Window indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2534
3.10 GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2537
3.10.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2537
3.10.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2542
3.10.3 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2554
3.11 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2610
3.11.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2610
3.11.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2613
3.11.3 Upsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2618
3.11.4 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2630
3.12 Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2636
3.12.1 Styler constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2636
3.12.2 Styler properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2679
3.12.3 Style application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2680
3.12.4 Builtin styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.12.5 Style export and import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.13 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2681
3.13.1 pandas.plotting.andrews_curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2682
3.13.2 pandas.plotting.autocorrelation_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2682
3.13.3 pandas.plotting.bootstrap_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2684
3.13.4 pandas.plotting.boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2686
3.13.5 pandas.plotting.deregister_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . 2692
3.13.6 pandas.plotting.lag_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2692
3.13.7 pandas.plotting.parallel_coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2695
3.13.8 pandas.plotting.plot_params . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2695
3.13.9 pandas.plotting.radviz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2697
3.13.10 pandas.plotting.register_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . . . 2698
3.13.11 pandas.plotting.scatter_matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2699
3.13.12 pandas.plotting.table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14 General utility functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14.1 Working with options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2701
3.14.2 Testing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2720
3.14.3 Exceptions and warnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2725
3.14.4 Data types related functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2730
3.14.5 Bug report function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2758
3.15 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.1 pandas.api.extensions.register_extension_dtype . . . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.2 pandas.api.extensions.register_dataframe_accessor . . . . . . . . . . . . . . . . . . . . . . 2759
3.15.3 pandas.api.extensions.register_series_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2761
3.15.4 pandas.api.extensions.register_index_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2762
3.15.5 pandas.api.extensions.ExtensionDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2763
3.15.6 pandas.api.extensions.ExtensionArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2767
3.15.7 pandas.arrays.PandasArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2781
3.15.8 pandas.api.indexers.check_array_indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2781
4 Development 2785
4.1 Contributing to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2785
4.1.1 Where to start? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2786
4.1.2 Bug reports and enhancement requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2786
4.1.3 Working with the code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2787
4.1.4 Contributing your changes to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2788
4.1.5 Tips for a successful pull request . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2 Creating a development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2.1 Creating an environment using Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2791
4.2.2 Creating an environment without Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2792
4.3 Contributing to the documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2795
4.3.1 About the pandas documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2796
4.3.2 Updating a pandas docstring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2813
4.3.3 How to build the pandas documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2813
4.3.4 Previewing changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2814
4.4 Contributing to the code base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2815
4.4.1 Code standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2815
4.4.2 Pre-commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2816
4.4.3 Optional dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2816
4.4.4 Type hints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2820
4.4.5 Testing with continuous integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2822
4.4.6 Test-driven development/code writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2823
4.4.7 Running the test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2827
4.4.8 Running the performance test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2828
4.4.9 Documenting your code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
4.5 pandas code style guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
4.5.1 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.5.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.5.3 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2830
4.6 pandas maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.1 Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.3 Issue triage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2831
4.6.4 Closing issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2832
4.6.5 Reviewing pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.6 Backporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.7 Cleaning up old issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.8 Cleaning up old pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2833
4.6.9 Becoming a pandas maintainer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.6.10 Merging pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7 Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
4.7.2 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2836
4.8 Test organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2836
4.9 Debugging C extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2839
4.9.1 Using a debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2839
4.9.2 Checking memory leaks with valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10 Extending pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10.1 Registering custom accessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2840
4.10.2 Extension types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2841
4.10.3 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2844
4.10.4 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2847
4.11 Developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2848
4.11.1 Storing pandas DataFrame objects in Apache Parquet format . . . . . . . . . . . . . . . . . 2848
4.12 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.12.1 Version policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.12.2 Python support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.13 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2851
4.13.1 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.2 String data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.3 Consistent missing value handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.4 Apache Arrow interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2852
4.13.5 Block manager rewrite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.6 Decoupling of indexing and internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.7 Numba-accelerated operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.8 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2853
4.13.9 Roadmap evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.13.10 Completed items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.14 Developer meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2854
4.14.1 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2855
4.14.2 Calendar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2855
5.1.3 What’s new in 1.4.2 (April 2, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2861
5.1.4 What’s new in 1.4.1 (February 12, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2862
5.1.5 What’s new in 1.4.0 (January 22, 2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2864
5.2 Version 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2899
5.2.1 What’s new in 1.3.5 (December 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2899
5.2.2 What’s new in 1.3.4 (October 17, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2900
5.2.3 What’s new in 1.3.3 (September 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2901
5.2.4 What’s new in 1.3.2 (August 15, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2903
5.2.5 What’s new in 1.3.1 (July 25, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2904
5.2.6 What’s new in 1.3.0 (July 2, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2906
5.3 Version 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2945
5.3.1 What’s new in 1.2.5 (June 22, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2945
5.3.2 What’s new in 1.2.4 (April 12, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2946
5.3.3 What’s new in 1.2.3 (March 02, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2947
5.3.4 What’s new in 1.2.2 (February 09, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2948
5.3.5 What’s new in 1.2.1 (January 20, 2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2950
5.3.6 What’s new in 1.2.0 (December 26, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2953
5.4 Version 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2982
5.4.1 What’s new in 1.1.5 (December 07, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2982
5.4.2 What’s new in 1.1.4 (October 30, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2984
5.4.3 What’s new in 1.1.3 (October 5, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2986
5.4.4 What’s new in 1.1.2 (September 8, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 2988
5.4.5 What’s new in 1.1.1 (August 20, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2990
5.4.6 What’s new in 1.1.0 (July 28, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2992
5.5 Version 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3034
5.5.1 What’s new in 1.0.5 (June 17, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3034
5.5.2 What’s new in 1.0.4 (May 28, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3035
5.5.3 What’s new in 1.0.3 (March 17, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3036
5.5.4 What’s new in 1.0.2 (March 12, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3037
5.5.5 What’s new in 1.0.1 (February 5, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3040
5.5.6 What’s new in 1.0.0 (January 29, 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3042
5.6 Version 0.25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3081
5.6.1 What’s new in 0.25.3 (October 31, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3081
5.6.2 What’s new in 0.25.2 (October 15, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3082
5.6.3 What’s new in 0.25.1 (August 21, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3083
5.6.4 What’s new in 0.25.0 (July 18, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3086
5.7 Version 0.24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3124
5.7.1 What’s new in 0.24.2 (March 12, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3124
5.7.2 What’s new in 0.24.1 (February 3, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3127
5.7.3 What’s new in 0.24.0 (January 25, 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3128
5.8 Version 0.23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3185
5.8.1 What’s new in 0.23.4 (August 3, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3185
5.8.2 What’s new in 0.23.3 (July 7, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3186
5.8.3 What’s new in 0.23.2 (July 5, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3187
5.8.4 What’s new in 0.23.1 (June 12, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3190
5.8.5 What’s new in 0.23.0 (May 15, 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3194
5.9 Version 0.22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3245
5.9.1 Version 0.22.0 (December 29, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3245
5.10 Version 0.21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3249
5.10.1 Version 0.21.1 (December 12, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3249
5.10.2 Version 0.21.0 (October 27, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3254
5.11 Version 0.20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3287
5.11.1 Version 0.20.3 (July 7, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3287
5.11.2 Version 0.20.2 (June 4, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3290
5.11.3 Version 0.20.1 (May 5, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3294
5.12 Version 0.19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3343
5.12.1 Version 0.19.2 (December 24, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3343
5.12.2 Version 0.19.1 (November 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3346
5.12.3 Version 0.19.0 (October 2, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3349
5.13 Version 0.18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3395
5.13.1 Version 0.18.1 (May 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3395
5.13.2 Version 0.18.0 (March 13, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3415
5.14 Version 0.17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3451
5.14.1 Version 0.17.1 (November 21, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3451
5.14.2 Version 0.17.0 (October 9, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3459
5.15 Version 0.16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3489
5.15.1 Version 0.16.2 (June 12, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3489
5.15.2 Version 0.16.1 (May 11, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3495
5.15.3 Version 0.16.0 (March 22, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3508
5.16 Version 0.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3526
5.16.1 Version 0.15.2 (December 12, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3526
5.16.2 Version 0.15.1 (November 9, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3533
5.16.3 Version 0.15.0 (October 18, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3540
5.17 Version 0.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3573
5.17.1 Version 0.14.1 (July 11, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3573
5.17.2 Version 0.14.0 (May 31, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3580
5.18 Version 0.13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3611
5.18.1 Version 0.13.1 (February 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3611
5.18.2 Version 0.13.0 (January 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3623
5.19 Version 0.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3653
5.19.1 Version 0.12.0 (July 24, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3653
5.20 Version 0.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3666
5.20.1 Version 0.11.0 (April 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3666
5.21 Version 0.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3677
5.21.1 Version 0.10.1 (January 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3677
5.21.2 Version 0.10.0 (December 17, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3684
5.22 Version 0.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3696
5.22.1 Version 0.9.1 (November 14, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3696
5.22.2 Version 0.9.0 (October 7, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3700
5.23 Version 0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3703
5.23.1 Version 0.8.1 (July 22, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3703
5.23.2 Version 0.8.0 (June 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3704
5.24 Version 0.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3710
5.24.1 Version 0.7.3 (April 12, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3710
5.24.2 Version 0.7.2 (March 16, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3713
5.24.3 Version 0.7.1 (February 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3714
5.24.4 Version 0.7.0 (February 9, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3715
5.25 Version 0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3721
5.25.1 Version 0.6.1 (December 13, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3721
5.25.2 Version 0.6.0 (November 25, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3722
5.26 Version 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3724
5.26.1 Version 0.5.0 (October 24, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3724
5.27 Version 0.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3725
5.27.1 Versions 0.4.1 through 0.4.3 (September 25 - October 9, 2011) . . . . . . . . . . . . . . . . 3725
Bibliography 3727
Getting started
New to pandas? Check out the getting started guides. They contain an introduction to pandas’ main concepts and links
to additional tutorials.
User guide
The user guide provides in-depth information on the key concepts of pandas with useful background information and
explanation.
API reference
The reference guide contains a detailed description of the pandas API. The reference describes how the methods work
and which parameters can be used. It assumes that you have an understanding of the key concepts.
Developer guide
Saw a typo in the documentation? Want to improve existing functionalities? The contributing guidelines will guide
you through the process of improving pandas.
CHAPTER ONE
GETTING STARTED
1.1 Installation
Prefer pip?
pandas can be installed via pip from PyPI.
In-depth instructions?
Installing a specific version? Installing from source? Check the advanced installation page.
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you.
pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
pandas supports integration with many file formats and data sources out of the box (csv, excel, sql, json, parquet, ...).
Importing data from each of these sources is provided by functions with the prefix read_*. Similarly, the to_*
methods are used to store data.
Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting,
and extracting the data you need are available in pandas.
pandas provides plotting of your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter,
bar, boxplot, ...) corresponding to your data.
There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work
elementwise. Adding a column to a DataFrame based on existing data in other columns is straightforward.
Basic statistics (mean, median, min, max, counts. . . ) are easily calculable. These or custom aggregations can be
applied on the entire data set, a sliding window of the data, or grouped by categories. The latter is also known as the
split-apply-combine approach.
Change the structure of your data table in multiple ways. You can melt() your data table from wide to long/tidy form
or pivot() from long to wide format. With aggregations built-in, a pivot table is created with a single command.
Multiple tables can be concatenated both column-wise and row-wise, and database-like join/merge operations are
provided to combine multiple tables of data.
pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-indexed
data.
Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and extract
useful information from it.
Are you familiar with other software for manipulating tabular data? Learn the pandas-equivalent operations compared
to software you already know:
The R programming language provides the data.frame data structure and multiple packages, such as the tidyverse,
that use and extend data.frame for convenient data handling functionalities similar to pandas.
Already familiar with SELECT, GROUP BY, JOIN, etc.? Most of these SQL manipulations have equivalents in pandas.
The data set included in the STATA statistical software suite corresponds to the pandas DataFrame. Many of the
operations known from STATA have an equivalent in pandas.
Users of Excel or other spreadsheet programs will find that many of the concepts are transferable to pandas.
The SAS statistical software suite also provides a data set structure corresponding to the pandas DataFrame. SAS
vectorized operations, filtering, string processing operations, and more have similar functions in pandas.
1.4 Tutorials
1.4.1 Installation
The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross platform distribution for
data analysis and scientific computing. This is the recommended installation method for most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development version are
also provided.
Installing pandas
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up the SciPy stack
(IPython, NumPy, Matplotlib, . . . ) is with Anaconda, a cross-platform (Linux, macOS, Windows) Python distribution
for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without needing to install
anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.
A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage to installing Anaconda is that you don't need admin rights to install it. Anaconda can install in the
user's home directory, which makes it trivial to delete Anaconda later if you decide to (just delete that folder).
The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach
means you will install well over one hundred packages, and it involves downloading an installer which is a few hundred
megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then installing pandas
with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is both cross-
platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal, self-contained Python installation, and then use the conda command to
install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The
installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify
a specific version of Python and a set of libraries. Run the create command from a terminal window (the commands
themselves were lost from this copy; a consolidated sketch follows below). This will create a minimal environment
with only Python installed in it. To put yourself inside this environment run:
activate name_of_my_env
(on Linux/macOS: source activate name_of_my_env). The final step required is to install pandas. If you need packages
that are available to pip but not conda, then install pip, and then use pip to install those packages, as sketched below:
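A minimal sketch of the standard conda workflow for the steps above; name_of_my_env is a placeholder environment
name and some_pypi_only_package a hypothetical package name:

conda create -n name_of_my_env python
activate name_of_my_env
conda install pandas
conda install pip
pip install some_pypi_only_package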
Installation instructions for ActivePython can be found here. Versions 2.7, 3.5 and 3.6 include pandas.
The commands in this table will install pandas for Python 3 from your distribution.
However, the packages in the Linux package managers are often a few versions behind, so to get the newest version of
pandas, it's recommended to install using the pip or conda methods described above.
Handling ImportErrors
If you encounter an ImportError, it usually means that Python couldn’t find pandas in the list of available libraries.
Python internally has a list of directories it searches through, to find packages. You can obtain these directories with:
import sys
sys.path
One way you could be encountering this error is if you have multiple Python installations on your system and you don't
have pandas installed in the Python installation you're currently using. On Linux/Mac you can run which python in
your terminal and it will tell you which Python installation you're using. If it's something like "/usr/bin/python", you're
using the Python from the system, which is not recommended.
It is highly recommended to use conda, for quick installation and for package and dependency updates. You can find
simple installation instructions for pandas on the getting started page.
See the contributing guide for complete instructions on building from the git source tree. Further, see creating a
development environment if you wish to create a pandas development environment.
pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing. To
run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard,
installed), make sure you have pytest >= 6.0 and Hypothesis >= 3.58, then run:
>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas
..................................................................S......
........S................................................................
.........................................................................
Dependencies
Recommended dependencies
• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart chunking
and caching to achieve large speedups. If installed, must be Version 2.7.1 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython routines to
achieve large speedups. If installed, must be Version 1.3.1 or higher.
Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially when
working with large data sets.
Optional dependencies
pandas has many optional dependencies that are only used for specific methods. For example, pandas.read_hdf()
requires the pytables package, while DataFrame.to_markdown() requires the tabulate package. If the optional
dependency is not installed, pandas will raise an ImportError when the method requiring that dependency is called.
Visualization
Computation
Excel files
HTML
One of the following combinations of libraries is needed to use the top-level read_html() function:
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this approach.
Warning:
• If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will not work
with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding the installa-
tion and usage of the above three libraries.
XML
SQL databases
Warning:
• If you want to use read_orc(), it is highly recommended to install pyarrow using conda. The following is
a summary of the environment in which read_orc() can work.
Clipboard
Compression
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open source data analysis/manipulation tool available in any language. It is already well on its way
toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a
pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the
vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users,
DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and
is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply
ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both ag-
gregating and transforming data
• Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into
DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Data structures
The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example,
DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove
objects from these containers in a dictionary-like fashion.
Also, we would like sensible default behaviors for the common API functions which take into account the typical
orientation of time series and cross-sectional data sets. When using the N-dimensional array (ndarrays) to store 2- and
3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions;
axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In
pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set, there is likely to
be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data
transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the
columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable
code:
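The code example referred to here did not survive extraction; a minimal sketch of iterating over the columns by name,
assuming a DataFrame df:

for col in df.columns:
    series = df[col]
    # do something with series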
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The
length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability
where sensible.
Getting support
The first stop for pandas issues and ideas is the Github Issue Tracker. If you have a general question, pandas community
experts can answer through Stack Overflow.
Community
pandas is actively supported today by a community of like-minded individuals around the world who contribute their
valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of the development of pandas as a
world-class open-source project and makes it possible to donate to the project.
Project governance
The governance process that pandas project has used informally since its inception in 2008 is formalized in Project
Governance documents. The documents clarify how decisions are made and how the various elements of our commu-
nity interact, including the relationship between open source collaborative development and work that may be funded
by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).
Development team
The list of the Core Team members and more detailed information can be found on the people’s page of the governance
repo.
Institutional partners
The information about current institutional partners can be found on pandas website page.
License
Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is
pd, so loading pandas as pd is assumed standard practice for all of the pandas documentation.
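The import statement itself was lost from this copy; it is the single conventional line:

In [1]: import pandas as pd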
I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers)
and sex (male/female) data.
In [2]: df = pd.DataFrame(
...: {
...: "Name": [
...: "Braund, Mr. Owen Harris",
...: "Allen, Mr. William Henry",
...: "Bonnell, Miss. Elizabeth",
...: ],
...:         "Age": [22, 35, 58],
...:         "Sex": ["male", "male", "female"],
...:     }
...: )
...:
In [3]: df
Out[3]:
Name Age Sex
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female
To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys
will be used as column headers and the values in each list as columns of the DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers,
floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the
data.frame in R.
• The table has 3 columns, each of them with a column label. The column labels are respectively Name, Age and
Sex.
• The column Name consists of textual data with each value a string, the column Age consists of numbers, and the
column Sex is textual data.
In spreadsheet software, the table representation of our data would look very similar:
I’m just interested in working with the data in the column Age
In [4]: df["Age"]
Out[4]:
0 22
1 35
2 58
Name: Age, dtype: int64
When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the
column label in between square brackets [].
Note: If you are familiar with Python dictionaries, the selection of a single column is very similar to the selection of
dictionary values based on the key.
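The definition of the ages Series used below was lost in extraction; a Series can also be created from scratch, and the
sketch here is consistent with the output that follows:

In [5]: ages = pd.Series([22, 35, 58], name="Age")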
In [6]: ages
Out[6]:
0 22
1 35
2 58
Name: Age, dtype: int64
A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
In [7]: df["Age"].max()
Out[7]: 58
Or to the Series:
In [8]: ages.max()
Out[8]: 58
As illustrated by the max() method, you can do things with a DataFrame or Series. pandas provides a lot of func-
tionalities, each of them a method you can apply to a DataFrame or Series. As methods are functions, do not forget
to use parentheses ().
I’m interested in some basic statistics of the numerical data of my data table
In [9]: df.describe()
Out[9]:
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000
The describe() method provides a quick overview of the numerical data in a DataFrame. As the Name and Sex
columns are textual data, these are by default not taken into account by the describe() method.
Many pandas operations return a DataFrame or a Series. The describe() method is an example of a pandas
operation returning a pandas Series or a pandas DataFrame.
Check more options on describe in the user guide section about aggregations with describe
Note: This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns
and rows. Apart from the representation, the data manipulations and calculations you would do in spreadsheet software
are also supported by pandas. Continue reading the next tutorials to get started!
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Whether the passenger survived, with value 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication whether the passenger has siblings and/or a spouse aboard.
• Parch: Whether the passenger is alone or has family aboard.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports
many different file formats or data sources out of the box (csv, excel, sql, json, parquet, . . . ), each of them with the
prefix read_*.
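The read_csv call itself was lost from this copy; a sketch, with the data/titanic.csv path following the tutorial's
convention:

In [2]: titanic = pd.read_csv("data/titanic.csv")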
Make sure to always check the data after reading it in. When displaying a DataFrame, the first and last 5 rows will
be shown by default:
In [3]: titanic
Out[3]:
PassengerId Survived Pclass Name ..
˓→. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ..
˓→. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ..
˓→. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ..
˓→. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ..
˓→. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ..
˓→. 373450 8.0500 NaN S
.. ... ... ... ... ..
˓→. ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas ..
˓→. 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith ..
˓→. 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" ..
˓→. W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell ..
˓→. 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick ..
˓→. 370376 7.7500 NaN Q
In [4]: titanic.head(8)
Out[4]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
[8 rows x 12 columns]
To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8) as
argument.
Note: Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10)
will return the last 10 rows of the DataFrame.
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes
attribute:
In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
For each of the columns, the data type in use is listed. The data types in this DataFrame are integers (int64), floats
(float64) and strings (object).
Note: When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. At-
tributes of a DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series,
whereas a method (which requires brackets) does something with the DataFrame/Series, as introduced in the first tuto-
rial.
Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel()
method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default
Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.
The equivalent read function read_excel() will reload the data to a DataFrame:
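Both calls were lost in extraction; a sketch matching the description above, with titanic.xlsx as the working filename:

In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")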
In [8]: titanic.head()
Out[8]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:
• It is indeed a DataFrame.
• There are 891 entries, i.e. 891 rows.
• Each row has a row label (aka the index) with values ranging from 0 to 890.
• The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some
columns do have missing values and fewer than 891 non-null values.
• The columns Name, Sex, Cabin and Embarked consist of textual data (strings, aka object). The other columns
are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
• The kind of data (characters, integers, ...) in the different columns is summarized by listing the dtypes.
• The approximate amount of RAM used to hold the DataFrame is provided as well.
• Getting data into pandas from many different file formats or data sources is supported by read_* functions.
• Exporting data out of pandas is provided by different to_* methods.
• The head/tail/info methods and the dtypes attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about
reader and writer functions.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Whether the passenger survived, with value 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication whether the passenger has siblings and/or a spouse aboard.
• Parch: Whether the passenger is alone or has family aboard.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
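The selection defining ages was dropped in extraction; per the explanation below, it is a single-column selection:

In [4]: ages = titanic["Age"]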
In [5]: ages.head()
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
To select a single column, use square brackets [] with the column name of the column of interest.
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series.
We can verify this by checking the type of the output:
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
In [7]: titanic["Age"].shape
Out[7]: (891,)
DataFrame.shape is an attribute (remember the reading and writing tutorial: do not use parentheses for attributes) of
a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series
is 1-dimensional and only the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
To select multiple columns, use a list of column names within the selection brackets [].
Note: The inner square brackets define a Python list with column names, whereas the outer brackets are used to select
the data from a pandas DataFrame as seen in the previous example.
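The type and shape checks backing the next sentence were dropped in extraction; a sketch, with the shape confirmed
by the text below:

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)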
The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with
both a row and column dimension.
For basic information on indexing, see the user guide section on indexing and selecting data.
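The next example filters on age; the input defining above_35 was lost in extraction and, per the explanation below,
uses a conditional expression:

In [12]: above_35 = titanic[titanic["Age"] > 35]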
In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J ␣
˓→male ... 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth ␣
˓→female ... 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan ␣
˓→male ... 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) ␣
˓→female ... 0 248706 16.0000 NaN S
[5 rows x 12 columns]
To select rows based on a conditional expression, use a condition inside the selection brackets [].
The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value
larger than 35:
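The boolean Series itself was lost here; a sketch of what the condition returns, with the first values consistent with the
ages shown earlier:

In [14]: titanic["Age"] > 35
Out[14]:
0      False
1       True
2      False
3      False
4      False
       ...
Name: Age, Length: 891, dtype: bool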
The output of the conditional expression (>, but also ==, !=, <, <=,. . . would work) is actually a pandas Series of
boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of
boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for
which the value is True will be selected.
We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of
rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:
In [15]: above_35.shape
Out[15]: (217, 12)
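The input defining class_23 was lost here; per the explanation below, it uses the isin() conditional function:

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]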
In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp ␣
˓→Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 ␣
˓→ 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 ␣
˓→ 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 ␣
˓→ 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 ␣
˓→ 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 ␣
˓→ 1 349909 21.0750 NaN S
Similar to the conditional expression, the isin() conditional function returns a True for each row the values are in
the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets
[]. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which
rows the Pclass column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with
an | (or) operator:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
In [19]: class_23.head()
Out[19]:
PassengerId Survived Pclass Name Sex Age SibSp ␣
˓→Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 ␣
˓→ 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 ␣
˓→ 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 ␣
˓→ 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 ␣
˓→ 0 330877 8.4583 NaN Q
Note: When combining multiple conditional statements, each condition must be surrounded by parentheses (). More-
over, you cannot use or/and but need to use the or operator | and the and operator &.
See the dedicated section in the user guide about boolean indexing or about the isin function.
I want to work with passenger data for which the age is known.
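The input defining age_no_na was lost in extraction; per the explanation below, it uses notna():

In [20]: age_no_na = titanic[titanic["Age"].notna()]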
In [21]: age_no_na.head()
Out[21]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
The notna() conditional function returns a True for each row where the values are not a null value. As such, this can
be combined with the selection brackets [] to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if
the shape has changed:
In [22]: age_no_na.shape
Out[22]: (714, 12)
For more dedicated functions on missing values, see the user guide section about handling missing data.
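The next example selects the names of adult passengers; the input that produced adult_names was lost and, per the
loc explanation below, reads:

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]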
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part
before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc operator in front of the selection
brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels,
a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
I’m interested in rows 10 till 25 and columns 3 to 5.
Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc
operator in front of the selection brackets [].
When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For
example, to assign the name anonymous to the first 3 elements of the third column:
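The assignment itself was dropped here; a sketch consistent with the anonymized Name column in the output below:

In [26]: titanic.iloc[0:3, 3] = "anonymous"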
In [27]: titanic.head()
Out[27]:
PassengerId Survived Pclass Name Sex .
˓→.. Parch Ticket Fare Cabin Embarked
0 1 0 3 anonymous male .
˓→.. 0 A/5 21171 7.2500 NaN S
1 2 1 1 anonymous female .
˓→.. 0 PC 17599 71.2833 C85 C
2 3 1 3 anonymous female .
˓→.. 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female .
˓→.. 0 113803 53.1000 C123 S
[5 rows x 12 columns]
See the user guide section on different choices for indexing to get more insight into the usage of loc and iloc.
• When selecting subsets of data, square brackets [] are used.
• Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a
conditional expression or a colon.
• Select specific rows and/or columns using loc when using the row and column names
• Select specific rows and/or columns using iloc when using the positions in the table
• You can assign new values to a selection based on loc/iloc.
A full overview of indexing is provided in the user guide pages on indexing and selecting data.
For this tutorial, air quality data about NO2 is used, made available by openaq and using the py-openaq package. The
air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London
Westminster in respectively Paris, Antwerp and London.
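The imports and read_csv call were lost in extraction; a sketch consistent with the note on index_col and
parse_dates below (the data/air_quality_no2.csv path follows the tutorial's convention):

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)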
In [4]: air_quality.head()
Out[4]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
Note: The index_col and parse_dates parameters of the read_csv function are used to define the first (0th)
column as index of the resulting DataFrame and to convert the dates in the column to Timestamp objects, respectively.
In [5]: air_quality.plot()
Out[5]: <AxesSubplot:xlabel='datetime'>
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.
I want to plot only the columns of the data table with the data from Paris.
In [6]: air_quality["station_paris"].plot()
Out[6]: <AxesSubplot:xlabel='datetime'>
To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method.
Hence, the plot() method works on both Series and DataFrame.
I want to visually compare the NO2 values measured in London versus Paris.
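The call behind this comparison was dropped in extraction; a scatter plot is one way to do it (a sketch, with a hypothetical
alpha value for readability):

In [7]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)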
Apart from the default line plot when using the plot function, a number of alternatives are available to plot data.
Let’s use some standard Python to get an overview of the available plot methods:
In [8]: [
...: method_name
...: for method_name in dir(air_quality.plot)
...: if not method_name.startswith("_")
...: ]
...:
Out[8]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
Note: In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an
overview of the available methods, for example air_quality.plot. + TAB.
One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable on the air
quality example data:
In [9]: air_quality.plot.box()
Out[9]: <AxesSubplot:>
For an introduction to plots other than the default line plot, see the user guide section about supported plot styles.
I want each of the columns in a separate subplot.
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The
builtin options available in each of the pandas plot functions are worth reviewing.
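The subplot call itself was lost here; a sketch using the subplots argument described above:

In [10]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)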
Some more formatting options are explained in the user guide section on plot formatting.
I want to further customize, extend or save the resulting plot.
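The figure setup feeding the ax argument below was lost in extraction; it is a plain Matplotlib call (assuming
import matplotlib.pyplot as plt):

In [11]: fig, axs = plt.subplots(figsize=(12, 4))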
In [12]: air_quality.plot.area(ax=axs)
Out[12]: <AxesSubplot:xlabel='datetime'>
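A label-setting step on the Matplotlib axes also belongs here; reconstructed, with the label text taken to be the
tutorial's:

In [13]: axs.set_ylabel("NO$_2$ concentration")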
In [14]: fig.savefig("no2_concentrations.png")
Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize
plots, making the link between pandas and Matplotlib explicit enables all the power of Matplotlib for the plot. This
strategy is applied in the previous example.
For this tutorial, air quality data about NO2 is used, made available by openaq and using the py-openaq package. The
air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London
Westminster in respectively Paris, Antwerp and London.
In [3]: air_quality.head()
Out[3]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
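The first example of this tutorial adds a column expressing the London NO2 concentration in mg/m³; the assignment
was lost in extraction and, per the note below, it multiplies the London values by 1.882:

In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882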
In [5]: air_quality.head()
Out[5]:
station_antwerp station_paris station_london london_mg_per_cubic
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286
2019-05-07 03:00:00 50.5 25.0 19.0 35.758
2019-05-07 04:00:00 45.0 27.7 19.0 35.758
2019-05-07 05:00:00 NaN 50.4 16.0 30.112
2019-05-07 06:00:00 NaN 61.9 NaN NaN
To create a new column, use the [] brackets with the new column name at the left side of the assignment.
Note: The calculation of the values is done element-wise. This means all values in the given column are multiplied
by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!
I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column
In [6]: air_quality["ratio_paris_antwerp"] = (
...: air_quality["station_paris"] / air_quality["station_antwerp"]
...: )
...:
In [7]: air_quality.head()
Out[7]:
                     station_antwerp  station_paris  station_london  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00              NaN            NaN            23.0               43.286                  NaN
2019-05-07 03:00:00             50.5           25.0            19.0               35.758             0.495050
2019-05-07 04:00:00             45.0           27.7            19.0               35.758             0.615556
2019-05-07 05:00:00              NaN           50.4            16.0               30.112                  NaN
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN                  NaN
The calculation is again element-wise, so the / is applied for the values in each row.
Other mathematical operators (+, -, *, /) and logical operators (<, >, ==, ...) also work element-wise. The latter was
already used in the subset data tutorial to filter rows of a table using a conditional expression.
If you need more advanced logic, you can use arbitrary Python code via apply().
I want to rename the data columns to the corresponding station identifiers used by openAQ
In [8]: air_quality_renamed = air_quality.rename(
...: columns={
...: "station_antwerp": "BETR801",
...: "station_paris": "FR04014",
...: "station_london": "London Westminster",
...: }
...: )
...:
In [9]: air_quality_renamed.head()
Out[9]:
                     BETR801  FR04014  London Westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN
The rename() function can be used for both row labels and column labels. Provide a dictionary with the keys the
current names and the values the new names to update the corresponding names.
The mapping should not be restricted to fixed names only, but can be a mapping function as well. For example,
converting the column names to lowercase letters can be done using a function:
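The call itself was dropped here; a sketch using Python's built-in str.lower as the mapping function:

In [10]: air_quality_renamed = air_quality_renamed.rename(columns=str.lower)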
In [11]: air_quality_renamed.head()
Out[11]:
                     betr801  fr04014  london westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN
Details about column or row label renaming are provided in the user guide section on renaming labels.
• Create a new column by assigning the output to the DataFrame with a new column name in between the [].
• Operations are element-wise, no need to loop over rows.
• Use rename with a dictionary or function to rename row labels or column names.
The user guide contains a separate section on column addition and deletion.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Whether the passenger survived, with value 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication whether the passenger has siblings and/or a spouse aboard.
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
Aggregating statistics
In [4]: titanic["Age"].mean()
Out[4]: 29.69911764705882
Different statistics are available and can be applied to columns with numerical data. Operations in general exclude
missing data and operate across rows by default.
What is the median age and ticket fare price of the Titanic passengers?
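The call was lost in extraction; a sketch, with the medians matching the agg output further below:

In [5]: titanic[["Age", "Fare"]].median()
Out[5]:
Age     28.0000
Fare    14.4542
dtype: float64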
The statistic applied to multiple columns of a DataFrame (the selection of two columns return a DataFrame, see the
subset data tutorial) is calculated for each numeric column.
The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function
from the first tutorial?
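The example itself did not survive extraction; applying describe to the same two columns would look like:

In [6]: titanic[["Age", "Fare"]].describe()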
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined
using the DataFrame.agg() method:
In [7]: titanic.agg(
...: {
...: "Age": ["min", "max", "median", "skew"],
...: "Fare": ["min", "max", "median", "mean"],
...: }
...: )
...:
Out[7]:
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208
Details about descriptive statistics are provided in the user guide section on descriptive statistics.
What is the average age for male versus female Titanic passengers?
As our interest is the average age for each gender, a subselection on these two columns is made first:
titanic[["Sex", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category.
The average age for each gender is calculated and returned.
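The groupby call was lost here; a sketch, with the averages matching Out[10] below:

In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean()
Out[8]:
              Age
Sex
female  27.915709
male    30.726645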
Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a
common pattern. The groupby method is used to support this type of operation. More generally, this fits the
split-apply-combine pattern:
In [9]: titanic.groupby("Sex").mean()
Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893
It does not make much sense to get the average value of the Pclass. If we are only interested in the average age for
each gender, the selection of columns (rectangular brackets [] as usual) is supported on the grouped data as well:
In [10]: titanic.groupby("Sex")["Age"].mean()
Out[10]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
Note: The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively
the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a
Categorical data type to handle this type of data. More information is provided in the user guide Categorical data
section.
What is the mean ticket fare price for each of the sex and cabin class combinations?
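The call itself was dropped in extraction; per the next paragraph, it passes both column names as a list to groupby():

In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean()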
Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby()
method.
A full description on the split-apply-combine approach is provided in the user guide section on groupby operations.
In [12]: titanic["Pclass"].value_counts()
Out[12]:
3 491
1 216
2 184
Name: Pclass, dtype: int64
The value_counts() method counts the number of records for each category in a column.
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
within each group:
In [13]: titanic.groupby("Pclass")["Pclass"].count()
Out[13]:
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
Note: Both size and count can be used in combination with groupby. Whereas size includes NaN values and just
provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use
the dropna argument to include or exclude the NaN values.
The user guide has a dedicated section on value_counts , see page on discretization.
• Aggregation statistics can be calculated on entire columns or rows
• groupby provides the power of the split-apply-combine pattern
• value_counts is a convenient shortcut to count the number of entries in each category of a variable
A full description on the split-apply-combine approach is provided in the user guide pages about groupby operations.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Whether the passenger survived, with value 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication whether the passenger has siblings and/or a spouse aboard.
• Parch: Whether the passenger is alone or has family aboard.
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name ␣
˓→Sex ... Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ␣
˓→male ... 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ␣
˓→female ... 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ␣
˓→female ... 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ␣
˓→female ... 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ␣
˓→male ... 0 373450 8.0500 NaN S
[5 rows x 12 columns]
This tutorial uses air quality data about NO2 and particulate matter less than 2.5 micrometers, made available by
openaq and using the py-openaq package. The air_quality_long.csv data set provides NO2 and PM2.5 values for
the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
The air-quality data set has the following columns:
• city: city where the sensor is used, either Paris, Antwerp or London
• country: country where the sensor is used, either FR, BE or GB
• location: the id of the sensor, either FR04014, BETR801 or London Westminster
• parameter: the parameter measured by the sensor, either NO2 or particulate matter
• value: the measured value
• unit: the unit of the measured parameter, in this case 'µg/m3'
and the index of the DataFrame is datetime, the datetime of the measurement.
Note: The air-quality data is provided in a so-called long format data representation with each observation on a
separate row and each variable a separate column of the data table. The long/narrow format is also known as the tidy
data format.
In [5]: air_quality.head()
I want to sort the Titanic data according to the age of the passengers.
In [6]: titanic.sort_values(by="Age").head()
Out[6]:
     PassengerId  Survived  Pclass                             Name     Sex   Age  SibSp  Parch  Ticket     Fare Cabin Embarked
803          804         1       3  Thomas, Master. Assad Alexander    male  0.42      0      1    2625   8.5167   NaN        C
755          756         1       2        Hamalainen, Master. Viljo    male  0.67      1      1  250649  14.5000   NaN        S
644          645         1       3           Baclini, Miss. Eugenie  female  0.75      2      1    2666  19.2583   NaN        C
469          470         1       3    Baclini, Miss. Helene Barbara  female  0.75      2      1    2666  19.2583   NaN        C
78            79         1       2    Caldwell, Master. Alden Gates    male  0.83      0      2  248738  29.0000   NaN        S
I want to sort the Titanic data according to the cabin class and age in descending order.
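A sketch of such a sort, passing a list of columns and setting ascending=False:

# sort on cabin class first, then on age, both in descending order
titanic.sort_values(by=["Pclass", "Age"], ascending=False).head()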
With DataFrame.sort_values(), the rows in the table are sorted according to the defined column(s). The index will
follow the row order.
More details about sorting of tables are provided in the user guide section on sorting data.
Let’s use a small subset of the air quality data set. We focus on NO2 data and only use the first two measurements of
each location (i.e. the head of each group). The subset of data will be called no2_subset.
In [10]: no2_subset
Out[10]:
city country location parameter value unit
date.utc
2019-04-09 01:00:00+00:00 Antwerpen BE BETR801 no2 22.5 µg/m3
2019-04-09 01:00:00+00:00 Paris FR FR04014 no2 24.4 µg/m3
2019-04-09 02:00:00+00:00 London GB London Westminster no2 67.0 µg/m3
2019-04-09 02:00:00+00:00 Antwerpen BE BETR801 no2 53.5 µg/m3
2019-04-09 02:00:00+00:00 Paris FR FR04014 no2 27.4 µg/m3
2019-04-09 03:00:00+00:00 London GB London Westminster no2 67.0 µg/m3
I want the values for the three stations as separate columns next to each other.
The pivot() function purely reshapes the data: a single value is required for each index/column combination.
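For example, to spread the subset out into one column per measurement station:

# reshape long to wide: one column per location
no2_subset.pivot(columns="location", values="value")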
As pandas supports plotting of multiple columns (see the plotting tutorial) out of the box, the conversion from long to wide
table format enables plotting the different time series at the same time:
In [12]: no2.head()
Out[12]:
city country location parameter value unit
date.utc
2019-06-21 00:00:00+00:00 Paris FR FR04014 no2 20.0 µg/m3
2019-06-20 23:00:00+00:00 Paris FR FR04014 no2 21.8 µg/m3
2019-06-20 22:00:00+00:00 Paris FR FR04014 no2 26.5 µg/m3
2019-06-20 21:00:00+00:00 Paris FR FR04014 no2 24.9 µg/m3
2019-06-20 20:00:00+00:00 Paris FR FR04014 no2 21.4 µg/m3
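A sketch of that combined plot, pivoting first and then calling plot() on the wide table:

# each location becomes a column, so plot() draws one line per station
no2.pivot(columns="location", values="value").plot()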
Note: When the index parameter is not defined, the existing index (row labels) is used.
For more information about pivot(), see the user guide section on pivoting DataFrame objects.
Pivot table
I want the mean concentrations for NO2 and PM2.5 at each of the stations in table form
In [14]: air_quality.pivot_table(
....: values="value", index="location", columns="parameter", aggfunc="mean"
....: )
....:
Out[14]:
parameter no2 pm25
location
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568
In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case,
the values on different time steps), pivot_table() can be used, providing an aggregation function (e.g. mean) that defines how
to combine these values.
Pivot table is a well-known concept in spreadsheet software. When interested in summary columns for each variable
separately as well, set the margins parameter to True:
In [15]: air_quality.pivot_table(
....: values="value",
....: index="location",
....: columns="parameter",
....: aggfunc="mean",
....: margins=True,
....: )
....:
Out[15]:
parameter no2 pm25 All
location
BETR801 26.950920 23.169492 24.982353
FR04014 29.374284 NaN 29.374284
London Westminster 29.740050 13.443568 21.491708
All 29.430316 14.386849 24.222743
For more information about pivot_table(), see the user guide section on pivot tables.
Note: In case you are wondering, pivot_table() is indeed directly linked to groupby(). The same result can be
derived by grouping on both parameter and location:
air_quality.groupby(["parameter", "location"]).mean()
Have a look at groupby() in combination with unstack() at the user guide section on combining stats and groupby.
Starting again from the wide format table created in the previous section:
In [17]: no2_pivoted.head()
Out[17]:
location date.utc BETR801 FR04014 London Westminster
0 2019-04-09 01:00:00+00:00 22.5 24.4 NaN
1 2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2 2019-04-09 03:00:00+00:00 54.5 34.2 67.0
3 2019-04-09 04:00:00+00:00 34.5 48.5 41.0
4 2019-04-09 05:00:00+00:00 46.5 59.5 41.0
I want to collect all air quality NO2 measurements in a single column (long format)
The pandas.melt() method on a DataFrame converts the data table from wide format to long format. The column
headers become the variable names in a newly created column.
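A sketch of this conversion, assuming the wide table no2_pivoted from the previous section:

# keep the datetime column as identifier; melt all other columns
no_2 = no2_pivoted.melt(id_vars="date.utc")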
This is the short version of applying pandas.melt(). The method will melt all columns NOT mentioned
in id_vars together into two columns: one with the column header names and one with the values themselves.
The latter column gets the name value by default.
The pandas.melt() method can be defined in more detail:
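For example, a more explicit call consistent with the output shown below:

no_2 = no2_pivoted.melt(
    id_vars="date.utc",                                       # column(s) not to melt
    value_vars=["BETR801", "FR04014", "London Westminster"],  # columns to melt
    value_name="NO_2",                                        # name of the values column
    var_name="id_location",                                   # name of the column holding the old headers
)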
In [21]: no_2.head()
Out[21]:
date.utc id_location NO_2
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5
For this tutorial, air quality data about NO2 is used, made available by openaq and downloaded using the py-openaq
package.
The air_quality_no2_long.csv data set provides NO2 values for the measurement stations FR04014, BETR801
and London Westminster in Paris, Antwerp and London, respectively.
In [4]: air_quality_no2.head()
Out[4]:
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4
For this tutorial, air quality data about particulate matter less than 2.5 micrometers is used, made available by openaq
and downloaded using the py-openaq package.
The air_quality_pm25_long.csv data set provides PM2.5 values for the measurement stations FR04014, BETR801
and London Westminster in Paris, Antwerp and London, respectively.
In [7]: air_quality_pm25.head()
Out[7]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
Concatenating objects
I want to combine the measurements of NO2 and PM2.5, two tables with a similar structure, in a single table
In [9]: air_quality.head()
Out[9]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
The concat() function performs concatenation operations of multiple tables along one of the axes (row-wise or
column-wise).
By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let’s check the
shape of the original and the concatenated tables to verify the operation:
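A sketch of the row-wise concatenation and the shape check:

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

# the row count of the result is the sum of the two inputs
print("Shape of the air_quality_pm25 table: ", air_quality_pm25.shape)
print("Shape of the air_quality_no2 table: ", air_quality_no2.shape)
print("Shape of the resulting air_quality table: ", air_quality.shape)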
Note: The axis argument will recur in a number of pandas methods that can be applied along an axis. A DataFrame
has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running
horizontally across columns (axis 1). Most operations, like concatenation or summary statistics, are by default across rows
(axis 0), but can be applied across columns as well.
Sorting the table on the datetime information also illustrates the combination of both tables, with the parameter
column defining the origin of the table (either no2 from table air_quality_no2 or pm25 from table
air_quality_pm25):
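For example:

air_quality = air_quality.sort_values("date.utc")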
In [14]: air_quality.head()
Out[14]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
In this specific example, the parameter column provided by the data ensures that each of the original tables can be
identified. This is not always the case. The concat function provides a convenient solution with the keys argument,
adding an additional (hierarchical) row index. For example:
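A sketch of such a keyed concatenation, consistent with the output shown below:

air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])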
In [16]: air_quality_.head()
Out[16]:
date.utc location parameter value
PM25 0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
Note: The existence of multiple row/column indices at the same time has not been mentioned within these tutorials.
Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.
Multi-indexing is out of scope for this pandas introduction. For the moment, remember that the function reset_index
can be used to convert any level of an index to a column, e.g. air_quality.reset_index(level=0)
Feel free to dive into the world of multi-indexing at the user guide section on advanced indexing.
More options on table concatenation (row- and column-wise) and how concat can be used to define the logic (union or
intersection) of the indexes on the other axes are provided in the section on object concatenation.
Add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements
table.
Warning: The air quality measurement station coordinates are stored in a data file air_quality_stations.csv,
downloaded using the py-openaq package.
In [18]: stations_coord.head()
Out[18]:
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
Note: The stations used in this example (FR04014, BETR801 and London Westminster) are just three entries listed
in the metadata table. We only want to add the coordinates of these three to the measurements table, each on the
corresponding rows of the air_quality table.
In [19]: air_quality.head()
Out[19]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
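A sketch of the left join that adds the coordinates, the result of which is shown below:

# add latitude/longitude from the stations table, matching on "location"
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")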
In [21]: air_quality.head()
Out[21]:
                    date.utc            location parameter  value  coordinates.latitude  coordinates.longitude
Using the merge() function, for each of the rows in the air_quality table, the corresponding coordinates are added
from the air_quality_stations_coord table. Both tables have the column location in common which is used as
a key to combine the information. By choosing the left join, only the locations available in the air_quality (left)
table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The merge function supports
multiple join options similar to database-style operations.
Add the parameter full description and name, provided by the parameters metadata table, to the measurements table
Warning: The air quality parameters metadata are stored in a data file air_quality_parameters.csv, down-
loaded using the py-openaq package.
In [23]: air_quality_parameters.head()
Out[23]:
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
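A sketch of the merge producing the combined table below, linking the parameter column to the id column:

air_quality = pd.merge(
    air_quality, air_quality_parameters, how="left", left_on="parameter", right_on="id"
)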
In [25]: air_quality.head()
Out[25]:
                    date.utc            location parameter  ...    id                                        description  name
0  2019-05-07 01:00:00+00:00  London Westminster       no2  ...   no2                                   Nitrogen Dioxide   NO2
1  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide   NO2
2  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide   NO2
3  2019-05-07 01:00:00+00:00             BETR801      pm25  ...  pm25  Particulate matter less than 2.5 micrometers i...  PM2.5
4  2019-05-07 01:00:00+00:00             BETR801       no2  ...   no2                                   Nitrogen Dioxide   NO2
[5 rows x 9 columns]
Compared to the previous example, there is no common column name. However, the parameter column in the
air_quality table and the id column in the air_quality_parameters table both provide the measured variable
in a common format. The left_on and right_on arguments are used here (instead of just on) to make the link
between the two tables.
pandas also supports inner, outer, and right joins. More information on joining/merging of tables is provided in the user
guide section on database-style merging of tables. Or have a look at the comparison with SQL page.
• Multiple tables can be concatenated both column-wise and row-wise using the concat function.
• For database-like merging/joining of tables, use the merge function.
See the user guide for a full description of the various facilities to combine data tables.
For this tutorial, air quality data about NO2 and particulate matter less than 2.5 micrometers is used, made available
by openaq and downloaded using the py-openaq package. The air_quality_no2_long.csv data set provides NO2
values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and
London, respectively.
In [5]: air_quality.head()
Out[5]:
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3
In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)
I want to work with the dates in the column datetime as datetime objects instead of plain text
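The conversion itself can be done with pandas.to_datetime(), for example:

# reinterpret the string column as proper datetime objects
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])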
In [8]: air_quality["datetime"]
Out[8]:
0 2019-06-21 00:00:00+00:00
1 2019-06-20 23:00:00+00:00
2 2019-06-20 22:00:00+00:00
3 2019-06-20 21:00:00+00:00
4 2019-06-20 20:00:00+00:00
...
2063 2019-05-07 06:00:00+00:00
2064 2019-05-07 04:00:00+00:00
2065 2019-05-07 03:00:00+00:00
2066 2019-05-07 02:00:00+00:00
2067 2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]
Initially, the values in datetime are character strings and do not provide any datetime operations (e.g. extracting the year
or day of the week). By applying the to_datetime function, pandas interprets the strings and converts them to datetime
(i.e. datetime64[ns, UTC]) objects. In pandas, these datetime objects, similar to datetime.datetime from
the standard library, are called pandas.Timestamp.
Note: As many data sets do contain datetime information in one of the columns, pandas input functions like
pandas.read_csv() and pandas.read_json() can do the transformation to dates when reading the data, using
the parse_dates parameter with a list of the columns to read as Timestamp:
pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
Why are these pandas.Timestamp objects useful? Let’s illustrate the added value with some example cases.
What is the start and end date of the time series data set we are working with?
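For example, taking the minimum and maximum Timestamp of the column:

air_quality["datetime"].min(), air_quality["datetime"].max()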
Using pandas.Timestamp for datetimes enables us to calculate with date information and make them comparable.
Hence, we can use this to get the length of our time series:
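# subtracting two Timestamps yields the duration of the series
air_quality["datetime"].max() - air_quality["datetime"].min()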
The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library and
defining a time duration.
The various time concepts supported by pandas are explained in the user guide section on time related concepts.
I want to add a new column to the DataFrame containing only the month of the measurement
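For example, using the dt accessor introduced below:

air_quality["month"] = air_quality["datetime"].dt.month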
In [12]: air_quality.head()
Out[12]:
city country datetime location parameter value unit month
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3 6
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m3 6
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m3 6
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m3 6
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m3 6
By using Timestamp objects for dates, a lot of time-related properties are provided by pandas: for example the month,
but also year, weekofyear, quarter, etc. All of these properties are accessible via the dt accessor.
An overview of the existing date properties is given in the time and date components overview table. More details
about the dt accessor to return datetime like properties are explained in a dedicated section on the dt accessor.
What is the average NO2 concentration for each day of the week for each of the measurement locations?
In [13]: air_quality.groupby(
....: [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
....:
Out[13]:
datetime location
0 BETR801 27.875000
FR04014 24.856250
London Westminster 23.969697
1 BETR801 22.214286
FR04014 30.999359
...
5 FR04014 25.266154
London Westminster 24.977612
6 BETR801 21.896552
FR04014 23.274306
London Westminster 24.859155
Name: value, Length: 21, dtype: float64
Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here, we
want to calculate a given statistic (e.g. mean NO2) for each weekday and for each measurement location. To group
on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp, which is
also accessible by the dt accessor. The grouping on both locations and weekdays can be done to split the calculation
of the mean on each of these combinations.
Danger: As we are working with a very short time series in these examples, the analysis does not provide a
long-term representative result!
Plot the typical NO2 pattern during the day for our time series of all stations together. In other words, what is the
average value for each hour of the day?
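The plot below draws on a matplotlib Axes object axs; a minimal setup, assuming matplotlib is installed, could be:

import matplotlib.pyplot as plt

# a single Axes to draw the bar chart on
fig, axs = plt.subplots(figsize=(12, 4))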
In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
....: kind='bar', rot=0, ax=axs
....: )
....:
Out[15]: <AxesSubplot:xlabel='datetime'>
Similar to the previous case, we want to calculate a given statistic (e.g. mean NO2) for each hour of the day and we can
use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas Timestamp,
which is also accessible by the dt accessor.
Datetime as index
In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurement locations
as a separate column:
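A sketch of that pivot, consistent with the table shown below:

no_2 = air_quality.pivot(index="datetime", columns="location", values="value")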
In [19]: no_2.head()
Out[19]:
location BETR801 FR04014 London Westminster
datetime
2019-05-07 01:00:00+00:00 50.5 25.0 23.0
2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2019-05-07 03:00:00+00:00 NaN 50.4 19.0
2019-05-07 04:00:00+00:00 NaN 61.9 16.0
2019-05-07 05:00:00+00:00 NaN 72.4 NaN
Note: By pivoting the data, the datetime information became the index of the table. In general, setting a column as an
index can be achieved by the set_index function.
Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not need
the dt accessor to get the time series properties, but have these properties available on the index directly:
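For example:

# properties live directly on the DatetimeIndex, no dt accessor needed
no_2.index.year, no_2.index.weekday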
Some other advantages are the convenient subsetting of time periods and the adapted time scale on plots. Let’s apply this
to our data.
Create a plot of the NO2 values in the different stations from the 20th of May till the end of the 21st of May
In [21]: no_2["2019-05-20":"2019-05-21"].plot();
By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.
More information on the DatetimeIndex and the slicing by using strings is provided in the section on time series
indexing.
Aggregate the current hourly time series values to the monthly maximum value in each of the stations.
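A sketch of the resampling, consistent with the monthly table shown below:

# "M" groups by month-end; max() picks the largest value in each group
monthly_max = no_2.resample("M").max()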
In [23]: monthly_max
Out[23]:
location BETR801 FR04014 London Westminster
datetime
2019-05-31 00:00:00+00:00 74.5 97.0 97.0
2019-06-30 00:00:00+00:00 52.5 84.7 52.0
A very powerful method on time series data with a datetime index is the ability to resample() time series to another
frequency (e.g. converting secondly data into 5-minutely data).
The resample() method is similar to a groupby operation:
• it provides a time-based grouping, by using a string (e.g. M, 5H,. . . ) that defines the target frequency
• it requires an aggregation function such as mean, max,. . .
An overview of the aliases used to define time series frequencies is given in the offset aliases overview table.
When defined, the frequency of the time series is provided by the freq attribute:
In [24]: monthly_max.index.freq
Out[24]: <MonthEnd>
More details on the power of time series resampling are provided in the user guide section on resampling.
• Valid date strings can be converted to datetime objects using the to_datetime function or as part of read functions.
• Datetime objects in pandas support calculations, logical operations and convenient date-related properties using
the dt accessor.
• A DatetimeIndex contains these date-related properties and supports convenient slicing.
• Resample is a powerful method to change the frequency of a time series.
A full overview on time series is given on the pages on time series and date functionality.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Indicates whether the passenger survived: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of the passenger.
• Sex: Gender of the passenger.
• Age: Age of the passenger.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Ticket: Ticket number of the passenger.
• Fare: The fare paid for the ticket.
• Cabin: Cabin number of the passenger.
• Embarked: Port of embarkation.
In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500   NaN        S
[5 rows x 12 columns]
In [4]: titanic["Name"].str.lower()
Out[4]:
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss. laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss. margaret edith
888 johnston, miss. catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object
To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of
data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.
Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are
available when using the str accessor. These methods generally have names matching the equivalent built-in
string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of
the values of the column.
Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.
In [5]: titanic["Name"].str.split(",")
Out[5]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is
the part before the comma and the second element is the part after the comma.
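For example, combining split and get to create the new column:

# take element 0 of each list, i.e. the surname before the comma
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)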
In [7]: titanic["Surname"]
Out[7]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object
As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and
apply Series.str.get() to extract the relevant part. Indeed, these string functions can be concatenated to combine
multiple functions at once!
More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.
Extract the passenger data about the countesses on board the Titanic.
In [8]: titanic["Name"].str.contains("Countess")
Out[8]:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Name, Length: 891, dtype: bool
In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]:
     PassengerId  Survived  Pclass                                               Name     Sex  ...  Ticket  Fare Cabin Embarked Surname
759          760         1       1  Rothes, the Countess. of (Lucy Noel Martha Dye...  female  ...  110152  86.5   B77        S  Rothes
[1 rows x 13 columns]
Note: More powerful extractions on strings are supported, as the Series.str.contains() and
Series.str.extract() methods accept regular expressions, but these are out of scope for this tutorial.
More information on extracting parts of strings is available in the user guide section on string matching and extracting.
Which passenger of the Titanic has the longest name?
In [10]: titanic["Name"].str.len()
Out[10]:
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas
string methods, the Series.str.len() function is applied to each of the names individually (element-wise).
In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307
Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is
the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is
used.
Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator,
introduced in the tutorial on subsetting.
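For example:

# row label 307, column "Name" gives the longest passenger name
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]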
In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.
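For example, using replace() with a dictionary mapping:

titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})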
In [14]: titanic["Sex_short"]
Out[14]:
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex_short, Length: 891, dtype: object
Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate
certain values. It requires a dictionary to define the mapping {from: to}.
Warning: There is also a str.replace() method available to replace a specific set of characters. However, when
having a mapping of multiple values, this would become:
titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")
This would become cumbersome and easily lead to mistakes. Just think (or try out yourself) what would happen if
those two statements were applied in the opposite order: because "male" is a substring of "female", replacing "male"
first would also rewrite part of every "female" value, leaving the female passengers mislabeled.
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page
was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas.
In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code
comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External compatibility for
an example.
Quick reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:10]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see note 1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)
Sorting
R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)
Transforming
R pandas
select(df, col_one = col1) df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1) df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b) df.assign(c=df['a']-df['b'])
1 R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas, if you have the list of columns, for
example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
R pandas
summary(df) df.describe()
gdf <- group_by(df, col1) gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE)) df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1)) df.groupby('col1').sum()
Base R
R makes it easy to access data.frame columns by name or by integer location.
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer
attribute and numpy.r_.
In [5]: n = 30
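A sketch of the idea, assuming pandas and NumPy are imported as pd and np:

# an illustrative 30x30 DataFrame
df = pd.DataFrame(np.random.randn(n, n))

# select column positions 0-9 and 24-29 in a single indexing operation
df.iloc[:, np.r_[:10, 24:30]]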
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting
it into groups by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
In [9]: df = pd.DataFrame(
...: {
...: "v1": [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: "v2": [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: "by1": ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: "by2": [
...: "wet",
...: "dry",
...: 99,
...: 95,
...: np.nan,
...: "damp",
...: 95,
...: 99,
...: "red",
...: 99,
...: np.nan,
...: np.nan,
...: ],
...: }
...: )
...:
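The pandas equivalent groups on both keys and takes the mean of the value columns:

g = df.groupby(["by1", "by2"])
g[["v1", "v2"]].mean()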
match / %in%
A common way to select data in R is using %in% which is defined using the function match. The operator %in% is used
to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
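In pandas, the equivalent of %in% is the isin() method, which returns a boolean mask:

s = pd.Series(np.arange(5), dtype=np.float32)
s.isin([2, 4])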
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using
a data.frame called baseball, and retrieving information based on the array team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team,
max)
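In pandas, a pivot_table() achieves the same per-team maximum; a sketch with randomly generated data:

import random
import string

baseball = pd.DataFrame(
    {
        "team": ["team %d" % (x + 1) for x in range(5)] * 5,
        "player": random.sample(list(string.ascii_lowercase), 25),
        "batting avg": np.random.uniform(0.200, 0.400, 25),
    }
)

# maximum batting average per team
baseball.pivot_table(values="batting avg", columns="team", aggfunc=np.max)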
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame
where one column’s values are less than another column’s values, e.g. subset(df, a <= b).
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an
index/slice as well as standard boolean indexing:
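A sketch with a small random DataFrame:

df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

df.query("a <= b")          # query expression
df[df["a"] <= df["b"]]      # standard boolean indexing
df.loc[df["a"] <= df["b"]]  # equivalent, via the loc indexer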
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so: with(df, a + b).
In pandas the equivalent expression, using the eval() method, would be:
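Reusing a DataFrame df with columns a and b as above:

df.eval("a + b")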
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval
documentation.
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data
structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data structures
could be mapped in Python.
R Python
array list
lists dictionary or list of objects
data.frame dataframe
ddply
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
In pandas the equivalent expression, using the groupby() method, would be:
In [25]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 120),
....: "y": np.random.uniform(7.0, 334.0, 120),
....: "z": np.random.uniform(1.7, 20.7, 120),
....: "month": [5, 6, 7, 8] * 30,
....: "week": np.random.randint(1, 4, 120),
....: }
....: )
....:
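Grouping on month and week and aggregating x, mirroring the summarize step of the ddply example:

# mean and standard deviation of x per (month, week) combination
df.groupby(["month", "week"]).agg({"x": ["mean", "std"]})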
reshape / reshape2
meltarray
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
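In Python, a sketch of the same idea with np.ndenumerate, which yields (index tuple, value) pairs:

a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# one row per array element: the index coordinates plus the value
pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])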
meltlist
An expression using a list called a in R where you want to melt it into a data.frame:
In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a DataFrame as required.
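A reconstruction of a that is consistent with the output below:

# pairs of (position, value), with a trailing missing value
a = list(enumerate(list(range(1, 5)) + [np.nan]))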
In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN
For more details and examples see the Intro to Data Structures documentation.
meltdf
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
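In pandas, the same reshape is done with melt(); a sketch with a small cheese DataFrame:

cheese = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)

# keep the name columns as identifiers; melt height and weight
pd.melt(cheese, id_vars=["first", "last"])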
cast
In R, acast is an expression using a data.frame called df to cast it into a higher-dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
In [35]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 12),
....: "y": np.random.uniform(7.0, 334.0, 12),
....: "z": np.random.uniform(1.7, 20.7, 12),
....: "month": [5, 6, 7] * 4,
....: "week": [1, 2] * 6,
....: }
....: )
....:
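The wide table is first melted into the long-format mdf used in the pivot below, e.g.:

mdf = pd.melt(df, id_vars=["month", "week"])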
In [37]: pd.pivot_table(
....: mdf,
....: values="value",
....: index=["variable", "week"],
....: columns=["month"],
....: aggfunc=np.mean,
....: )
....:
Out[37]:
month 5 6 7
variable week
x 1 93.888747 98.762034 55.219673
2 94.391427 38.112932 83.942781
y 1 94.306912 279.454811 227.840449
2 87.392662 193.028166 173.899260
z 1 11.016009 10.079307 16.170549
2 8.476111 17.638509 19.003494
Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [38]: df = pd.DataFrame(
....: {
....: "Animal": [
....: "Animal1",
....: "Animal2",
....: "Animal3",
....: "Animal2",
....: "Animal1",
....: "Animal2",
....: "Animal3",
....: ],
....: "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
....: "Amount": [10, 7, 4, 2, 5, 6, 2],
....: }
....: )
....:
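A pivot_table call consistent with the output below, summing the amounts per animal and feed type:

df.pivot_table(values="Amount", index="Animal", columns="FeedType", aggfunc="sum")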
Out[39]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN
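The second approach is to use the groupby() method:

df.groupby(["Animal", "FeedType"])["Amount"].sum()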
For more details and examples see the reshaping documentation or the groupby documentation.
factor
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
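In pandas these correspond to cut() and the categorical dtype:

# bin continuous values into 3 equal-width intervals, like R's cut
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

# an R factor is closest to a pandas categorical
pd.Series([1, 2, 3, 2, 2, 3]).astype("category")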
For more details and examples see the categorical introduction and the API documentation. There is also documentation
regarding the differences relative to R’s factor.
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of
how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:
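import pandas as pd
import numpy as np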
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame
called tips and assume we have a database table of the same name and structure.
In [3]: url = (
...: "https://raw.github.com/pandas-dev"
In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either
assign to a new variable:
sorted_df = df.sort_values("col1")
df = df.sort_values("col1")
Note: You will see an inplace=True keyword argument available for some methods:
df.sort_values("col1", inplace=True)
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
With pandas, column selection is done by passing a list of column names to your DataFrame:
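tips[["total_bill", "tip", "smoker", "time"]].head(5)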
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
In SQL, you can add a calculated column:
With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column:
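# tip_rate did not exist before; assign creates it on the fly
tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)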
WHERE
SELECT *
FROM tips
WHERE time = 'Dinner';
DataFrames can be filtered in multiple ways, the most intuitive of which is using boolean indexing:
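tips[tips["time"] == "Dinner"].head(5)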
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.
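The boolean Series inspected below can be stored in a variable first:

is_dinner = tips["time"] == "Dinner"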
In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool
In [11]: is_dinner.value_counts()
Out[11]:
True 176
False 68
Name: time, dtype: int64
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
Tips of more than $5 at Dinner meals:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
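# AND becomes &, with each condition parenthesized
tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]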
Tips by parties of at least 5 diners OR bill total was more than $45:
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
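# OR becomes |
tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]

The frame DataFrame used in the NULL-checking examples below can be reconstructed, consistent with the output shown, as:

frame = pd.DataFrame(
    {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
)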
In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS
NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN
Getting items where col1 IS NOT NULL can be done with notna().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
GROUP BY
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby()
typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation),
and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a
query getting us the number of tips left by sex:
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64
Notice that in the pandas code we used size() and not count(). This is because count() applies the function to
each column, returning the number of NOT NULL records within each.
In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of
the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to
specific columns.
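# mean tip and number of records, per day
tips.groupby("day").agg({"tip": np.mean, "day": np.size})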
Grouping by more than one column is done by passing a list of columns to the groupby() method.
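tips.groupby(["smoker", "day"]).agg({"tip": [np.size, np.mean]})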
JOIN
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each
method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns
to join on (column names or indices).
Warning: If both key columns contain rows where the key is a null value, those rows will be matched against each
other. This is different from usual SQL join behaviour and can lead to unexpected results.
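A sketch of two DataFrames to demonstrate the joins, each with a key column and random values:

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})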
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
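# merge performs an INNER JOIN by default
pd.merge(df1, df2, on="key")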
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s
index.
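indexed_df2 = df2.set_index("key")
pd.merge(df1, indexed_df2, left_on="key", right_index=True)

LEFT JOIN
Show all records from df1.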
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
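# show all records from df1
pd.merge(df1, df2, on="key", how="left")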
RIGHT JOIN
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
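# show all records from df2
pd.merge(df1, df2, on="key", how="right")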
FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a
match. As of writing, FULL JOINs are not supported in all RDBMS (e.g. MySQL).
Show all records from both tables.
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
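pd.merge(df1, df2, on="key", how="outer")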
UNION
SQL’s UNION is similar to UNION ALL; however, UNION will remove duplicate rows.
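Reusing df1 and df2 from above as a sketch:

# UNION ALL: simple concatenation
pd.concat([df1, df2])

# UNION: concatenation plus duplicate removal
pd.concat([df1, df2]).drop_duplicates()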
LIMIT
In [36]: tips.head(10)
Out[36]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
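# take the top 10 + 5 rows by tip, then keep the last 10 of those
tips.nlargest(10 + 5, columns="tip").tail(10)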
In [38]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2