PERF: 5x speedup for read_json() with orient='index' by avoiding transpose #26773

qwhelan · 2019-06-10T19:11:37Z

The .T operator can be quite slow on mixed-type DataFrames because it creates object dtype columns. In comparison, direct construction with DataFrame.from_dict() can generally be much more efficient.
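
A minimal sketch of the dtype effect described above, not the actual change made in pandas/io/json/json.py. The `records` dict and its keys are made up for illustration; it stands in for JSON that has already been parsed in orient='index' form.

```python
import pandas as pd

# Hypothetical parsed JSON in orient='index' layout: {row_label: {column: value}}
records = {
    "row1": {"a": 1, "b": 1.5, "c": "x"},
    "row2": {"a": 2, "b": 2.5, "c": "y"},
}

# Build column-wise first, then transpose. Each intermediate column mixes
# ints, floats and strings, so every column ends up as object dtype.
via_transpose = pd.DataFrame(records).T

# Build row-wise directly. Each resulting column is homogeneous, so pandas
# can infer a numeric dtype for "a" and "b".
direct = pd.DataFrame.from_dict(records, orient="index")

print(via_transpose.dtypes)
print(direct.dtypes)
```

Both frames hold the same values with the same labels; only the column dtypes differ, which is where the extra cost of the transposed path comes from.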

Making that swap inside pd.read_json() yields a ~5-6x speedup for the orient='index' case:

       before           after         ratio
     [d47fc0cb]       [b0fd99ec]
     <read_json_speedup~1>       <read_json_speedup>
-      5.37±0.03s          907±5ms     0.17  io.json.ReadJSON.time_read_json('index', 'int')
-      5.27±0.01s          804±3ms     0.15  io.json.ReadJSON.time_read_json('index', 'datetime')

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd · 2019-06-10T19:14:28Z

cc @TomAugspurger would this be related to #24387 at all?

WillAyd · 2019-06-10T19:31:32Z

Ignore my previous comment; I was too focused on the constructor and not the transposition. This makes sense to me.

codecov · 2019-06-10T19:47:46Z

Codecov Report

Merging #26773 into master will decrease coverage by 50.5%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26773       +/-   ##
===========================================
- Coverage   91.71%   41.21%   -50.51%     
===========================================
  Files         178      178               
  Lines       50771    50771               
===========================================
- Hits        46567    20926    -25641     
- Misses       4204    29845    +25641

Flag	Coverage Δ
#multiple	`?`
#single	`41.21% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/json/json.py	`63.17% <0%> (-30.07%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/plotting/_matplotlib/__init__.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
pandas/core/sparse/scipy_sparse.py	`10.14% <0%> (-89.86%)`	⬇️
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47fc0c...b0fd99e. Read the comment docs.

alimcmaster1 · 2019-06-10T20:33:06Z

Nice! @qwhelan, mind taking a look at the test cases? (Looks like this changes the order of the index.) https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12630

>   raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: DataFrame.columns are different
E   
E   DataFrame.columns values are different (100.0 %)
E   [left]:  Index(['A', 'B', 'C', 'D'], dtype='object')
E   [right]: Index(['D', 'C', 'B', 'A'], dtype='object')

qwhelan · 2019-06-10T20:45:33Z

@alimcmaster1 Given that this only fails on 3.5, I'm guessing this is a dict-orderedness issue in from_dict()
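
For context on the failure above: plain dicts only guarantee insertion order from Python 3.7 onward, so a frame built from a dict on 3.5 can come back with its columns in a different order. A small sketch of how a column-order-only mismatch shows up and how it can be checked; the frames here are invented for illustration and are not taken from the pandas test suite.

```python
import pandas as pd

# Same data, but the columns arrive in a different order.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [3, 4], "A": [1, 2]})

# A strict comparison fails because column order is part of equality.
try:
    pd.testing.assert_frame_equal(left, right)
    strict_equal = True
except AssertionError:
    strict_equal = False

# Option 1: ignore the order of index and columns during comparison.
pd.testing.assert_frame_equal(left, right, check_like=True)

# Option 2: reorder one frame's columns explicitly before comparing.
pd.testing.assert_frame_equal(left, right[left.columns])

print(strict_equal)
```

Whether a test should tolerate reordering or the reader should pin the order is a judgment call for the PR; the sketch only shows the two ways to tell the cases apart.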

jreback · 2019-07-08T12:49:56Z

lgtm, can you add a note in Performance for 0.25.0, ping on green.

jreback · 2019-07-17T11:49:32Z

thanks @qwhelan

WillAyd added the Performance Memory or execution speed performance label Jun 10, 2019

jreback added the IO JSON read_json, to_json, json_normalize label Jun 27, 2019

qwhelan force-pushed the read_json_speedup branch 2 times, most recently from d77a2a2 to 5edd63c Compare July 8, 2019 05:44

jreback added this to the 0.25.0 milestone Jul 8, 2019

PERF: 5x speedup for read_json() with orient='index' by avoiding transpose

cef3d80

qwhelan force-pushed the read_json_speedup branch from 5edd63c to cef3d80 Compare July 8, 2019 14:50

jreback merged commit a373e0e into pandas-dev:master Jul 17, 2019
