qwhelan
Contributor

@qwhelan qwhelan commented Jun 10, 2019

The .T operator can be quite slow on mixed-type DataFrames because it creates object-dtype columns. In comparison, direct construction with DataFrame.from_dict() can generally be much more efficient.
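
A minimal sketch of the dtype effect described above. The data and variable names are hypothetical, and this is an illustration rather than the actual pd.read_json() code path. Transposing leaves every column as object dtype, while DataFrame.from_dict(..., orient="index") infers a dtype per column:

```python
import pandas as pd

# Hypothetical row-oriented data, shaped like orient='index' JSON:
# outer keys are row labels, inner keys are column names.
records = {
    "row0": {"a": 1, "b": 1.5, "c": "x"},
    "row1": {"a": 2, "b": 2.5, "c": "y"},
}

# Column-oriented construction followed by a transpose.
# Each intermediate column mixes int, float, and str, so everything
# ends up as object dtype, and it stays object after .T.
transposed = pd.DataFrame(records).T

# Direct construction in row orientation; dtypes are inferred per column.
direct = pd.DataFrame.from_dict(records, orient="index")

print(transposed.dtypes)
print(direct.dtypes)
```

Object columns hold generic Python objects, which is likely where much of the extra cost of the transpose route comes from.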

Making that swap inside pd.read_json() yields a ~5-6x speedup for the orient='index' case:

       before           after         ratio
     [d47fc0cb]       [b0fd99ec]
     <read_json_speedup~1>       <read_json_speedup>
-      5.37±0.03s          907±5ms     0.17  io.json.ReadJSON.time_read_json('index', 'int')
-      5.27±0.01s          804±3ms     0.15  io.json.ReadJSON.time_read_json('index', 'datetime')
  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
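
For anyone who wants a rough local check without asv, the orient='index' read can be timed with a sketch like the one below. The frame size, column names, and repeat count are arbitrary assumptions and much smaller than the benchmark above, so the absolute numbers will not match:

```python
import io
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: sizes are arbitrary, not those used by the asv benchmark.
n_rows, n_cols = 2000, 5
frame = pd.DataFrame(
    np.random.default_rng(0).standard_normal((n_rows, n_cols)),
    columns=["float_{}".format(i) for i in range(n_cols)],
)
frame["label"] = "x"  # one string column makes the frame mixed-type

# Serialize with row labels as the outer JSON keys.
payload = frame.to_json(orient="index")


def load():
    # Wrap the JSON in StringIO rather than passing a literal string,
    # which newer pandas versions deprecate.
    return pd.read_json(io.StringIO(payload), orient="index")


elapsed = timeit.timeit(load, number=3)
print("3 reads took {:.3f}s".format(elapsed))
```

Running it on the commits before and after the change should show the relative difference, although timings will vary by machine and pandas version.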

@WillAyd
Member

WillAyd commented Jun 10, 2019

cc @TomAugspurger would this be related to #24387 at all?

@WillAyd WillAyd added the Performance Memory or execution speed performance label Jun 10, 2019
@WillAyd
Member

WillAyd commented Jun 10, 2019

Ignore my previous comment; I was too focused on the constructor and not the transposition. This makes sense to me.

@codecov

codecov bot commented Jun 10, 2019

Codecov Report

Merging #26773 into master will decrease coverage by 50.5%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26773       +/-   ##
===========================================
- Coverage   91.71%   41.21%   -50.51%     
===========================================
  Files         178      178               
  Lines       50771    50771               
===========================================
- Hits        46567    20926    -25641     
- Misses       4204    29845    +25641
Flag Coverage Δ
#multiple ?
#single 41.21% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 63.17% <0%> (-30.07%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/sparse/scipy_sparse.py 10.14% <0%> (-89.86%) ⬇️
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47fc0c...b0fd99e. Read the comment docs.

@alimcmaster1
Member

Nice! @qwhelan, mind taking a look at the test cases? (Looks like this changes the order of the index.) https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12630

>   raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: DataFrame.columns are different
E   
E   DataFrame.columns values are different (100.0 %)
E   [left]:  Index(['A', 'B', 'C', 'D'], dtype='object')
E   [right]: Index(['D', 'C', 'B', 'A'], dtype='object')

@qwhelan
Contributor Author

qwhelan commented Jun 10, 2019

@alimcmaster1 Given that this only fails on 3.5, I'm guessing this is a dict-orderedness issue in from_dict()
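
To illustrate the guess above: plain dicts are not guaranteed to keep insertion order on Python 3.5, so column order derived from nested dict keys can differ between interpreters. One hypothetical way to make the order deterministic is to pass columns= explicitly. This is only a sketch with made-up data, and nothing in this thread says it is how the PR resolved the failure:

```python
import pandas as pd

# Hypothetical parsed JSON whose inner keys arrive in reverse order.
rows = {
    "r0": {"D": 4, "C": 3, "B": 2, "A": 1},
    "r1": {"D": 8, "C": 7, "B": 6, "A": 5},
}

# Without columns=, the column order follows whatever order the dict yields,
# which is not guaranteed to be insertion order on Python 3.5.
# Passing columns= pins the order regardless of dict behavior.
df = pd.DataFrame.from_dict(rows, orient="index", columns=["A", "B", "C", "D"])

print(list(df.columns))
```

The columns= argument of from_dict() is only accepted together with orient="index".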

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Jun 27, 2019
@qwhelan qwhelan force-pushed the read_json_speedup branch 2 times, most recently from d77a2a2 to 5edd63c Compare July 8, 2019 05:44
@jreback jreback added this to the 0.25.0 milestone Jul 8, 2019
@jreback
Contributor

jreback commented Jul 8, 2019

lgtm, can you add a note in Performance for 0.25.0, ping on green.

@qwhelan qwhelan force-pushed the read_json_speedup branch from 5edd63c to cef3d80 Compare July 8, 2019 14:50
@jreback jreback merged commit a373e0e into pandas-dev:master Jul 17, 2019
@jreback
Contributor

jreback commented Jul 17, 2019

thanks @qwhelan
