to_hdf writes data that doesn't match read back #7605

@vm-wylbur

Here's the code:

    import pandas as pd
    from pandas.util.testing import assert_frame_equal

    # Write the frame to HDF5, overwriting any existing file.
    records.to_hdf(
        args.output, 'records',
        mode='w', format='fixed', append=False,
        complib='zlib', complevel=7, fletcher32=True)

    # Read the same key straight back.
    r2 = pd.read_hdf(
        path_or_buf=args.output, key='records',
        encoding='utf-8', start=None, stop=None)

    # Expect an exact round trip.
    assert_frame_equal(records, r2, check_exact=True)

And the traceback:

/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['dataset', 'record_id', 'DOD', 'CC', 'sex', 'name', 'loc', 'manner_of_death', 'eth', 'social_group', 'occ', 'clean_loc', 'month_of_death', 'year_of_death', 'name_sorted']]

  warnings.warn(ws, PerformanceWarning)
Traceback (most recent call last):
  File "src/import.py", line 59, in <module>
    tools.epilog(args, records, logger)
  File "/Users/pball/git/CO/match/import/src/lib/import_tools.py", line 46, in epilog
    assert_frame_equal(records, r2, check_exact=True)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 585, in assert_frame_equal
    check_exact=check_exact)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 530, in assert_series_equal
    right.values))
AssertionError: [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'] is not equal to [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'].
make: *** [output/input-records.h5] Error 1

I've been trying to figure out why upstream fixes didn't seem to appear downstream, and I finally traced it here: apparently to_hdf writes a file that no longer matches the original frame when it's read back. As I've re-run this over the last hour or so, different fields have come up in the AssertionError.

Here are a few things that do not eliminate the error: turning compression on or off, and switching format between table and fixed. However, changing these arguments does change which field assert_frame_equal reports as unequal.
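The two arrays printed in the AssertionError look identical, and both start with nan. One thing that might be worth ruling out (an assumption on my part, not a confirmed diagnosis) is that the exact comparison trips over the missing values, since NaN never compares equal to itself. Here is a minimal sketch using only the standard library, with plain lists standing in for one column's values; `is_nan` and `nan_aware_mismatches` are hypothetical helpers, not pandas functions.

```python
import math


def is_nan(value):
    """Return True only for float NaN; strings and other objects are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def nan_aware_mismatches(left, right):
    """Return the indices where left and right differ, counting NaN/NaN as a match.

    Hypothetical helper: plain sequences stand in for one column's values.
    """
    if len(left) != len(right):
        raise ValueError("sequences differ in length")
    return [
        i
        for i, (a, b) in enumerate(zip(left, right))
        if not (is_nan(a) and is_nan(b)) and a != b
    ]


# Two separately created NaN objects, as a write/read round trip would produce.
written = [float("nan"), "c2681113", "c12266508"]
read_back = [float("nan"), "c2681113", "c12266508"]

# Naive elementwise equality flags the NaN position even though the data match.
naive = [i for i, (a, b) in enumerate(zip(written, read_back)) if a != b]

# The NaN-aware check reports no differences.
aware = nan_aware_mismatches(written, read_back)

print("naive mismatches:", naive)
print("NaN-aware mismatches:", aware)
```

If a check like this, run per column, comes back empty for every column, the round trip is probably fine and the comparison itself is what fails. If it reports real indices, those rows are where to look.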

I have no idea how to reproduce this without my entire dataset, which is unfortunately confidential. I'll fall back to CSV for now, and I hope I'm just doing something horribly dumb that we can fix.
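One way a shareable reproduction might be built is to mask the string values while leaving the missing values and the row layout untouched, so the masked data keep the same NaN pattern as the real data. The sketch below is only an illustration under that assumption: `mask_value`, `mask_column`, and the salt are hypothetical names, plain lists stand in for columns, and whether a truncated salted hash is enough for a given confidentiality requirement is a separate question.

```python
import hashlib
import math


def mask_value(value, salt="example-salt"):
    """Replace a string with a salted hash token; leave everything else untouched.

    Hypothetical helper. Equal inputs map to equal tokens, so duplicates and
    the positions of missing values are preserved.
    """
    if isinstance(value, str):
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        return "v" + digest[:12]
    return value


def mask_column(values, salt="example-salt"):
    """Mask every string in a sequence of column values."""
    return [mask_value(v, salt) for v in values]


# Plain list standing in for one object column with a missing value.
column = [float("nan"), "c2681113", "c2681113", "c12266508"]
masked = mask_column(column)

print(masked)
```

With a real frame, the same function could be applied to each object column, for example through `Series.map(mask_value)`, before writing the masked frame with to_hdf and checking whether the mismatch still appears.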
