Underlying support code of Dataframe is not thread safe #2440

@jaguarviajero

Description

Hello,

I've run into a nasty bug while trying to work with DataFrames and threads.

The problem is that one thread modifies a local DataFrame by removing a column, but in doing so it somehow corrupts the DataFrame of the other thread.

My input data is a DataFrame holding columns of different data types (float and integer in the example). This DataFrame is grouped by column AA, and each group is placed into a synchronised Queue (from the standard library). Two threads consume the items from the queue and place the results in a standard Python list (whose append method is atomic). Each thread internally creates an instance of FrameProcessor and then enters a job-processing loop that runs until the input queue is empty.

FrameProcessor is a callable object. It takes a DataFrame as input and processes it in two levels. The first level selects the rows of the input DataFrame that match a condition and passes the resulting frame to a method that performs the second-level processing. During the second level, the input DataFrame is modified by removing a column, and a new DataFrame is built from the results of some calculations.

The code breaks in two places:

measurements = df[ df['BB'] == tag ] #from method _processFirstLevel
cc = df.pop('CC') #from method _processSecondLevel

and fails for different reasons each time (the tail of each traceback is shown below):

    File "C:\Python26\lib\site-packages\pandas\core\internals.py", line 570, in _verify_integrity
        assert(len(self.items) == tot_items)
    AssertionError
    File "C:\Python26\lib\site-packages\pandas\core\internals.py", line 26, in __init__
        assert(len(items) == len(values))
    AssertionError
    File "C:\Python26\lib\site-packages\pandas\core\index.py", line 315, in __getitem__
        return arr_idx[key]
    IndexError: index out of bounds
    File "C:\Python26\lib\site-packages\numpy\lib\function_base.py", line 3336, in delete
        "invalid entry")
    ValueError: invalid entry

As far as I can tell, even though the two objects (one per thread) are working on two different DataFrames (one per group), removing the column corrupts the other thread's local variables.

Maybe the problem is caused by my implementation (the way I set up the threads, jobs, etc.). However, I suspect it lies within the deep support layers of the DataFrame class (such as the BlockManager).
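
If the group slices really do share internal block data with the parent frame, one workaround (just a guess on my part, not a confirmed diagnosis) would be to enqueue an explicit copy of each group in the reproduction script below, so that no two threads ever touch the same underlying blocks:

# Hypothetical one-line change to the script below: detach each group
# from the parent frame's blocks before handing it to the threads.
for name, group in groups:
    inbox.put(groups.get_group(name).copy())

If that one-line change makes the crashes disappear, it would point at shared block state rather than at my threading setup.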

I am using

  • Python 2.6.6
  • Pandas 0.9.0
  • NumPy 1.6.1
  • Windows 7 Professional 64-bit

Why is data corruption occurring in threads working with no shared resources?

Below is the code you can use to reproduce the bug. Bear in mind that, because of the threads, the bug may not happen on the first run, and you won't get the same effect every time.

import threading
import numpy as np
import pandas as pd
import Queue

class FrameProcessor(object):

    def __call__(self, *args, **kwargs):
        chunk = args[0]
        result = self._processFirstLevel(chunk)
        return result

    def _processFirstLevel(self, df):
        second_level_tags = list(df['BB'].unique())
        results = []
        while len(second_level_tags) > 0:
            tag = second_level_tags.pop()
            measurements = df[ df['BB'] == tag ]
            result = self._processSecondLevel(measurements)
            result['BB'] = tag
            results.append(result)

        result_Frame = pd.concat(results, axis=0)

        return result_Frame

    def _processSecondLevel(self, df):
        cc = df.pop('CC')
        result_row = {
            'cc_avg': cc.mean(),
            'measurements_in_avg': len(df)}

        result = pd.DataFrame([result_row])

        return result

def test_():
    num_of_groups = 10
    group_size = 30
    sub_group_size = 5
    rows = group_size*num_of_groups
    # Group labels (0,0,...,1,1,...) and repeating sub-group labels.
    a = np.repeat(np.arange(num_of_groups), group_size)
    b = np.tile(np.arange(sub_group_size), rows / sub_group_size)
    c = np.random.randn(rows)
    p = np.random.random_integers(0,10,rows)
    q = np.random.randn(rows)
    x = np.vstack((a,b,c,p,q))

    dates = np.asarray(pd.date_range('1/1/2000', periods=rows))
    df = pd.DataFrame(x.T, index=dates, columns=['AA', 'BB', 'CC','P','Q'])

    results = []
    inbox = Queue.Queue()

    group_by_columns = ['AA']
    groups = df.groupby(group_by_columns)
    for name, group in groups:
        inbox.put(groups.get_group(name))


    def workerShell():
        processor = FrameProcessor()
        while True:
            try:
                job = inbox.get(False)
                result = processor(job)
                results.append(result)
                inbox.task_done()
                print '{0} job done. Jobs left: {1}'.format(id(processor),inbox.qsize())
            except Queue.Empty:
                break

    thread1 = threading.Thread(target=workerShell)
    thread2 = threading.Thread(target=workerShell)
    thread1.start()
    thread2.start()
    # Note: inbox.join() is deliberately omitted here -- if a worker dies
    # with an exception before calling task_done(), join() blocks forever.
    thread1.join()
    thread2.join()

    df = pd.concat(results, axis=0)

    return df

if __name__ == '__main__':

    print 'pandas',pd.__version__
    print 'numpy',np.__version__
    for i in range(5):
        print '--------------test:',i
        test_()
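
If copying every group is too expensive, another way to test the same theory (again only a sketch on my part, not a proper fix) is to serialise all DataFrame work with a single lock. It defeats the purpose of using threads, but if the crashes stop, it would confirm that the corruption comes from concurrent access to shared internals. The hypothetical workerShellLocked below is a drop-in replacement for workerShell inside test_():

frame_lock = threading.Lock()

def workerShellLocked():
    # Identical to workerShell, except that all DataFrame work happens
    # under a single lock (assumption: if the crashes stop, shared
    # internals are to blame).
    processor = FrameProcessor()
    while True:
        try:
            job = inbox.get(False)
            with frame_lock:
                result = processor(job)
            results.append(result)
            inbox.task_done()
        except Queue.Empty:
            break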
