Hello,
I've run into a nasty bug as I try to work with dataframes and threads.
The problem is that one thread modifies a local dataframe by removing a column but in doing so somehow corrupts the dataframe of the other thread.
My input data is a dataframe with multiple data types (float and integer in the example). The dataframe is grouped by column AA, and each group is placed into a synchronised Queue (from the standard library). Two threads consume the items from the queue and append the results to a standard Python list (whose append method is atomic).

Each thread internally creates an instance of a FrameProcessor and then enters a job-processing loop that runs until the input queue is empty. The FrameProcessor is a callable object: it takes a dataframe as input and processes it in two levels. The first level selects the rows of the input dataframe that match a condition and passes the resulting dataframe to a method that performs the second-level processing. The second level modifies the input dataframe by removing a column and builds a new dataframe from the results of some calculations.
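To make the flow concrete, here is a condensed, single-threaded sketch of the two-level processing just described (illustration only; df and the column names AA, BB, CC are the ones built in the full reproduction script at the end of this report):

# Condensed single-threaded sketch of the two-level processing (illustration only).
for name, group in df.groupby('AA'):              # first level: one dataframe per AA group
    for tag in group['BB'].unique():              # second level: split each group on BB
        measurements = group[group['BB'] == tag]  # the row selection that later fails under threads
        cc = measurements.pop('CC')               # the column removal that later fails under threads
        print tag, cc.mean(), len(measurements)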
The code breaks in two places:
measurements = df[df['BB'] == tag]  # from method _processFirstLevel
cc = df.pop('CC')                   # from method _processSecondLevel
and for different reasons (only the last lines of the tracebacks are shown):
File "C:\Python26\lib\site-packages\pandas\core\internals.py", line 570, in _verify_integrity
assert(len(self.items) == tot_items)
AssertionError
File "C:\Python26\lib\site-packages\pandas\core\internals.py", line 26, in __init__
assert(len(items) == len(values))
AssertionError
File "C:\Python26\lib\site-packages\pandas\core\index.py", line 315, in __getitem__
return arr_idx[key]
IndexError: index out of bounds
File "C:\Python26\lib\site-packages\numpy\lib\function_base.py", line 3336, in delete
"invalid entry")
ValueError: invalid entry
As far as I can tell, although the two objects (one per thread) are working on two different dataframes (one per group), they corrupt each other's local variables when removing the column.
Maybe the problem is caused by my implementation (the way I set up the threads, jobs, etc.). However, I suspect it lies within the deep support layers of the DataFrame class (such as the BlockManager).
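Two experiments that might help narrow down where the race is (both are sketches/assumptions on my part, not confirmed fixes, and they reuse the names from the reproduction script below): queue a deep copy of each group so the threads cannot touch the parent frame's internals, and serialize all dataframe work behind a lock to see whether the corruption disappears.

# Experiment sketch 1 (assumption): queue independent copies of each group.
for name, group in groups:
    inbox.put(groups.get_group(name).copy())

# Experiment sketch 2 (assumption): serialize all dataframe work with a lock;
# if the corruption disappears, the race is likely inside the DataFrame support layers.
frame_lock = threading.Lock()

def workerShell():
    processor = FrameProcessor()
    while True:
        try:
            job = inbox.get(False)
            with frame_lock:
                result = processor(job)
            results.append(result)
            inbox.task_done()
        except Queue.Empty:
            break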
I am using:
- Python 2.6.6
- Pandas 0.9.0
- NumPy 1.6.1
- Windows 7 Professional 64-bit
Why is data corruption occurring in threads that share no resources?
Below is the code you can use to reproduce the bug. Bear in mind that, because of the threads, the bug may not appear on the first run, and you won't get the same effect every time.
import threading
import numpy as np
import pandas as pd
import Queue


class FrameProcessor(object):
    # Callable worker object: processes one group dataframe in two levels.

    def __call__(self, *args, **kwargs):
        chunk = args[0]
        result = self._processFirstLevel(chunk)
        return result

    def _processFirstLevel(self, df):
        # Split the group on column BB and process each sub-group.
        second_level_tags = list(df['BB'].unique())
        results = []
        while len(second_level_tags) > 0:
            tag = second_level_tags.pop()
            measurements = df[df['BB'] == tag]   # first crash site
            result = self._processSecondLevel(measurements)
            result['BB'] = tag
            results.append(result)
        result_Frame = pd.concat(results, axis=0)
        return result_Frame

    def _processSecondLevel(self, df):
        # Remove column CC and build a one-row summary frame.
        cc = df.pop('CC')                        # second crash site
        result_row = {'cc_avg': cc.mean(),
                      'measurements_in_avg': len(df)}
        result = pd.DataFrame([result_row])
        return result


def test_():
    num_of_groups = 10
    group_size = 30
    sub_group_size = 5
    rows = group_size * num_of_groups

    # Build a mixed-dtype frame: AA is the group key, BB the sub-group key.
    a = []
    for x in range(num_of_groups):
        a.extend(np.repeat(x, group_size))
    a = np.array(a)
    b = np.array(np.tile(range(sub_group_size), rows / sub_group_size))
    c = np.random.randn(rows)
    p = np.random.random_integers(0, 10, rows)
    q = np.random.randn(rows)
    x = np.vstack((a, b, c, p, q))
    dates = np.asarray(pd.date_range('1/1/2000', periods=rows))
    df = pd.DataFrame(x.T, index=dates, columns=['AA', 'BB', 'CC', 'P', 'Q'])

    results = []

    # One job per AA group.
    inbox = Queue.Queue()
    group_by_columns = ['AA']
    groups = df.groupby(group_by_columns)
    for name, group in groups:
        inbox.put(groups.get_group(name))

    def workerShell():
        # Each thread gets its own FrameProcessor and drains the shared queue.
        processor = FrameProcessor()
        while True:
            try:
                job = inbox.get(False)
                result = processor(job)
                results.append(result)
                inbox.task_done()
                print '{0} job done. Jobs left: {1}'.format(id(processor), inbox.qsize())
            except Queue.Empty:
                break

    thread1 = threading.Thread(target=workerShell)
    thread2 = threading.Thread(target=workerShell)
    thread1.start()
    thread2.start()
    inbox.join()
    thread1.join()
    thread2.join()

    df = pd.concat(results, axis=0)
    return df


if __name__ == '__main__':
    print 'pandas', pd.__version__
    print 'numpy', np.__version__
    for i in range(5):
        print '--------------test:', i
        test_()