Description
Hi,
I am really loving Powertools; it's been really useful for me recently. However, I have encountered an issue since starting to use it in a new serverless application. I am currently using it in a few others where it is working OK, and I had not noticed this issue there.
This may well be a 'User Error', and what I am experiencing may be desired or necessary behaviour. I can change the way my code functions so I can still decorate this method, but I think it is worth raising this.
I have provided as much information as I can below.
When a method is decorated with @TRACER.capture_method and that method retrieves a file from S3 using boto3 and returns the S3 object/dictionary, the botocore.response.StreamingBody object has already been read, meaning there is no data left to read.
This affected me when I added tracing to a method that calls S3 to retrieve a CSV file using boto3: since adding Powertools tracing, the retrieved CSV files had no data when converted to a data frame, even though the files were retrieved from S3 and the response body was populated.
Expected Behavior
The method is decorated and the object's stream data is not read.
Current Behavior
See the following code snippet which works if the decorator is removed:
@TRACER.capture_method
def load_file_from_s3(bucket_name, key):
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc
    return obj
Environment
- Powertools version used: Latest
{
"level": "ERROR",
"location": "generate_single_shape:186",
"message": "No columns to parse from file",
"timestamp": "2020-12-09 18:41:03,474",
"service": "",
"sampling_rate": 0,
"cold_start": true,
"function_name": "",
"function_memory_size": "128",
"function_arn": "arn:aws:lambda:eu-west-1:",
"function_request_id": "2b74000d-10a6-48a9-8858-822f1f1f01e5",
"exception": "Traceback (most recent call last):\n File \"/var/task/src/x/controller.py\", line 166, in generate_single_shape\n df_shape = get_shape_dataframe(int(shape_number), \"shape\")\n File \"/var/task/src/x/controller.py\", line 68, in get_shape_dataframe\n df = service.s3_shape_obj_to_pandas(file_obj, columns)\n File \"/var/task/src/forecasting/shape_tool_service.py\", line 85, in s3_shape_obj_to_pandas\n df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)\n File \"/opt/python/pandas/io/parsers.py\", line 688, in read_csv\n return _read(filepath_or_buffer, kwds)\n File \"/opt/python/pandas/io/parsers.py\", line 454, in _read\n parser = TextFileReader(fp_or_buf, **kwds)\n File \"/opt/python/pandas/io/parsers.py\", line 948, in __init__\n self._make_engine(self.engine)\n File \"/opt/python/pandas/io/parsers.py\", line 1180, in _make_engine\n self._engine = CParserWrapper(self.f, **self.options)\n File \"/opt/python/pandas/io/parsers.py\", line 2010, in __init__\n self._reader = parsers.TextReader(src, **kwds)\n File \"pandas/_libs/parsers.pyx\", line 540, in pandas._libs.parsers.TextReader.__cinit__\npandas.errors.EmptyDataError: No columns to parse from file",
"xray_trace_id": "1-5fd11a3a-75ee6c8d49267424492ea927"
}
heitorlessa commented on Dec 10, 2020
Hey Paul - Thanks for raising this issue, and also great to hear you find this library helpful!
If I understood this scenario correctly, you're downloading a file from S3 but not reading its chunks (.read()), since you might have another function doing that, e.g. pandas turning the "file-like" object into a Data Frame.
However, when you add capture_method decorator it is reading the object - calling .read() on your obj return - and this does not happen without the Tracer.
Did I understand this right?
If so, could you also share a snippet on how you're calling load_file_from_s3 fn, and your boto3 version?
From Powertools perspective, we primarily decorate and call your function when you consume it - we are unaware of what's inside.
This could be something else, but more than happy to try reproduce it tomorrow with the same boto version and snippet you have.
Thank you :)
paulalex commented on Dec 10, 2020
Hi Heitor,
Yes, that is exactly what I am doing and the behaviour is as you describe. If I have the decorator, the output outside of that method is b''. If I dump it inside the method before returning, I get my data, and if I remove the decorator and return that object, I also get the data, so it appears as if something is reading the data.
The versions of boto3 and botocore are below:
boto3==1.16.30
botocore==1.19.30
Below are the functions involved, although it is worth reiterating that once the object returned from the boto call is outside of load_file_from_s3 and the function is decorated, there is no data, so currently I have had to remove the decorator from that function.
Apologies if this does turn out to be something else, such as a configuration issue. It's just that the application was working absolutely fine until we started a ticket to add Powertools tracing and structured logging.
heitorlessa commented on Dec 10, 2020
Don't worry @paulalex it could be something with us, as I can't see anything immediately odd in that code.
If there is something on our side, or if it takes too long to figure out, I'll push a context manager for Tracer so you can still use Tracer capabilities within your code more easily; it's something I've been meaning to do too.
I'll take this for a spin tomorrow (5pm here now).
Thanks for sharing all that info
heitorlessa commented on Dec 11, 2020
Hey @paulalex - I've managed to replicate it using that last snippet you sent. There's something in the decorator logic and in X-Ray that I can't figure out yet - I'll keep digging.
In the meantime, you can use a context manager, as it works as expected:
It's really odd because the response from S3, Content-Length at least, is exactly the same as before. It's almost as if read() was called, and given you can only call it once, it returns an empty response - I might need to create an issue in boto3 or X-Ray.
Things I've tried for the record:
I'll update here if we find the root cause. Here's the source I used to reproduce it, now without all the boilerplate: https://github.com/heitorlessa/issue238-pt
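The single-read behaviour suspected above can be illustrated without AWS. This is a minimal sketch, assuming a plain io.BytesIO behaves like the S3 body for this purpose; it is a stand-in and not botocore.response.StreamingBody, and unlike the S3 body it could be rewound.

```python
import io

# Stand-in for the S3 response body. The real object is a
# botocore.response.StreamingBody, which is not used here (assumption).
body = io.BytesIO(b"col_a,col_b\n1,2\n")

first = body.read()   # consumes the whole stream
second = body.read()  # nothing left to read

print(first)   # b'col_a,col_b\n1,2\n'
print(second)  # b''
```

Anything that calls read() before the caller does would therefore leave the caller with an empty body, which matches the b'' reported earlier in the thread.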
paulalex commented on Dec 11, 2020
@heitorlessa Great work, you definitely put a lot of effort in! So it looks like the issue is not with Powertools but with either the X-Ray libraries or boto?
Thanks for the context manager snippet, I will switch to using this in my code for now. Noticed the same thing, that the response length is the same but it appears as if something has already read the body.
Thanks a lot for all of this!
heitorlessa commented on Dec 11, 2020
Funnily enough, if you pass the Content-Length to the read() function to force boto to read the entire chunk, it works consistently:
So it's definitely not an issue with X-Ray per se but something to be investigated in boto and the IO stream, because at first glance boto isn't actually giving a file-like stream but a modified version of it. I need to dig in, as I'm personally hooked on the problem now despite it not being Powertools per se :D
TL;DR: if you pass the content length to read(), it will work without changing the decorator or context manager.
- csv_file = io.BytesIO(obj["Body"].read())
+ csv_file = io.BytesIO(obj["Body"].read(obj["Content-Length"]))
I'll do more digging next week when I free up more time, hope that helps ;)
paulalex commented on Dec 11, 2020
That's great, thanks again, and I hope this doesn't become your weekend!
to-mc commented on Dec 12, 2020
Like Heitor, I was also pretty hooked on figuring this one out! Long story short, it's due to the capture_method decorator trying to capture the response and add it to the X-Ray trace metadata. You can disable this behaviour in the decorator:
@capture_method(capture_response=False)
but by default it's enabled. What this means is that the xray-sdk tries to serialize the function's return value, in this case eventually causing .read() to be called on the StreamingBody object.
Ultimately it's expected behaviour: if you want the function response data in the trace, it has to be serialized. I'll add something to our docs to clarify before closing this issue, as I can certainly imagine this catching others out. Thanks for helping us figure this out with the detailed bug report, @paulalex!
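The mechanism described above can be mimicked with the standard library alone. This is only a rough sketch: the capture decorator, the to_metadata serializer and fake_get_object are all hypothetical stand-ins, and nothing here reflects how Powertools or the X-Ray SDK are actually implemented. It only shows why serializing a return value that holds a stream drains that stream, and why skipping the capture leaves it intact.

```python
import functools
import io


def to_metadata(value):
    """Toy serializer (hypothetical): turns a return value into plain data.

    It mimics only the side effect of reading file-like objects.
    """
    if isinstance(value, dict):
        return {k: to_metadata(v) for k, v in value.items()}
    if hasattr(value, "read"):
        return value.read()  # drains the stream as a side effect
    return value


def capture(capture_response=True):
    """Hypothetical decorator that optionally records the return value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if capture_response:
                wrapper.recorded = to_metadata(result)
            return result
        wrapper.recorded = None
        return wrapper
    return decorator


def fake_get_object():
    # Stand-in for s3_client.get_object(...); io.BytesIO replaces the S3 body.
    return {"Body": io.BytesIO(b"a,b\n1,2\n")}


@capture()  # response captured: body is drained before the caller sees it
def load_captured():
    return fake_get_object()


@capture(capture_response=False)  # response not captured: body untouched
def load_uncaptured():
    return fake_get_object()


drained = load_captured()["Body"].read()
intact = load_uncaptured()["Body"].read()
print(drained)  # b''
print(intact)   # b'a,b\n1,2\n'
```

In the captured case the data is not lost; it ends up in the recorded metadata instead of reaching the caller.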
capture_method #244
heitorlessa commented on Dec 14, 2020
In simple words, the X-Ray SDK uses the jsonpickle dependency to serialize Python objects into JSON when they're added as trace metadata, e.g. any response from a decorated function.
When running locally this is a no-op, meaning it is simply ignored, which is why it wasn't easy to reproduce, but in Lambda it does run - @cakepietoast brilliantly traced all calls to reproduce that ;)
michaelbrewer commented on Dec 15, 2020
@heitorlessa a hidden benefit of the capture_response flag ;)
heitorlessa commented on Dec 21, 2020
Indeed @michaelbrewer, though if I'm honest, the lack of seek in the botocore stream response took me by surprise, as did the upstream side effect of using jsonpickle.
Closing this as it's now available as part of bugfix release 1.9.1.
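Since a stream without seek cannot be rewound, one general option is to read it once and hand out an in-memory copy that can be. The sketch below assumes a hypothetical OneShotStream class standing in for the S3 body and a hypothetical buffer_body helper; neither is part of boto3 or Powertools, and the approach holds the whole object in memory.

```python
import io


class OneShotStream:
    """Hypothetical stand-in for a body that can be read only once."""

    def __init__(self, data):
        self._data = data

    def read(self):
        # Return everything on the first call, nothing afterwards.
        data, self._data = self._data, b""
        return data


def buffer_body(stream):
    """Read the stream once and return a seekable in-memory copy."""
    return io.BytesIO(stream.read())


body = OneShotStream(b"a,b\n1,2\n")
buffered = buffer_body(body)

first = buffered.read()
buffered.seek(0)          # rewinding works on the in-memory copy
second = buffered.read()

print(first == second)    # True
print(body.read())        # b'' (the original stream is spent)
```

How such a buffer interacts with response capture in a traced function is not something this sketch tests.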