
@TRACER.capture_method causes botocore response body of objects retrieved from S3 to be pre-exhausted #238

Closed
@paulalex

Description

Hi,

I am really loving Powertools; it has been very useful for me recently. However, I have encountered an issue since starting to use it in a new serverless application. I am currently using it in a few other applications, where it works fine, and I had not noticed this issue before.

This may well be user error, and what I am experiencing may be desired or necessary behaviour. I could change the way my code works so that I can still decorate this method, but I think it is worth raising.

I have provided as much information as I can below.

When a method is decorated with @TRACER.capture_method, retrieves a file from S3 using boto3, and returns the S3 object/dictionary, the botocore.response.StreamingBody object has already been read, meaning there is no data left to read.

This affected me when I added tracing to a method that retrieves a CSV file from S3 using boto3: since adding Powertools tracing, the retrieved CSV files contain no data when converted to a DataFrame, even though the response body is populated and the files are retrieved from S3.
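
For background, botocore's StreamingBody is a read-once stream: once something has consumed it, a second read() returns b''. A minimal sketch demonstrating this (bucket and key are placeholders):

import boto3

s3_client = boto3.client("s3")

obj = s3_client.get_object(Bucket="my-bucket", Key="data/shapes.csv")
first = obj["Body"].read()   # the full file contents
second = obj["Body"].read()  # b'' -- the stream is already exhausted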

Expected Behavior

The method is decorated and the object's stream data is not read.

Current Behavior

See the following code snippet which works if the decorator is removed:

import boto3
from botocore.exceptions import ClientError

from aws_lambda_powertools import Tracer

TRACER = Tracer()
s3_client = boto3.client("s3")

@TRACER.capture_method
def load_file_from_s3(bucket_name, key):
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc

    return obj
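
Called like this (bucket and key are placeholders), the body comes back empty only when the decorator is present:

obj = load_file_from_s3("my-bucket", "data/shapes.csv")
data = obj["Body"].read()  # b'' with @TRACER.capture_method applied; the file contents without it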

Environment

  • Powertools version used: latest

How to enable debug mode

{
    "level": "ERROR",
    "location": "generate_single_shape:186",
    "message": "No columns to parse from file",
    "timestamp": "2020-12-09 18:41:03,474",
    "service": "",
    "sampling_rate": 0,
    "cold_start": true,
    "function_name": "",
    "function_memory_size": "128",
    "function_arn": "arn:aws:lambda:eu-west-1:",
    "function_request_id": "2b74000d-10a6-48a9-8858-822f1f1f01e5",
    "exception": "Traceback (most recent call last):\n  File \"/var/task/src/x/controller.py\", line 166, in generate_single_shape\n    df_shape = get_shape_dataframe(int(shape_number), \"shape\")\n  File \"/var/task/src/x/controller.py\", line 68, in get_shape_dataframe\n    df = service.s3_shape_obj_to_pandas(file_obj, columns)\n  File \"/var/task/src/forecasting/shape_tool_service.py\", line 85, in s3_shape_obj_to_pandas\n    df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)\n  File \"/opt/python/pandas/io/parsers.py\", line 688, in read_csv\n    return _read(filepath_or_buffer, kwds)\n  File \"/opt/python/pandas/io/parsers.py\", line 454, in _read\n    parser = TextFileReader(fp_or_buf, **kwds)\n  File \"/opt/python/pandas/io/parsers.py\", line 948, in __init__\n    self._make_engine(self.engine)\n  File \"/opt/python/pandas/io/parsers.py\", line 1180, in _make_engine\n    self._engine = CParserWrapper(self.f, **self.options)\n  File \"/opt/python/pandas/io/parsers.py\", line 2010, in __init__\n    self._reader = parsers.TextReader(src, **kwds)\n  File \"pandas/_libs/parsers.pyx\", line 540, in pandas._libs.parsers.TextReader.__cinit__\npandas.errors.EmptyDataError: No columns to parse from file",
    "xray_trace_id": "1-5fd11a3a-75ee6c8d49267424492ea927"
}

Activity

added labels bug (Something isn't working) and triage (Pending triage from maintainers) on Dec 9, 2020
heitorlessa (Contributor) commented on Dec 10, 2020

Hey Paul - Thanks for raising this issue, and also great to hear you find this library helpful!

If I understood this scenario correctly, you're downloading a file from S3 but not reading its chunks (.read()), since you might have another function doing that, e.g. pandas turning the "file-like" object into a DataFrame.

However, when you add the capture_method decorator, something reads the object - calling .read() on the obj you return - and this does not happen without the Tracer.

Did I understand this right?

If so, could you also share a snippet of how you're calling the load_file_from_s3 function, and your boto3 version?

From a Powertools perspective, we primarily decorate and call your function when you consume it - we are unaware of what's inside.
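
Roughly speaking, a capture-style decorator looks like this simplified sketch (not our actual implementation): it opens a subsegment, calls your function, and records the return value as trace metadata. If that recording step were to serialize a StreamingBody by reading it, the stream would be exhausted before the caller sees it:

import functools

def capture_method(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # 1. open a subsegment named after the function
        response = fn(*args, **kwargs)
        # 2. record `response` as subsegment metadata -- the step under suspicion
        # 3. close the subsegment and hand the response back untouched
        return response
    return wrapper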

This could be something else, but I am more than happy to try to reproduce it tomorrow with the same boto3 version and snippet you have.

Thank you :)

paulalex (Author) commented on Dec 10, 2020

Hi Heitor,

Yes, that is exactly what I am doing, and the behaviour is as you describe. If I have the decorator, then the output outside of that method is b''. If I dump it inside the method before returning, I get my data; if I remove the decorator and return that object, I also get the data. So it appears as if something is reading the data.

The boto3 versions are below:

boto3 == 1.16.30
botocore == 1.19.30

Below are the functions involved, although it is worth reiterating that once the object returned by the boto call leaves load_file_from_s3 while it is decorated, there is no data, so for now I have had to remove the decorator from that function.

import io
import os

import boto3
import pandas as pd
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

# Decorator removed for now as the code does not work with it
def load_file_from_s3(bucket_name, key):
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc
    # Reading the body as CSV is fine up to this point; once returned, it is no longer available to be read
    return obj

def s3_shape_obj_to_pandas(obj, columns):
    """Reads a csv file containing the shape data and returns this as a pandas dataframe"""
    csv_file = io.BytesIO(obj["Body"].read())

    # Read the shape file using Pandas
    # Exception is triggered here because the retrieved data is equal to b''
    df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)

    return df

def get_shape_dataframe(shape_number, shape_type):
    """
    Pulls a shape from s3 and returns it as a pandas dataframe object.
    Inputs:
        shape_number: integer in the range 0-6 inclusive
        shape_type: string containing either "shape" or "historic"
    """

    # Read shape data from s3
    bucket_name = os.environ["RESOURCE_BUCKET_NAME"]
    key = service.get_shape_file_name(shape_number, shape_type)
    file_obj = service.load_file_from_s3(bucket_name, key)

    columns = service.get_hh_shape_file_columns()

    if shape_type == "shape":
        df = service.s3_shape_obj_to_pandas(file_obj, columns)
    elif shape_type == "historic":
        df = service.s3_historic_shape_obj_to_pandas(file_obj, columns)
    else:
        raise Exception("Invalid shape type")

    return df

Apologies if this does turn out to be something else, such as a configuration issue. It's just that the application was working absolutely fine until we started a ticket to add Powertools tracing and structured logging.

heitorlessa (Contributor) commented on Dec 10, 2020

Don't worry @paulalex, it could be something on our side, as I can't see anything immediately odd in that code.

If there is something on our side, or if it takes too long to figure out, I'll push a context manager for Tracer so you can still use Tracer capabilities within your code more easily -- it's something I've been meaning to do anyway.

I'll take this for a spin tomorrow (5pm here now).

Thanks for sharing all that info

self-assigned this on Dec 11, 2020
heitorlessa (Contributor) commented on Dec 11, 2020

hey @paulalex - I've managed to replicate it using the snippet you sent last. There's something in the decorator logic and in X-Ray that I can't yet explain - I'll keep digging.

In the meantime, you can use a context manager, as it works as expected:

import botocore.exceptions

def load_file_from_s3(bucket_name, key):
    try:
        with tracer.provider.in_subsegment("## load_file_from_s3") as subsegment:
            obj = s3_client.get_object(Bucket=bucket_name, Key=key)
            # you can add annotations and metadata using subsegment.put_annotation, subsegment.put_metadata
            # https://awslabs.github.io/aws-lambda-powertools-python/core/tracer/#escape-hatch-mechanism
    except botocore.exceptions.ClientError as exc:
        if exc.response["Error"]["Code"] != "404":
            raise exc
    # Reading the body as CSV is fine up to this point; once returned, it is no longer available to be read
    return obj
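
For instance, annotations and metadata can be attached inside the same block (the keys here are just examples):

with tracer.provider.in_subsegment("## load_file_from_s3") as subsegment:
    obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    subsegment.put_annotation(key="S3Key", value=key)         # indexed and searchable in X-Ray
    subsegment.put_metadata(key="bucket", value=bucket_name)  # unindexed, for extra context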

It's really odd because the response from S3, Content-Length at least, is exactly the same as before. It's almost as if read() was called, and given you can only call it once, it returns an empty response - I might need to create an issue in boto3 or X-Ray.

Things I've tried for the record:

  • Compared responses with and without Tracer decorator on Content-Length, calls to S3, etc.
  • Created dummy decorator to rule out any shenanigans with decorated functions
  • Tested reproducing it locally, or with Tracer disabled (no effect)
  • Tested with and without Pandas and IO Stream (no effect)
  • Tested using csv.reader (you know, just in case)
  • Tested with non-CSV files (.txt, .yaml, .json), also to no avail

I'll update here once we find the root cause. Here's the source I used to reproduce it without all the boilerplate: https://github.com/heitorlessa/issue238-pt


paulalex (Author) commented on Dec 11, 2020

@heitorlessa Great work, you definitely put a lot of effort in! So it looks like the issue is not with Powertools but with either the X-Ray libraries or boto?

Thanks for the context manager snippet; I will switch to using it in my code for now. I noticed the same thing: the response length is the same, but it appears as if something has already read the body.

Thanks a lot for all of this!

heitorlessa (Contributor) commented on Dec 11, 2020

Funnily enough, if you pass the content length to the read() function to force boto to read the entire chunk, it works consistently:

file_obj["Body"].read(file_obj["ContentLength"]).decode('utf-8')

So it's definitely not an issue with X-Ray per se, but something to be investigated in boto and the IO stream, because at first glance boto isn't actually giving a file-like stream but a modified version of it -- I need to dig in, as I'm personally hooked on the problem now even though it isn't Powertools per se :D

TL;DR: if you pass the content length to read(), it works with the decorator in place, without switching to the context manager (a fuller example follows the list below):

  • Before: csv_file = io.BytesIO(obj["Body"].read())
  • After: csv_file = io.BytesIO(obj["Body"].read(obj["ContentLength"]))
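
Applied to s3_shape_obj_to_pandas from earlier, only the read() call changes (a sketch, with io and pd imported as before):

def s3_shape_obj_to_pandas(obj, columns):
    """Reads a csv file containing the shape data and returns this as a pandas dataframe"""
    # Passing the content length forces boto to read the entire chunk in one go
    csv_file = io.BytesIO(obj["Body"].read(obj["ContentLength"]))

    df = pd.read_csv(csv_file, usecols=columns, index_col=0, skiprows=2)

    return df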

I'll do more digging next week when I free up more time, hope that helps ;)

paulalex (Author) commented on Dec 11, 2020

That's great, thanks again, and I hope this doesn't take over your weekend!

10 remaining items
