
Feature: Advanced parser utility #147

Closed
@heitorlessa

Description

@heitorlessa

Is your feature request related to a problem? Please describe.

We've heard from a small number of customers that parsing Lambda Event Source payloads requires considerable effort, since these payloads don't have official schemas for Python.

With parsing and classes modelled after these schemas, they can have the benefits of runtime type safety, custom validations on possible values pertinent to their use case, autocomplete, and only parse fields they're interested in.

Describe the solution you'd like

Solution is two-fold:

  • A new parser utility that uses Pydantic to parse and validate incoming/outgoing events, and allow customers to use their own data models
  • Pre-defined schemas and event envelopes for popular event sources, so one can apply and validate their own models against the part of the event where the payload is
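
To make the "envelope plus customer model" idea concrete, here is a minimal sketch using plain Pydantic. The `OrderPlaced` model and the `parse_sqs_bodies` helper are hypothetical names for illustration only, not an API of this library; the only assumption about the event is the SQS shape of `Records[].body` carrying a JSON string.

```python
import json
from typing import List, Type, TypeVar

from pydantic import BaseModel, ValidationError

M = TypeVar("M", bound=BaseModel)


class OrderPlaced(BaseModel):
    # Hypothetical customer-defined model: only the fields the customer cares about.
    order_id: int
    customer_email: str


def parse_sqs_bodies(event: dict, model: Type[M]) -> List[M]:
    """Hypothetical envelope: unwrap each SQS record body, then parse it with `model`."""
    return [model(**json.loads(record["body"])) for record in event["Records"]]


sample_event = {
    "Records": [
        {
            "messageId": "1",
            "body": json.dumps({"order_id": 42, "customer_email": "a@example.com"}),
        }
    ]
}

# Each element is a validated OrderPlaced instance with typed attributes.
orders = parse_sqs_bodies(sample_event, OrderPlaced)
```

A body that does not match the model (for example a non-numeric `order_id`) raises `ValidationError` instead of reaching the business logic.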

This would reduce the time developers spend searching for the official data structure of each event source, improve their security posture on incoming and outgoing events, and increase developer productivity.

Describe alternatives you've considered

  • Implement simple validation using JSON Schemas as well as an extractor utility to retrieve the payload only
  • Bring Pydantic as an optional package to prevent bloating the library for those not using it

The challenge with JSON Schemas is that they typically validate only the structure of incoming/outgoing events, not business rules.

Additional context

An initial implementation exists in #118. It lacked customer data points at the time, but could be revisited depending on interest in this feature.

Activity

added
triagePending triage from maintainers
and removed
triagePending triage from maintainers
on Aug 28, 2020
jplock

jplock commented on Aug 28, 2020

@jplock

I like the idea of using JSON schemas strictly for validating AWS managed event types that don’t change. Validating the payload would be the responsibility of the consumer (using pydantic or something similar).

changed the title [-]Advanced parser utility[/-] [+]Feature: Advanced parser utility[/+] on Aug 28, 2020
pinned this issue on Aug 28, 2020
Nr18

Nr18 commented on Aug 31, 2020

@Nr18

My preference would be to use pydantic; having the ability to add business rule validation would be beneficial to me.

I actually ran into the issue of having to set my date field to a string in the JSON schema because the date is optional... With pydantic, you can have an Optional[date] without having to write an additional statement just to check that the string is actually a date.
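
A minimal sketch of the `Optional[date]` case described above, assuming Pydantic is installed. The `Invoice` model and its field names are hypothetical.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    # Hypothetical model: the date may be absent, but if present it must be a real date.
    invoice_id: str
    due_date: Optional[date] = None


# An ISO-formatted string is coerced to a datetime.date.
with_date = Invoice(invoice_id="inv-1", due_date="2020-08-31")

# Omitting the field is accepted and leaves it as None.
without_date = Invoice(invoice_id="inv-2")

# A string that is not a date is rejected without any extra hand-written check.
try:
    Invoice(invoice_id="inv-3", due_date="not-a-date")
    rejected = False
except ValidationError:
    rejected = True
```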

koxudaxi

koxudaxi commented on Aug 31, 2020

@koxudaxi

I always use pydantic in Lambda for events and payloads.
Pydantic is a great solution for validation and parsing.

However, the library is a little heavy for Lambda.
Pydantic is compiled with Cython, which produces some *.so files.
I calculated the size of these files: about 76MB. https://pypi.org/project/pydantic/#files
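
One way to check the extracted footprint of any installed package is to sum the file sizes under its install directory. This is a stdlib-only sketch; `installed_size_bytes` is a hypothetical helper name, and the result depends on the platform and package version, so no figure is quoted here.

```python
import importlib.util
from pathlib import Path


def installed_size_bytes(package_name: str) -> int:
    """Sum the on-disk size of every file belonging to an installed package."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ModuleNotFoundError(package_name)
    if spec.submodule_search_locations:
        # Regular package: walk each directory it is installed in.
        return sum(
            path.stat().st_size
            for location in spec.submodule_search_locations
            for path in Path(location).rglob("*")
            if path.is_file()
        )
    if spec.origin and Path(spec.origin).is_file():
        # Single-file module.
        return Path(spec.origin).stat().st_size
    # Built-in or frozen module with no file on disk.
    return 0


# Example with a stdlib package; swap in "pydantic" inside the deployment environment.
print(installed_size_bytes("json"))
```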

I suggest Pydantic be supplied as an option.
I also think powertools should provide two types of validator (decorator): JSON Schema and Pydantic.
Users can select one.

Also, I develop a code generator that generates pydantic models from JSON Schema.
https://github.com/koxudaxi/datamodel-code-generator
If we maintain JSON Schemas, we can also get pydantic models from them with the code generator.

Additionally, this code generator can create models from JSON data.
Last week, I created the AWS Connect event model from an event object that I got from CloudWatch Logs. It's very useful.

Backgrounds

I created an experimental project that provides pydantic models for events.
Example: https://github.com/koxudaxi/pydantic-collection/blob/master/pydantic_collection/aws/sns/models.py

I often define pydantic models for events in my projects.
I would like to re-use the models across all projects.
But it's difficult to create models for all events by myself.
I hope that aws-lambda-powertools-python will maintain all models.

heitorlessa

heitorlessa commented on Aug 31, 2020

@heitorlessa
ContributorAuthor

hey @koxudaxi - This is interesting! I was under the impression only the wheels was going to count (8.2M for manylinux). It's also great to hear you created all this Pydantic tooling as I still have questions about the UX and code generation :) -- TIL.

At the moment, we're collecting customer demand on Pydantic usage within Serverless. We want to support it (see #118), but we're also mindful of justifying customer demand, Pydantic as an extra dependency, simplified UX, how much we want to abstract, and docs to ease the transition from JSON Schemas to Pydantic.

We're on the fence as to whether to create a single validator utility that supports dual modes (JSON Schema or Pydantic), or a parser utility solely focused on Pydantic, as the docs and usage will be largely different.

Would love to hear feedback on this front -- And yes, we'd be happy to maintain models for Lambda Event Sources (only), hence a longer discussion so we can get Pydantic right without breaking our Tenets

heitorlessa

heitorlessa commented on Aug 31, 2020

@heitorlessa
ContributorAuthor

@jplock initial simple validator for JSON Schema with optional data selector (envelope) using JMESPath as an extra dependency #153

gmcrocetti

gmcrocetti commented on Sep 1, 2020

@gmcrocetti
Contributor

This proposal is great! It's something really useful that I've been wishing for for a long time.

About the description, it's unclear to me why we need to create JSON schemas representing AWS event types instead of Python annotations. Please don't read this as criticism; I really don't know. For my use case, the killer feature of pydantic or any other "parsing" lib is that we can deal with objects and validate/add business behavior that we're unable to express in a schema. What are the use cases for JSON Schema and pydantic?

@heitorlessa, maybe I'm going "too far", but IMO two points need to be discussed, at least at a "design" level. The first is related to input parsing: it would be great to design something pluggable. For example, if client X's codebase is entirely written in marshmallow, are we going to force them to rewrite everything to a new standard? It may be only a few cases, but I'm sure we can provide an interface to plug in any parsing lib they're used to; powertools shouldn't provide those libs. The second is that our utility must explicitly require the what and the how to correctly parse a message. The "what" is an AWS event source and the "how" is a schema. It looks like you've already done this in your PR.
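
A rough sketch of what such a pluggable interface could look like, using only the standard library. Every name here (`EventParser`, `DataclassParser`, `with_parser`, `Greeting`) is hypothetical; a marshmallow or Pydantic adapter would simply expose the same `parse` method.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Dict, Protocol


class EventParser(Protocol):
    """Hypothetical plug-in contract: turn a raw event dict into an object."""

    def parse(self, payload: Dict[str, Any]) -> Any: ...


@dataclass
class Greeting:
    name: str


class DataclassParser:
    """Stand-in adapter backed by a dataclass instead of a third-party lib."""

    def __init__(self, target: Callable[..., Any]) -> None:
        self._target = target

    def parse(self, payload: Dict[str, Any]) -> Any:
        return self._target(**payload)


def with_parser(parser: EventParser) -> Callable:
    """Decorator that parses the event before the handler sees it."""

    def decorator(handler: Callable[[Any, Any], Any]) -> Callable[[Dict[str, Any], Any], Any]:
        @functools.wraps(handler)
        def wrapper(event: Dict[str, Any], context: Any) -> Any:
            return handler(parser.parse(event), context)

        return wrapper

    return decorator


@with_parser(DataclassParser(Greeting))
def handler(event: Greeting, context: Any) -> str:
    return f"hello {event.name}"
```

Calling `handler({"name": "Ana"}, None)` hands the handler a `Greeting` object rather than a dict; swapping the parsing library means swapping only the adapter passed to the decorator.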

About pydantic, it'd be great to see it here, as an extra dependency for sure. Add it to a layer and we're "done".

ran-isenberg

ran-isenberg commented on Sep 1, 2020

@ran-isenberg
Contributor

@gmcrocetti @koxudaxi you can see my pydantic PR #118.
It also includes automatic envelope parsing for schemas for SQS, eventbridge, dynamoDb streams and custom user schemas.

koxudaxi

koxudaxi commented on Sep 1, 2020

@koxudaxi

@heitorlessa

This is interesting! I was under the impression only the wheels was going to count (8.2M for manylinux).

It's compressed. You may be surprised when extracting it.

OK, I understand what we should discuss in this phase.

Let me describe the great UX of Pydantic.
It parses a dict into a model.
(Of course, Pydantic can validate input data. However, JSON Schema has the same benefit.)

Lambda event objects are deeply nested structures.
It's hard to understand the structure for each service.
Also, IDEs can't offer auto-completion or type-checking for plain dicts.
(TypedDict is a better way to get static type analysis.)

Pydantic solves these problems.
We can easily access nested attributes in Pydantic models.
I feel this is the best practice for handling Lambda event objects.
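
As a small illustration of the difference, here is a sketch with a trimmed SNS event. The models are hypothetical and cover only a few fields of the real payload; Pydantic ignores the rest by default.

```python
from typing import List, Optional

from pydantic import BaseModel


class SnsMessage(BaseModel):
    # Only a subset of the fields SNS sends; names mirror the event keys.
    MessageId: str
    Subject: Optional[str] = None
    Message: str


class SnsRecord(BaseModel):
    EventSource: str
    Sns: SnsMessage


class SnsEvent(BaseModel):
    Records: List[SnsRecord]


raw_event = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "EventVersion": "1.0",  # not modelled, ignored by default
            "Sns": {"MessageId": "abc-123", "Subject": "Hi", "Message": "hello"},
        }
    ]
}

# Plain dict access: string keys, no completion, typos only fail at runtime.
message_from_dict = raw_event["Records"][0]["Sns"]["Message"]

# Model access: attributes that an IDE and a type checker can follow.
event = SnsEvent(**raw_event)
message_from_model = event.Records[0].Sns.Message
```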

I want to hear other "customers" too.

koxudaxi

koxudaxi commented on Sep 1, 2020

@koxudaxi

@risenberg-cyberark
Great work!!
I will review the PR once it is unlocked.

added
pending-releaseFix or implementation already in dev waiting to be released
on Oct 2, 2020
unpinned this issue on Oct 25, 2020
heitorlessa

heitorlessa commented on Oct 26, 2020

@heitorlessa
ContributorAuthor

Everyone - This is now available in the 1.7.0 release.

removed
pending-releaseFix or implementation already in dev waiting to be released
on Oct 26, 2020
          Feature: Advanced parser utility · Issue #147 · aws-powertools/powertools-lambda-python