Description
Is your feature request related to a problem? Please describe.
We've heard from a small number of customers that parsing Lambda Event Source payloads requires considerable effort, since these payloads don't have official schemas for Python.
With parsing and classes modelled after these schemas, they gain runtime type safety, custom validations on values pertinent to their use case, autocomplete, and the ability to parse only the fields they're interested in.
Describe the solution you'd like
Solution is two-fold:
- A new `parser` utility that uses Pydantic to parse and validate incoming/outgoing events, allowing customers to use their own data models
- Pre-defined schemas and event envelopes for popular event sources, so one can apply and validate their models against where the payload is
This would reduce the amount of time developers invest searching for the official data structure of each event source, improve their security posture on incoming and outgoing events, and increase developer productivity.
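As a rough illustration of the parsing half, using plain Pydantic only (the `OrderEvent` model and field names are hypothetical, not a proposed powertools API):

```python
from pydantic import BaseModel


class OrderEvent(BaseModel):
    order_id: int
    amount: float
    currency: str


def handler(event: dict, context) -> dict:
    # Validates types and required fields at runtime; raises
    # pydantic.ValidationError on a malformed payload.
    order = OrderEvent.parse_obj(event)
    return {"order_id": order.order_id}
```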
Describe alternatives you've considered
- Implement simple validation using JSON Schemas as well as an extractor utility to retrieve the payload only
- Bring Pydantic as an optional package to prevent bloating the library for those not using it
The challenge with JSON Schemas is that they typically don't validate business rules for incoming/outgoing events, merely the shape of the payload.
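For instance, a rule that compares two fields is awkward to express in JSON Schema but is a small validator in Pydantic. A hedged sketch (the `Refund` model and the rule are hypothetical):

```python
from pydantic import BaseModel, validator


class Refund(BaseModel):
    charge_amount: float
    refund_amount: float

    @validator("refund_amount")
    def refund_within_charge(cls, value, values):
        # Business rule: a refund may not exceed the original charge.
        # Plain JSON Schema cannot express this cross-field check.
        if "charge_amount" in values and value > values["charge_amount"]:
            raise ValueError("refund exceeds original charge")
        return value
```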
Additional context
An initial implementation lacked customer data points at the time, but could be revisited depending on interest in this feature: #118
jplock commented on Aug 28, 2020
I like the idea of using JSON schemas strictly for validating AWS managed event types that don’t change. Validating the payload would be the responsibility of the consumer (using pydantic or something similar).
(Issue retitled from "Advanced parser utility" to "Feature: Advanced parser utility")

Nr18 commented on Aug 31, 2020
My preference would be to use pydantic; having the ability to add business rule validation would be beneficial to me.
I actually ran into the issue of having to set my date field to a string in the JSON schema because the date is optional... With pydantic, you can have an `Optional[date]` without having to write an additional statement just to check that the string is actually a date.
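A hedged sketch of what that looks like (the field name is illustrative):

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel


class Order(BaseModel):
    # Accepts None or an ISO-8601 string such as "2020-08-31",
    # which pydantic parses into a datetime.date automatically.
    shipped_on: Optional[date] = None
```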
koxudaxi commented on Aug 31, 2020
I always use pydantic in Lambda for events and payloads.
Pydantic is a great solution for validation and parsing.
However, the library is a little heavy for Lambda.
Pydantic is compiled with Cython, which creates some *.so files.
I calculated these file sizes; it's about 76MB: https://pypi.org/project/pydantic/#files
I suggest Pydantic be supplied as an option.
I also think powertools should provide two types of validator (decorator): JSON Schema and Pydantic.
Users can select one.
Also, I develop a code generator that generates pydantic models from JSON Schema:
https://github.com/koxudaxi/datamodel-code-generator
If we maintain JSON Schemas, we can get pydantic models from the code generator too.
Additionally, this code generator can create models from JSON data.
Last week, I created an AWS Connect event model from an event object I got from CloudWatch Logs. It's very useful.
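The first step of that workflow might look like this (a sketch of a throwaway handler used only to capture a sample payload):

```python
import json


def handler(event, context):
    # Dump the raw event to CloudWatch Logs; the captured JSON can then
    # be fed to datamodel-code-generator or hand-modelled in pydantic.
    print(json.dumps(event))
    return {"statusCode": 200}
```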
Background
I created an experimental project that provides pydantic models for events.
Example: https://github.com/koxudaxi/pydantic-collection/blob/master/pydantic_collection/aws/sns/models.py
I often define pydantic models for events in my projects.
I would like to re-use these models across all projects.
But it's difficult to create models for all events by hand.
I hope that `aws-lambda-powertools-python` will maintain all the models.

heitorlessa commented on Aug 31, 2020
hey @koxudaxi - This is interesting! I was under the impression only the wheels were going to count (8.2M for manylinux). It's also great to hear you created all this Pydantic tooling, as I still have questions about the UX and code generation :) -- TIL.
At the moment, we're collecting customer demand on Pydantic usage within Serverless. We want to support it (see #118), but we're also mindful of justifying customer demand, Pydantic as an extra dependency, simplified UX, how much we want to abstract, and docs to ease the transition from JSON Schemas to Pydantic.
We're on the fence as to whether to create a single `validator` utility that supports dual modes (JSON Schema or Pydantic), or a separate `parser` utility focused solely on Pydantic, as the docs and usage would be largely different. Would love to hear feedback on this front -- And yes, we'd be happy to maintain models for Lambda Event Sources (only), hence a longer discussion so we can get Pydantic right without breaking our Tenets.
heitorlessa commented on Aug 31, 2020
@jplock: initial simple validator for JSON Schema with an optional data selector (envelope) using JMESPath as an extra dependency: #153
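A rough sketch of that combination, assuming the `jmespath` and `jsonschema` packages and an illustrative SQS-shaped event:

```python
import json

import jmespath
import jsonschema

# Hypothetical schema for the inner payload, not an official one.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "integer"}},
    "required": ["order_id"],
}


def handler(event, context):
    # The JMESPath expression acts as the "envelope": it selects the
    # payload out of the SQS wrapper before JSON Schema validation.
    body = jmespath.search("Records[0].body", event)
    payload = json.loads(body)
    jsonschema.validate(instance=payload, schema=ORDER_SCHEMA)
    return payload
```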
gmcrocetti commented on Sep 1, 2020
This proposal is great! It's something really useful that I've been wishing for, for a long time.
About the description, it's unclear to me why we need to create JSON schemas representing AWS event types instead of Python annotations - please don't read this as criticism, I genuinely don't know. For my use case, the killer feature of pydantic or any other "parsing" lib is that we can deal with objects and validate/add business behaviour we're unable to express in a schema. What are the respective use cases for JSON Schema and pydantic?
@heitorlessa, maybe I'm going "too far", but IMO two points need to be discussed, at least at a "design" level. The first is input parsing: it would be great to design something pluggable, e.g. if the codebase of client X is entirely written in marshmallow, are we going to force them to rewrite everything to a new standard? Maybe in a few cases, but I'm sure we can provide an interface to plug in whichever parsing lib they're used to - powertools shouldn't provide them. The second is that our utility must explicitly require the what/how to correctly parse a message: the "what" is an AWS event source and the "how" is a schema - it looks like you've already done this in your PR.
About pydantic, it'd be great to see it here, as an extra dependency, for sure. Add it to a layer and we're "done".
ran-isenberg commented on Sep 1, 2020
@gmcrocetti @koxudaxi you can see my pydantic PR: #118.
It also includes automatic envelope parsing of schemas for SQS, EventBridge, DynamoDB Streams, and custom user schemas.
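To illustrate what envelope parsing means here, a hedged sketch using plain pydantic (the `Order` payload model is hypothetical):

```python
import json
from typing import List

from pydantic import BaseModel


class Order(BaseModel):
    order_id: int


class SqsRecord(BaseModel):
    body: str


class SqsEnvelope(BaseModel):
    Records: List[SqsRecord]


def parse_orders(event: dict) -> List[Order]:
    # The envelope is the SQS wrapper; the user's model (Order) is
    # applied to each record's JSON-encoded body.
    envelope = SqsEnvelope.parse_obj(event)
    return [Order.parse_obj(json.loads(r.body)) for r in envelope.Records]
```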
koxudaxi commented on Sep 1, 2020
@heitorlessa
It's compressed. You may be surprised when you extract it.
OK, I understand what we should discuss in this phase.
Let me write about the great UX of Pydantic.
It's parsing a dict into a model.
(Of course, Pydantic can validate input data. However, JSON Schema has the same benefit.)
Lambda event objects are deeply nested structures.
It's very hard to understand the structure of each service.
Also, IDEs don't support auto-completion or type-checking on plain dicts.
(TypedDict is a better way to get static type analysis.)
Pydantic solves these problems.
We can access nested attributes in Pydantic models easily.
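For example (a hedged sketch mirroring a fragment of the SNS event shape):

```python
from typing import List

from pydantic import BaseModel, Field


class SnsMessage(BaseModel):
    # Aliases map the event's PascalCase keys onto Python attributes.
    message_id: str = Field(alias="MessageId")
    message: str = Field(alias="Message")


class SnsRecord(BaseModel):
    sns: SnsMessage = Field(alias="Sns")


class SnsEvent(BaseModel):
    records: List[SnsRecord] = Field(alias="Records")


# Instead of event["Records"][0]["Sns"]["Message"], the typed model
# gives autocomplete and type checks:
#   body = SnsEvent.parse_obj(event).records[0].sns.message
```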
I feel this is the best practice for handling Lambda event objects.
I want to hear from other "customers" too.
koxudaxi commented on Sep 1, 2020
@risenberg-cyberark
Great work!!
I will review the PR once it is unlocked.
heitorlessa commented on Oct 26, 2020
Everyone - This is now available in the 1.7.0 release.