[RFC] YAMLGenerateSchema support for producing YAML schemas

Objective

It would be great to have YAML Schema support based on the existing YAML IO parser.

Motivation

While working on an LLVM-based tool, I had to deal with input YAML files that have a complex structure. Because all of the parsing code is written with YAMLTraits.h and is spread across different files, it is quite difficult to see the full structure of such a file. It would also be convenient for users of the tool, who write these input files, to get hints about the format from their IDE. And if the schema is generated from the YAML parser itself, it does not have to be updated by hand whenever the file structure changes. This could be useful for seeing the full structure of the input YAML files of other LLVM-based tools that use the existing YAML parser, for example clang-format and clang-tidy.

How It Works

Reading the current code in YAMLTraits.h, I found two classes: one for YAML input and another for output. My idea was to create another class derived from yaml::IO, with the same interface, that simply dumps the schema into some raw_ostream. Such a derived class has access to all keys and types of the current mapping.
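To make the idea concrete, here is a minimal, self-contained sketch of the pattern. It does not use LLVM at all: the `IO`, `SchemaIO`, `MappingTraits`, `Person` and `generateSchema` names below are hypothetical stand-ins that only mirror the general shape of a traits-based mapping, and the JSON-Schema-like keywords in the output (`type`, `properties`, `required`) are an assumption, not necessarily what the patch emits.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical visitor interface (NOT llvm::yaml::IO). A mapping function
// talks only to this interface, so a new subclass can observe every key.
class IO {
public:
  virtual ~IO() = default;
  virtual void field(const std::string &Key, const std::string &Type,
                     bool Required) = 0;

  template <typename T> void mapRequired(const char *Key, T &) {
    field(Key, typeName<T>(), /*Required=*/true);
  }
  template <typename T> void mapOptional(const char *Key, T &) {
    field(Key, typeName<T>(), /*Required=*/false);
  }

private:
  // Map a C++ type to a schema type name; bool is checked before integers.
  template <typename T> static std::string typeName() {
    if constexpr (std::is_same_v<T, bool>)
      return "boolean";
    else if constexpr (std::is_integral_v<T>)
      return "integer";
    else if constexpr (std::is_floating_point_v<T>)
      return "number";
    else
      return "string";
  }
};

// Records keys and types instead of reading or writing any values.
class SchemaIO : public IO {
public:
  void field(const std::string &Key, const std::string &Type,
             bool Required) override {
    assert(!Key.empty() && "mapping keys must be non-empty");
    Properties.push_back({Key, Type});
    if (Required)
      RequiredKeys.push_back(Key);
  }

  std::string str() const {
    std::string Out = "type: object\nproperties:\n";
    for (const auto &[Key, Type] : Properties)
      Out += "  " + Key + ":\n    type: " + Type + "\n";
    if (!RequiredKeys.empty()) {
      Out += "required:\n";
      for (const std::string &Key : RequiredKeys)
        Out += "  - " + Key + "\n";
    }
    return Out;
  }

private:
  std::vector<std::pair<std::string, std::string>> Properties;
  std::vector<std::string> RequiredKeys;
};

// Illustrative user type and its mapping, written once against IO.
struct Person {
  std::string Name;
  int Age = 0;
  bool Admin = false;
};

template <typename T> struct MappingTraits;
template <> struct MappingTraits<Person> {
  static void mapping(IO &Io, Person &P) {
    Io.mapRequired("name", P.Name);
    Io.mapOptional("age", P.Age);
    Io.mapOptional("admin", P.Admin);
  }
};

// Run the mapping against a default-constructed value to collect the schema.
template <typename T> std::string generateSchema() {
  T Dummy{};
  SchemaIO Io;
  MappingTraits<T>::mapping(Io, Dummy);
  return Io.str();
}
```

Calling `generateSchema<Person>()` in this sketch yields an object schema with three properties (`name: string`, `age: integer`, `admin: boolean`) and `name` listed as required. The point is that the mapping function is written once and never needs to know whether it is being used to parse, to print, or to describe the format.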

How It’s Structured

I prepared a patch that adds this new derived class (I named it GenerateSchema and moved it to a separate file so as not to increase compilation time where this functionality is not needed). Internally, the class does two things: first, it builds a tree of schemas for the YAML chunks of the input file; then it builds a tree of nodes to output the schema to a file in YAML format, using the existing YAMLParser.
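A rough sketch of what such a two-step structure could look like follows. It is only an illustration under assumptions: `SchemaNode`, `scalar`, `sequenceOf`, `addField`, `emit` and `exampleSchema` are hypothetical names, not taken from the patch, and the second step here writes the text directly into a string rather than building YAML nodes through LLVM's YAML machinery as the patch is described to do.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Step 1 data structure: a tree describing the shape of the YAML document.
struct SchemaNode {
  enum class Kind { Scalar, Mapping, Sequence };
  Kind K = Kind::Scalar;
  std::string ScalarType;           // used when K == Scalar
  std::vector<std::string> Keys;    // used when K == Mapping
  std::vector<SchemaNode> Children; // one per key, or the single item type
};

SchemaNode scalar(std::string Type) {
  SchemaNode N;
  N.K = SchemaNode::Kind::Scalar;
  N.ScalarType = std::move(Type);
  return N;
}

SchemaNode sequenceOf(SchemaNode Item) {
  SchemaNode N;
  N.K = SchemaNode::Kind::Sequence;
  N.Children.push_back(std::move(Item));
  return N;
}

SchemaNode mapping() {
  SchemaNode N;
  N.K = SchemaNode::Kind::Mapping;
  return N;
}

void addField(SchemaNode &Map, std::string Key, SchemaNode Child) {
  Map.Keys.push_back(std::move(Key));
  Map.Children.push_back(std::move(Child));
}

// Step 2: walk the tree and print it as indented YAML text.
void emit(const SchemaNode &N, unsigned Indent, std::string &Out) {
  const std::string Pad(Indent, ' ');
  switch (N.K) {
  case SchemaNode::Kind::Scalar:
    Out += Pad + "type: " + N.ScalarType + "\n";
    break;
  case SchemaNode::Kind::Sequence:
    Out += Pad + "type: array\n" + Pad + "items:\n";
    emit(N.Children.front(), Indent + 2, Out);
    break;
  case SchemaNode::Kind::Mapping:
    Out += Pad + "type: object\n" + Pad + "properties:\n";
    for (std::size_t I = 0; I < N.Keys.size(); ++I) {
      Out += Pad + "  " + N.Keys[I] + ":\n";
      emit(N.Children[I], Indent + 4, Out);
    }
    break;
  }
}

// Illustrative input shape: a name plus a list of checks, each with an id.
std::string exampleSchema() {
  SchemaNode Check = mapping();
  addField(Check, "id", scalar("integer"));

  SchemaNode Root = mapping();
  addField(Root, "name", scalar("string"));
  addField(Root, "checks", sequenceOf(Check));

  std::string Out;
  emit(Root, 0, Out);
  return Out;
}
```

Keeping the schema tree separate from the output step means nested mappings and sequences can be collected in whatever order the traits code visits them, and the printing logic stays a simple recursive walk.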

Testing

The unit test for this functionality dumps the schema of a simple YAML mapping and then compares it with a previously known expected output.

Dear llvm community, please give me your feedback.

For the record, a corresponding PR is here: [llvm][Support] Add YAMLSchemeGen for producing YAML Schemes from YAMLTraits by tgs-sc · Pull Request #133284 · llvm/llvm-project · GitHub.

Just some comments after glancing at this.

YAMLGenerateSchema support for producing YAML schemas

This has the same energy as “let’s add an MLIR dialect to describe the description of MLIR dialects” and it was probably not immediately clear to readers what you were proposing.

(though it is an accurate description from what I can tell)

This sounds like a great idea that people would find useful.

So could you reply with a short example showing how the parser code gets translated into a schema? You can gloss over the details; the point is to let us see how much of a difference there is between reading C++ parser code and reading a schema.

Bonus points if you have a screenshot from one of the IDEs showing how the schema gets used.

I think you can also give the schema and a file to a verifier tool? Maybe we could root out malformed files in tests that way.

For testing my first thought is: is it possible to discover every “schema” currently in use in llvm and see whether this generator works with them?

For instance, we have a lot of files for use with obj2yaml and yaml2obj. You could generate a schema for those, and then use it to verify existing files that use that schema.

This could give you a lot of corner cases that you can then cover in unit tests using smaller examples.