
compute_signature should skip all transient properties not only signature #234

Open
nanoant opened this issue Oct 29, 2021 · 2 comments

nanoant commented Oct 29, 2021

JupyterLab has recently started unconditionally adding orig_nbformat: 1 to edited notebooks, since jupyterlab/jupyterlab#10118. This broke notebook trust: orig_nbformat is included by sign.compute_signature but is later stripped by v4/rwbase.strip_transient on save, so the digest computed before saving does not match the one computed when the notebook is loaded again.

sign.compute_signature should not only apply signature_removed but skip all transient metadata, to ensure that the signature is computed on exactly the same structure that will be saved to disk.
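For illustration, a minimal sketch of that idea (not the actual nbformat implementation; the wrapper name compute_signature_as_saved is made up): work on a copy of the notebook and strip the same transient metadata the v4 writer removes on save, so the in-memory digest matches the one computed after a round trip to disk.

```python
import copy

from nbformat.sign import NotebookNotary
from nbformat.v4.rwbase import strip_transient  # same helper the v4 writer applies on save


def compute_signature_as_saved(notary: NotebookNotary, nb):
    """Compute the signature of `nb` as it would look on disk.

    Works on a deep copy so the caller's in-memory notebook (which may still
    carry transient metadata such as orig_nbformat) is not modified.
    """
    cleaned = copy.deepcopy(nb)
    strip_transient(cleaned)  # drop transient metadata before signing
    return notary.compute_signature(cleaned)
```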

I am not a Jupyter developer, so I don't want to propose a PR; I don't know what the desired implementation of the fix for this problem is.

This issue is related to jupyterlab/jupyterlab#11005

Carreau changed the title from "compute_signature should skip all transient properties no only signature" to "compute_signature should skip all transient properties not only signature" on Nov 16, 2021
Carreau (Member) commented Nov 16, 2021

There seem to be other recent issues with signature/validation/saving, and yes, I agree that we should be:

  1. stricter about the fields we accept,
  2. better at documenting which fields go into the signature.

My take is that we should have a clean() method somewhere that returns a cleaned copy of the notebook data structure that can be properly signed, instead of skipping some fields. But that's my personal opinion.

That way sign stays simple and can raise if there are ever fields it does not know about.
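A rough sketch of what such a clean() could look like, assuming a flat check of the top-level keys and a hard-coded list of transient metadata keys (the names and field lists here are illustrative, not an existing nbformat API):

```python
# Hypothetical clean(): return a copy containing only known fields and raise on
# anything unexpected, so the signing code itself can stay simple.
KNOWN_TOP_LEVEL = {"cells", "metadata", "nbformat", "nbformat_minor"}
TRANSIENT_METADATA = {"orig_nbformat", "orig_nbformat_minor", "signature"}  # assumed list


def clean(nb: dict) -> dict:
    unknown = set(nb) - KNOWN_TOP_LEVEL
    if unknown:
        # Refuse to sign structures we do not fully understand.
        raise ValueError(f"unexpected notebook fields: {sorted(unknown)}")
    cleaned = {key: nb[key] for key in KNOWN_TOP_LEVEL if key in nb}
    cleaned["metadata"] = {
        key: value
        for key, value in nb.get("metadata", {}).items()
        if key not in TRANSIENT_METADATA
    }
    return cleaned
```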

Pinging @echarles, as we were talking about other signature issues this morning and this might be of interest to him.

westurner (Contributor) commented

My take is that we should have a clean() method somewhere that returns a cleaned copy of the notebook data structure that can be properly signed, instead of skipping some fields.

This sounds like a document canonicalization or normalization step. LD-proofs now specifies how to future-proof inlined JSON-LD document signatures.

From https://w3c-ccg.github.io/ld-proofs/#advanced-terminology :

Canonicalization algorithm
An algorithm that takes an input document that has more than one possible representation and always transforms it into a deterministic representation. For example, alphabetically sorting a list of items is a type of canonicalization. This process is sometimes also called normalization.

A complete example of a proof type is shown in the next example:

EXAMPLE 7

{
  "id": "https://w3id.org/security#Ed25519Signature2020",
  "type": "Ed25519VerificationKey2020",
  "canonicalizationAlgorithm": "https://w3id.org/security#URDNA2015",
  "digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
  "signatureAlgorithm": "https://w3id.org/security#ed25519"
}

From https://json-ld.github.io/rdf-dataset-canonicalization/spec/#introduction :

When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to be able to transform each of these different ways into a single, standard format. With a standard format, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.

Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.

Maybe something like URDNA2015 plus custom nbformat-specific normalizations would be ideal. Or, at least, nothing that would preclude later use of URDNA2015 (and JSON-LD, for notebook metadata and JSON-LD cell outputs)?
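As a small illustration of the canonicalization idea (this is not how nbformat computes signatures today, and sorted-keys JSON is a much weaker normalization than URDNA2015): hash a deterministic JSON serialization of an already-cleaned notebook dict, so that two notebooks with the same content always produce the same digest.

```python
import hashlib
import json


def canonical_digest(nb: dict, algorithm: str = "sha256") -> str:
    """Hash a deterministic serialization of an already-cleaned notebook dict."""
    canonical = json.dumps(
        nb,
        sort_keys=True,         # key order no longer affects the digest
        separators=(",", ":"),  # no incidental whitespace differences
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.new(algorithm, canonical).hexdigest()
```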
