Addressable data store (aka CID store) #5715

Draft
wants to merge 61 commits into
base: master

pditommaso (Member) commented Jan 27, 2025

Tentative implementation of an addressable data store (a very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

  • The CID store is specified by workflow.data.store.location
  • The workflow hash is computed from the workflow and parameter descriptions
  • Workflow, task and output metadata are stored in <cid.store.location>/.meta
  • References to other CID metadata take the form cid://<workflow_hash|task_hash>/<output_target_path>
  • A CID NIO filesystem provides access to data based on CID URLs
  • The nextflow cid command can log, show and get lineage from CID store metadata
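
As an illustration of the reference format in the list above, here is a minimal Java sketch that splits a cid:// reference into its hash and relative target path. The class `CidRef` and its members are hypothetical names made up for this example, not part of Nextflow, and the sketch assumes the simple form cid://<hash>/<path> with no fragment or query.

```java
/** Hypothetical value class for a reference of the form cid://<hash>/<relative path>. */
final class CidRef {
    static final String SCHEME = "cid://";

    final String hash;   // workflow or task hash
    final String path;   // relative output target path, empty if absent

    CidRef(String hash, String path) {
        if (hash == null || hash.isEmpty()) {
            throw new IllegalArgumentException("Missing hash");
        }
        this.hash = hash;
        this.path = path == null ? "" : path;
    }

    /** Split a reference at the first '/' after the scheme. */
    static CidRef parse(String ref) {
        if (ref == null || !ref.startsWith(SCHEME)) {
            throw new IllegalArgumentException("Not a CID reference: " + ref);
        }
        String rest = ref.substring(SCHEME.length());
        int slash = rest.indexOf('/');
        if (slash < 0) {
            return new CidRef(rest, "");
        }
        return new CidRef(rest.substring(0, slash), rest.substring(slash + 1));
    }

    @Override
    public String toString() {
        return path.isEmpty() ? SCHEME + hash : SCHEME + hash + "/" + path;
    }
}
```

The parsing is done on the raw string rather than with java.net.URI because a hash that starts with a digit is not a valid host name for that class, so getHost() could return null.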

Known Limitations:

  • For outputs published to absolute paths or URLs that are not subfolders of the outputDir, we cannot infer the relative output target path, so they are not currently tracked in the CID store. We could create a hash for the parent directory of the URL or absolute path and use it as the relative folder.

Signed-off-by: Paolo Di Tommaso
@pditommaso pditommaso marked this pull request as draft January 27, 2025 13:15

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 February 10, 2025 21:46
pditommaso (Member, Author) commented:

@jorgee apologies, can the latest changes be made as a PR against this branch? That way it will be much simpler for me to understand what's new.

jorgee (Contributor) commented Feb 13, 2025

> @jorgee apologies, can the latest changes be made as a PR against this branch? That way it will be much simpler for me to understand what's new.

I have reverted the changes in this branch and opened a new PR, #5787.

pditommaso and others added 2 commits March 24, 2025 11:46
Signed-off-by: jorgee
Co-authored-by: jorgee
Signed-off-by: Jorge Ejarque
Signed-off-by: Ben Sherman
Co-authored-by: Ben Sherman
Co-authored-by: Paolo Di Tommaso
pditommaso (Member, Author) commented Apr 2, 2025

Some quick wins discussed this morning with @jorgee

  • Use Instant objects instead of strings in model objects
  • Unify the nomenclature, using taskRun and workflowRun instead of publishBy, runBy, etc.
  • Use an #outputs fragment on the run CID instead of a separate result CID
  • Use DataOutput in place of WorkflowOutput and TaskOutput

Comment on lines +174 to +180
void annotations(Map value) {
    setOption('annotations', value)
}

void annotations(Closure value) {
    setOption('annotations', value)
}
Member:

TODO: decide whether to use tags or phase it out in favor of annotations

jorgee and others added 6 commits April 5, 2025 17:26
pditommaso (Member, Author) commented:

Made a few changes to clean up and simplify some common patterns.

As a general comment, try reducing the average cyclomatic complexity by using smaller methods and simpler if-blocks with the if-guard pattern (an example here).

Also, the test coverage should be increased; large parts are not tested.
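
For readers unfamiliar with the if-guard pattern mentioned above, here is a minimal Java sketch of the idea. It is not taken from the PR code: the class `GuardExample` and both methods are hypothetical and exist only to contrast a nested version with an equivalent one that uses early returns.

```java
/** Hypothetical example contrasting nested conditionals with guard clauses. */
final class GuardExample {

    /** Nested version: every condition adds one level of indentation. */
    static String nested(String name, Long size) {
        String result;
        if (name != null) {
            if (!name.isEmpty()) {
                if (size != null && size >= 0) {
                    result = name + " (" + size + " bytes)";
                } else {
                    result = name;
                }
            } else {
                result = "<unnamed>";
            }
        } else {
            result = "<unnamed>";
        }
        return result;
    }

    /** Guard version: unusual cases return early, the main case stays flat. */
    static String guarded(String name, Long size) {
        if (name == null || name.isEmpty()) {
            return "<unnamed>";
        }
        if (size == null || size < 0) {
            return name;
        }
        return name + " (" + size + " bytes)";
    }
}
```

Both methods return the same result for every input; the guarded one has fewer branches to follow and no mutable local variable.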

[Image: Screenshot 2025-04-05 at 22 30 16]

pditommaso (Member, Author) commented Apr 5, 2025

A few more comments:

  1. The use of run in DataOutput is ambiguous; taskRun or workflowRun should be used instead. I guess the problem is that it is used for both kinds of output. In that case, consider adding both taskRun and workflowRun.
  2. We should think about how to unify the model for inputs and outputs. Currently List<DataPath>, List<Parameter> and Map<String, Object> are all used.
  3. In the same vein, it would be nice to unify mainScriptFile and otherScriptFile into a single collection, scriptFiles.
  4. It would be useful to make the usage of /outputs and #outputs for tasks symmetric with their usage for workflows.

        }
    }
    catch (IllegalArgumentException e) {
        log.warn("Can't read CID history file: ${FilesEx.toUriString(this.path)}", e.message)
Member, Author:

This should be reviewed so that it works without the need for locks.

Contributor:

I followed the same approach as the executions log file, since two executions could write to the log. The other option is to convert .history into a folder with one file per run; the history log would then be the list of the contents of those files.

Member, Author:

> The other option is to convert .history into a folder with one file per run; the history log would then be the list of the contents of those files.

I was thinking the same.
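
The one-file-per-run idea discussed in this thread could look roughly like the Java sketch below. It is only an illustration under stated assumptions: the class `HistoryDir`, its methods and the file naming scheme are hypothetical, and nothing here reflects how the PR ended up implementing the history.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical history store: a directory holding one small file per run. */
final class HistoryDir {
    private final Path dir;

    HistoryDir(Path dir) {
        this.dir = dir;
    }

    /** Each run writes its own file, so concurrent runs never share a file. */
    void record(String runId, String line) {
        try {
            Files.createDirectories(dir);
            // CREATE_NEW fails if the file already exists instead of overwriting it
            Files.write(dir.resolve(runId), line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The history is the content of all files, ordered by file name. */
    List<String> read() {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.list(dir)) {
            List<String> lines = new ArrayList<>();
            for (Path file : files.sorted().collect(Collectors.toList())) {
                lines.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Ordering here depends entirely on the file names, so a real implementation would need a naming scheme that sorts chronologically (for example a timestamp prefix). A reader could also observe a file while it is still being written; writing to a temporary name and moving it into place would be one way to avoid that.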

}

protected String storeWorkflowRun() {
    final normalizer = new PathNormalizer(session.workflowMetadata)
Member, Author:

This can likely be a class attribute, to avoid creating a new instance for each task (see below).

    long size
    Instant createdAt
    Instant modifiedAt
    Map annotations
Member, Author:

Let's add a description for these fields.

@pditommaso pditommaso force-pushed the cid-store branch 2 times, most recently from 939840d to 5692b67 April 7, 2025 10:35
jorgee and others added 2 commits April 11, 2025 15:09

pditommaso (Member, Author) commented Apr 11, 2025

This could introduce some unexpected side effects with symlinks. We may need to introduce a specific method for it.

if (path instanceof RealPathAware) {
    path = path.toRealPath()
}
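
To make the symlink concern concrete, the following Java sketch shows how the standard `Path.toRealPath` behaves on a symbolic link, with and without `LinkOption.NOFOLLOW_LINKS`. The class `RealPathDemo` is hypothetical, and `RealPathAware` from the snippet above is not modelled here; this only demonstrates the java.nio behaviour the comment is worried about.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

/** Hypothetical demo of how toRealPath treats a symbolic link. */
final class RealPathDemo {

    /**
     * Creates target.txt and a link.txt pointing at it in a fresh temp directory.
     * Returns the resolved file name when links are followed, then when they are not.
     */
    static String[] resolveBothWays() {
        try {
            Path dir = Files.createTempDirectory("realpath-demo");
            Path target = Files.createFile(dir.resolve("target.txt"));
            Path link = Files.createSymbolicLink(dir.resolve("link.txt"), target);

            // Default behaviour: the link is replaced by the file it points to
            String followed = link.toRealPath().getFileName().toString();
            // With NOFOLLOW_LINKS the link itself is kept in the result
            String kept = link.toRealPath(LinkOption.NOFOLLOW_LINKS).getFileName().toString();

            return new String[] { followed, kept };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the default call, a path that was published as a symlink is silently rewritten to its target, which is the kind of side effect a dedicated method could make explicit.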
