Workflow outputs (third preview) #5909

Open
wants to merge 15 commits into
base: master

Conversation

bentsherman
Member

@bentsherman bentsherman commented Mar 20, 2025

This PR implements the third preview of the workflow output definition for Nextflow 25.04.

Changes are described in the docs, copied here for convenience:

  • The publish: section can only be specified in the entry workflow.

  • Workflow outputs in the publish: section are assigned instead of using the >> operator. The output name must be a valid identifier.

  • By default, output files are published to the base output directory, rather than a subdirectory corresponding to the output name.

  • The syntax for dynamic publish paths has changed. Instead of defining a closure that returns a closure with the path directive, the outer closure should use the >> operator to publish individual files.

  • The mapper index directive has been removed. Use a map operator in the workflow body instead.

Changes not described in the docs:

  • The onWorkflowPublish trace event has been modified to include the output name. It is emitted once for each output, rather than once for each channel value, and it is emitted regardless of whether the index file is enabled.
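
To make the change in event granularity concrete, here is a small Java sketch. It is not Nextflow code: the class, the helper names, and the choice to pass the collected list as the event value are all assumptions made for illustration. Only the `onWorkflowPublish` name and its `(String, Object)` shape come from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PublishEventSketch {

    // Previous behaviour as described above: one event per channel value,
    // with no output name attached. (Hypothetical helper, not Nextflow API.)
    static List<String> emitPerValue(Map<String, List<Object>> outputs) {
        List<String> events = new ArrayList<>();
        for (List<Object> values : outputs.values()) {
            for (Object value : values) {
                events.add("onWorkflowPublish(" + value + ")");
            }
        }
        return events;
    }

    // New behaviour as described above: one event per output, carrying its name.
    // Assumption: the event value is the full list of published values.
    static List<String> emitPerOutput(Map<String, List<Object>> outputs) {
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, List<Object>> entry : outputs.entrySet()) {
            events.add("onWorkflowPublish(" + entry.getKey() + ", " + entry.getValue() + ")");
        }
        return events;
    }
}
```

With two outputs holding three values and one value respectively, the first helper records four events and the second records two.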

@bentsherman bentsherman requested review from a team as code owners March 20, 2025 21:44
@bentsherman bentsherman requested a review from pditommaso March 20, 2025 21:44

netlify bot commented Mar 20, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit a65f5c1
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67f049199cc8c50008b905f9

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
*/
void onWorkflowPublish(Object value){}
void onWorkflowPublish(String name, Object value){}
Member Author

@pditommaso regarding this change to onWorkflowPublish

Aside from the fact that it's a preview feature, there is a bigger issue I wanted to raise about the TraceObserver: the overloads that we add for backwards compatibility don't actually work.

Even if we add something like this:

void onWorkflowPublish(Object value) {
    onWorkflowPublish(null, value)
}

If I build a plugin with Nextflow 24.10 and try to use it with 25.04, it will fail with an error like this:

ERROR ~ Receiver class nextflow.validation.ValidationObserver does not define or inherit an implementation of the resolved method 'abstract void onWorkflowPublish(java.lang.String,java.lang.Object)' of interface nextflow.trace.TraceObserver.

This is because the custom trace observer gets compiled against the 24.10 version of TraceObserver and run against the 25.04 version at runtime, but it doesn't receive the new method overload. I think it's a limitation of the Java/Groovy runtime.

So there is no point in adding these extra overloads. Plugins built with an older Nextflow will break no matter what we do.

See: nextflow-io/nf-schema#90

Signed-off-by: Ben Sherman <[email protected]>
Member

@pditommaso pditommaso left a comment

While I find the syntax improved compared to the previous iteration, I think this version dilutes the original idea of decoupling output publishing from the final output structure by having an intermediate model, represented by the "publish targets" that were defined at both the process and sub-workflow level.

In this version this is essentially not possible any more, and the output needs to be wired channel by channel in the main workflow definition if I'm understanding correctly.

Is there an nf-core pipeline adopting this version as a reference?

```
}
}
}
```

The inner closure will be applied to each file in the channel value, in this case `sample.fastq_1` and `sample.fastq_2`.
Each `>>` specifies a *source file* and *publish target*. The source file should be a file or collection of files, and the publish target should be a directory or file name. If the publish target ends with a slash, it is treated as the directory in which source files are published. Otherwise, it is treated as the target filename of a source file. Only files that are published with the `>>` operator are saved to the output directory.
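
The trailing-slash rule can be sketched in a few lines of Java. This is only an illustration of the rule as worded above, not the Nextflow implementation: the class name, the `resolve` helper, and the `results` output directory are hypothetical, and the sketch handles a single source file rather than a collection.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class PublishTargetSketch {

    // Resolve where one source file lands under the output directory.
    // Hypothetical helper: mirrors the documented rule, nothing more.
    static Path resolve(Path outputDir, Path source, String target) {
        if (target.endsWith("/")) {
            // Directory target: the source keeps its own file name.
            return outputDir.resolve(target).resolve(source.getFileName()).normalize();
        }
        // File target: the target string is the published file name.
        return outputDir.resolve(target).normalize();
    }

    public static void main(String[] args) {
        Path out = Paths.get("results");
        Path src = Paths.get("work", "ab", "s1_1.fastq");
        System.out.println(resolve(out, src, "fastq/"));
        System.out.println(resolve(out, src, "fastq/sample1_R1.fastq"));
    }
}
```

Under these assumptions, a target of `fastq/` keeps the source file name inside `results/fastq`, while a target of `fastq/sample1_R1.fastq` renames the file on publish.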
Member

Wasn't the publish target the output, e.g. samples in this example? This looks to me more like a "publish directory" or just an "output directory".

Member Author

I have changed the meaning of publish target in this iteration to match the overall syntactic/semantic changes.

What was previously a "publish target" like samples is now simply an "output" or "output declaration".

Publish target now refers to the right-hand side of a publish statement, e.g. 'fastq/' in sample.fastq_1 >> 'fastq/'.

Essentially I have moved the concept of "publish target" into the path directive.

Basically every real-world use case I've seen when talking to users requires this fine-grained level of publishing, because of how they like to organize their output directory.

@bentsherman
Member Author

I think this version dilutes the original idea of decoupling output publishing from the final output structure by having an intermediate model, represented by the "publish targets" that were defined at both the process and sub-workflow level.

In this version this is essentially not possible any more, and the output needs to be wired channel by channel in the main workflow definition if I'm understanding correctly.

I've gone back and forth on this throughout each iteration. I was initially skeptical about making the user wire the outputs all the way to the top, because I thought it would be a ton of work for little value.

Two things made me change my mind:

  1. In practice, workflows will have very few outputs. For example, even though rnaseq can run dozens of tools, it really only has a few outputs -- "samples", "summary" (e.g. multiqc), and maybe "counts matrix". You might have to join several outputs from different tools, but the rest of the wiring is pretty simple.

  2. The original problems with publishDir are that (1) you have no high-level visibility of the workflow outputs and (2) you can't model output metadata, only files. Publishing channels instead of glob patterns solves (2), but if we still allow publishing from processes and subworkflows, we haven't solved (1).

As a user, I should be able to see all of my outputs in one place and easily trace them to upstream sources, which I can't do if any subcomponent can "contribute" to an output. It is analogous to people using params anywhere in their pipeline, instead of only in the entry workflow.

It does require writing a bit more code, but the improved readability is worth it. And I would point out that we spend much more time reading code than writing code, so we should design the language accordingly.

There is still an intermediate model, but instead of "publish target" it is called simply an "output" or "output declaration". It has similar syntax/semantics to a parameter.

Is there an nf-core pipeline adopting this version as a reference?

nf-core is reluctant to adopt any features as long as we keep them in "preview", which creates a minor chicken-and-egg problem for us 😅

I have updated my fetchngs PR to use this preview, so you can see how it would look there.

Successfully merging this pull request may close these issues.

In a channel that is directed to a publish target, Files listed inside a Map object are not published.
2 participants