How to define workflows
The built-in abstract workflow schema¶
Workflows are an important aspect of data as they explain how the data came to be. Let's first clarify that workflow refers to a workflow that already happened and that has produced input and output data that are linked through tasks that have been performed . This often is also referred to as data provenance or provenance graph.
The following shows the overall abstract schema for worklows that can be found
in nomad.datamodel.metainfo.workflow
(blue):
The idea is that workflows are stored in a top-level archive section along-side other sections that contain the inputs and outputs. This way the workflow or provenance graph is just additional piece of the archive that describes how the data in this (or other archives) is connected.
Let'c consider an example workflow. Imagine a geometry optimization and ground state
calculation performed by two individual DFT code runs. The code runs are stored in
NOMAD entries geom_opt.archive.yaml
and ground_state.archive.yaml
using the run
top-level section.
Example workflow¶
Here is a logical depiction of the workflow and all its tasks, inputs, and outputs.
Simple workflow entry¶
The following archive shows how to create such a workflow based on the given schema.
Here we only model the GeometryOpt
and GroundStateCalculation
as two tasks with
respective inputs and outputs that use references to entry archives of the respective
code runs.
workflow2:
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- name: relaxed system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
- name: ground state calculation of relaxed system
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
tasks:
- name: GeometryOpt
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- name: relaxed system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
- name: GroundStateCalculation
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
outputs:
- name: ground state
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
Nested workflows in one entry¶
Since a Workflow
instance is also a Tasks
instance due to inheritance, we can nest
workflows. Here we detailed the GeometryOpt
as a nested workflow:
workflow2:
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- name: relaxed system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
- name: ground state calculation of relaxed system
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
tasks:
- name: GeometryOpt
m_def: nomad.datamodel.metainfo.workflow.Workflow
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- name: relaxed system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
tasks:
- inputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/1'
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/0'
- inputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/1'
outputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/2'
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/1'
- inputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/2'
outputs:
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/3'
- section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/2'
- name: GroundStateCalculation
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
outputs:
- name: ground state
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
Nested Workflows in multiple entries¶
Typically, we want to colocate our individual workflows with their inputs and outputs.
In the case of the geometry optimization, we might want to put this into the archive of
the geometry optimization code run. So the geom_opt.archive.yaml
might contain its
own section workflow2
that only contains the GeometryOpt
workflow and uses local
references to its inputs and outputs:
workflow2:
name: GeometryOpt
inputs:
- name: input system
section: '#/run/0/system/0'
outputs:
- name: relaxed system
section: '#/run/0/system/-1'
tasks:
- inputs:
- section: '#/run/0/system/0'
outputs:
- section: '#/run/0/system/1'
- section: '#/run/0/calculation/0'
- inputs:
- section: '#/run/0/system/1'
outputs:
- section: '#/run/0/system/2'
- section: '#/run/0/calculation/1'
- inputs:
- section: '#/run/0/system/2'
outputs:
- section: '#/run/0/system/3'
- section: '#/run/0/calculation/2'
run:
- program:
name: 'VASP'
system: [{}, {}, {}]
calculation: [{}, {}, {}]
When we want to detail the complex workflow, we now need to refer to a nested workflow in
a different entry. This cannot be done directly, because Workflow
instances can only contain Task
instances and not reference them. Therefore, we added a TaskReference
section definition that can be used to create proxy instances for tasks and workflows:
workflow2:
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
outputs:
- name: relaxed system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
- name: ground state calculation of relaxed system
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
tasks:
- m_def: nomad.datamodel.metainfo.workflow.TaskReference
task: '../upload/raw/geom_opt.archive.yaml#/workflow2'
- name: GroundStateCalculation
inputs:
- name: input system
section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
outputs:
- name: ground state
section: '../upload/raw/ground_state.archive.yaml#/run/0/calculations/0'
Extending the workflow schema¶
The abstract workflow schema above allows us to build generalized tools for workflows, like workflow searches, navigation in workflow, graphical representations of workflows, etc. But, you can still augment the given section definitions with more information through inheritance. These information can be specialized references to denote inputs and outputs, can be additional workflow or task parameters, and much more.
In this example, we created a special workflow section definition GeometryOptimization
that defines a parameter threshold
and an additional reference to the final
calculation of the optimization:
definitions:
sections:
GeometryOptimizationWorkflow:
base_section: nomad.datamodel.metainfo.workflow.Workflow
quantities:
threshold:
type: float
unit: eV
final_calculation:
type: nomad.datamodel.metainfo.simulation.calculation.Calculation
workflow2:
m_def: GeometryOptimizationWorkflow
final_calculation: '#/run/0/calculation/-1'
threshold: 0.029
name: GeometryOpt
inputs:
...