How to reference HDF5
The NOMAD schemas and processed data system is designed to describe and manage intricate hierarchies of connected data. This is ideal for metadata and many small data quantities, but it does not work for large quantities. Quantities are atomic and are always managed as a whole; there is currently no functionality to stream or slice large quantities. Consequently, tools that produce or work with such data cannot scale.
HDF5Reference
Large data quantities should be managed within HDF5 raw files. These can be files that are either produced during processing (e.g. created by a parser) or uploaded with the large quantities already in them. To describe these data, schemas and processed data can include references into HDF5 raw files. This makes it possible to describe large data quantities with metadata that is defined in schemas and contained in the processed data.
The data type HDF5Reference is a regular type that can be used for quantities in schemas. Its values are similar to reference values and contain the HDF5 file and a path to a group or field in that file, e.g. ../upload/raw/large_data.hdf5#group/large_field.
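To make the shape of such a value concrete, here is a minimal sketch that splits a reference string into its file part and its in-file path. The helper name `split_hdf5_reference` and the `HDF5Target` tuple are hypothetical and not part of NOMAD; the sketch only assumes that the two parts are separated by `#`, as in the example value above.

```python
from typing import NamedTuple


class HDF5Target(NamedTuple):
    """Hypothetical container for the two parts of a reference value."""

    file_path: str      # path to the HDF5 raw file
    dataset_path: str   # path to a group or field inside that file


def split_hdf5_reference(reference: str) -> HDF5Target:
    """Split '<file>#<path>' into the file and the in-file path.

    Assumption: '#' separates the file from the path, as in the
    example value shown in this document.
    """
    file_path, separator, dataset_path = reference.partition("#")
    dataset_path = dataset_path.strip("/")
    if not separator or not file_path or not dataset_path:
        raise ValueError(f"not a '<file>#<path>' reference: {reference!r}")
    return HDF5Target(file_path, dataset_path)


target = split_hdf5_reference("../upload/raw/large_data.hdf5#group/large_field")
print(target.file_path)
print(target.dataset_path)
```

With a library such as h5py, the two parts could then be used to open the file and look up the group or field by its path; that step is left out here to keep the sketch free of dependencies.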
For HDF5 files that are generated during processing, it is good practice to structure the HDF5 file in line with the schema. E.g. an HDF5Reference quantity with the sub-section path data.process.log should be stored in a field named log in the sub-group process that is part of the root group data.
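The naming convention above can be sketched as a small mapping from a dotted sub-section path to a group path and a field name. The functions `hdf5_location` and `build_reference` are hypothetical helpers written for illustration; they are not NOMAD API, and the raw file name is taken from the example value used earlier.

```python
from typing import Tuple


def hdf5_location(section_path: str) -> Tuple[str, str]:
    """Map a dotted sub-section path to (group path, field name).

    All but the last segment become nested groups; the last segment
    is the field that holds the data.
    """
    parts = [part for part in section_path.split(".") if part]
    if not parts:
        raise ValueError("section path must not be empty")
    *groups, field = parts
    return "/".join(groups), field


def build_reference(raw_file: str, section_path: str) -> str:
    """Build a '<file>#<path>' value that mirrors the schema structure."""
    group_path, field = hdf5_location(section_path)
    in_file_path = f"{group_path}/{field}" if group_path else field
    return f"{raw_file}#{in_file_path}"


group_path, field = hdf5_location("data.process.log")
print(group_path, field)
print(build_reference("../upload/raw/large_data.hdf5", "data.process.log"))
```

Keeping this mapping in a single helper means that a parser writes the HDF5 file and the reference value from the same path, so the two cannot drift apart.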
In future versions of NOMAD, processed data might be maintained partially or in full in HDF5 files. Structuring HDF5 files and processed data alike might simplify a later migration.
NOMAD clients (e.g. the NOMAD UI) can pick up on these HDF5Reference quantities and provide respective functionality (e.g. showing an H5Web view).
Attention
This part of the documentation is still a work in progress.
Metadata for large quantities
Attention
This will be implemented and documented soon.