How to write data to archive with MappingParser

MappingParser is a generic parser class implemented in nomad/parsing/file_parser/mapping_parser.py that handles the conversion between a data object and a Python dictionary. We refer to an instance of this class as a 'mapping parser' throughout this section. In the following, the abstract properties and methods of the mapping parser are explained, the available implementations are described, and the Mapper class, which is required to convert one mapping parser into another, is introduced.

MappingParser

The mapping parser has several abstract properties and methods; the most important ones are listed in the following:

  • filepath: path to the input file to be parsed
  • data_object: object resulting from loading the file in memory with load_file
  • data: dictionary representation of data_object
  • mapper: instance of Mapper required by convert
  • load_file: method to load the file given by filepath
  • to_dict: method to convert data_object into data
  • from_dict: method to convert data into data_object
  • convert: method to convert to another mapping parser

Depending on the inheriting class, data_object can be, for example, an XML element tree or a metainfo section. In order to convert a mapping parser into another parser, the target parser must provide a Mapper object, which we refer to simply as a mapper throughout.

In the following, we describe the currently implemented mapping parsers.

XMLParser

This is the mapping parser for XML files. It uses lxml to load the file as an element tree. The dictionary is generated by iteratively parsing the elements of the tree in to_dict. The values parsed from the element text are automatically converted to a corresponding data type. If attributes are present, the value is wrapped in a dictionary with a key given by value_key ('__value' by default), while the attribute keys are prefixed by attribute_prefix ('@' by default). The following XML:

<a>
  <b name='item1'>name</b>
  <b name='item2'>name2</b>
</a>

will be converted to:

    data = {
      'a' : {
        'b': [
          {'@name': 'item1', '__value': 'name'},
          {'@name': 'item2', '__value': 'name2'}
        ]
      }
    }

The conversion can be reversed using the from_dict method.
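As an illustrative sketch of this element-to-dictionary logic (using the standard library's xml.etree.ElementTree rather than lxml, and omitting the automatic type conversion), the nesting, '@' attribute prefixing, and '__value' wrapping can be reproduced as follows; element_to_dict is a hypothetical helper, not part of the NOMAD API:

```python
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Recursively convert an XML element into a plain dict (sketch only)."""
    result = {}
    for child in element:
        value = element_to_dict(child) if len(child) else (child.text or '')
        if child.attrib:
            # wrap the text value and prefix attribute keys with '@'
            value = {f'@{k}': v for k, v in child.attrib.items()} | {'__value': value}
        if child.tag in result:
            # repeated tags become lists, as in the example above
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml = "<a><b name='item1'>name</b><b name='item2'>name2</b></a>"
data = {'a': element_to_dict(ET.fromstring(xml))}
# data == {'a': {'b': [{'@name': 'item1', '__value': 'name'},
#                      {'@name': 'item2', '__value': 'name2'}]}}
```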

HDF5Parser

This is the mapping parser for HDF5 files. It uses h5py to load the file as an HDF5 group. Similar to XMLParser, the HDF5 datasets are iteratively parsed from the underlying groups, and if attributes are present, these are also parsed. The from_dict method is also implemented to convert a dictionary into an HDF5 group.
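A minimal sketch of the dictionary-to-HDF5 direction, assuming h5py is installed; write_dict is a hypothetical helper that mimics the recursion described above (attribute handling is omitted), using an in-memory file so nothing is written to disk:

```python
import h5py
import numpy as np

def write_dict(group, data):
    """Recursively write a nested dict into an h5py group (sketch only)."""
    for key, value in data.items():
        if isinstance(value, dict):
            # nested dicts become sub-groups
            write_dict(group.create_group(key), value)
        else:
            # everything else becomes a dataset
            group.create_dataset(key, data=value)

# in-memory HDF5 file (core driver, no backing store on disk)
f = h5py.File('example.h5', 'w', driver='core', backing_store=False)
write_dict(f, dict(g=dict(v=np.eye(2), g1=dict(d=[1, 2, 3]))))
shape = f['g/v'].shape
d_values = [int(x) for x in f['g/g1/d'][()]]
f.close()
```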

MetainfoParser

This is the mapping parser for NOMAD archive files or metainfo sections. It accepts a schema root section annotated with MappingAnnotation as data_object. create_mapper generates the actual mapper matching the given annotation_key. If a filepath is specified, it falls back on the ArchiveParser instead.

The annotation should always point to a parsed value via a path (in JMESPath format). It may optionally specify a multi-argument operator for data manipulation. In this case, specify a tuple consisting of:

  • the operator name, defined within the same scope.
  • a list of paths with the corresponding values for the operator arguments.

Similar to MSection, it can be converted to (to_dict) or from (from_dict) a Python dictionary. The other attributes remain directly accessible.

import numpy as np

from nomad.datamodel.data import ArchiveSection
from nomad.datamodel.metainfo.annotations import Mapper as MappingAnnotation
from nomad.metainfo import Quantity, SubSection
from nomad.parsing.file_parser.mapping_parser import MetainfoParser

class BSection(ArchiveSection):
    v = Quantity(type=np.float64, shape=[2, 2])
    v.m_annotations['mapping'] = dict(
        xml=MappingAnnotation(mapper='.v'),
        hdf5=MappingAnnotation(mapper=('get_v', ['.v[0].d'])),
    )

    v2 = Quantity(type=str)
    v2.m_annotations['mapping'] = dict(
        xml=MappingAnnotation(mapper='.c[0].d[1]'),
        hdf5=MappingAnnotation(mapper='g.v[-2]'),
    )

class ExampleSection(ArchiveSection):
    b = SubSection(sub_section=BSection, repeats=True)
    b.m_annotations['mapping'] = dict(
        xml=MappingAnnotation(mapper='a.b1'), hdf5=MappingAnnotation(mapper='.g1')
    )

ExampleSection.m_def.m_annotations['mapping'] = dict(
    xml=MappingAnnotation(mapper='a'), hdf5=MappingAnnotation(mapper='g')
)

p = MetainfoParser()
p.data_object = ExampleSection(b=[BSection()])
p.annotation_key = 'xml'
p.mapper
# Mapper(source=Path(path='a'....

Converting mapping parsers

The following sample Python code illustrates the mapping of the contents of an HDF5 file to an archive. First, we create a MetainfoParser object for the archive. The annotation key is set to hdf5, which generates a mapper from the hdf5 annotations defined in the section definitions; essentially, only metainfo sections and quantities with the hdf5 annotation will be mapped. The mapper contains paths for the source (HDF5) and the target (archive). The archive is then set as the archive parser's data_object. Here, the archive already contains some data, which should be merged with the data to be parsed. Next, a parser for the HDF5 data is created. We use a custom subclass of HDF5Parser which implements the get_v method referenced in the annotation of BSection.v. In this example, we do not read the data from an HDF5 file but instead generate it from a dictionary using the from_dict method. By invoking the convert method, the archive parser's data object is populated with the corresponding HDF5 data.

    class ExampleHDF5Parser(HDF5Parser):
        @staticmethod
        def get_v(value):
            return np.array(value)[1:, :2]

    archive_parser = MetainfoParser()
    archive_parser.annotation_key = 'hdf5'
    archive_parser.data_object = ExampleSection(b=[BSection(v=np.eye(2))])

    hdf5_parser = ExampleHDF5Parser()
    d = dict(
        g=dict(
            g1=dict(v=[dict(d=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))]),
            v=['x', 'y', 'z'],
            g=dict(
                c1=dict(
                    i=[4, 6],
                    f=[
                        {'@index': 0, '__value': 1},
                        {'@index': 2, '__value': 2},
                        {'@index': 1, '__value': 1},
                    ],
                    d=[dict(e=[3, 0, 4, 8, 1, 6]), dict(e=[1, 7, 8, 3, 9, 1])],
                ),
                c=dict(v=[dict(d=np.eye(3), e=np.zeros(3)), dict(d=np.ones((3, 3)))]),
            ),
        )
    )
    hdf5_parser.from_dict(d)

    hdf5_parser.convert(archive_parser)

    # >>> archive_parser.data_object
    # ExampleSection(b, b2)
    # >>> archive_parser.data_object.b[1].v
    # array([[4., 5.],
    #   [7., 8.]])
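Conceptually, each mapping step of such a conversion reads a value at a source path and assigns it at a target path. The following minimal sketch shows this over plain dictionaries; get_path and set_path are hypothetical helpers for illustration, not the NOMAD API:

```python
def get_path(data, path):
    """Read a value from a nested dict via a dotted path (sketch only)."""
    for key in path.split('.'):
        data = data[key]
    return data

def set_path(data, path, value):
    """Assign a value in a nested dict via a dotted path, creating groups."""
    keys = path.split('.')
    for key in keys[:-1]:
        data = data.setdefault(key, {})
    data[keys[-1]] = value

source = {'g': {'g1': {'v': [1, 2, 3]}}}
target = {}
# a single Map-like step: source path 'g.g1.v' -> target path 'b.v'
set_path(target, 'b.v', get_path(source, 'g.g1.v'))
# target == {'b': {'v': [1, 2, 3]}}
```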

Mapper

A mapper is necessary in order to convert a mapping parser into a target mapping parser by mapping data from the source to the target. There are three kinds of mapper, each inheriting from BaseMapper:

  • Map: maps data directly from source to target; the path to the data is given by the attribute path.
  • Evaluate: executes a function defined by function_name with the arguments given by the mapped values of the paths in function_args.
  • Mapper: allows the nesting of mappers by providing a list of mappers to its attribute mapper.

A mapper has the attributes source and target, which define the paths to the source and target data, respectively. All paths are instances of Path, with the string value of the path given by the attribute path. The value of path should follow the JMESPath specification, but it may be prefixed by '.' to indicate that the path is relative to its parent; this tells the mapper from which source to fetch the data.

    Mapper(
        source=Path(path='a.b2'), target=Path(path='b2'), mapper=[
            Mapper(
                source=Path(path='.c', parent=Path(path='a.b2')),
                target=Path(path='.c', parent=Path(path='b2')), mapper=[
                    Map(
                        target=Path(
                            path='.i', parent=Path(path='.c', parent=Path(path='b2'))
                        ),
                        path=Path(
                            path='.d', parent=Path(path='.c', parent=Path(path='a.b2'))
                        )
                    ),
                    Evaluate(
                        target=Path(
                            path='.g', parent=Path(path='.c', parent=Path(path='b2'))
                        ),
                        function_name='slice', function_args=[Path(path='a.b2.c.f.g.i')]
                    )
                ]
            )
        ],
    )
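To illustrate how a relative path is resolved against its parents, the following sketch reconstructs the absolute path of the nested Map target above; resolve is a hypothetical helper, with Path objects modeled as plain dictionaries:

```python
def resolve(path, parent=None):
    """Resolve a '.'-prefixed relative path against its parent (sketch only)."""
    if path.startswith('.'):
        if parent is None:
            raise ValueError('relative path requires a parent')
        # resolve the parent first, then append the relative segment
        return resolve(parent['path'], parent.get('parent')) + path
    return path

# Path(path='.i', parent=Path(path='.c', parent=Path(path='b2')))
absolute = resolve('.i', {'path': '.c', 'parent': {'path': 'b2'}})
# absolute == 'b2.c.i'
```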