How to define tabular data
To import your data from a `.csv` or Excel file, NOMAD provides three distinct (and separate) ways, each of which comes with unique options for importing and interacting with your data. To better understand how to use NOMAD's tabular parser to import your data, follow the three sections below. In each section you can find a commented sample schema with a step-by-step guide on how to import your tabular data.

By default (implicitly), the tabular parser parses the data into the same NOMAD entry where the data file is uploaded. This can also be defined explicitly by putting the corresponding annotations under `current_entry` (check the examples below). In addition, the tabular parser can be set to parse the data into one or more new entries. For this, the proper annotations should be appended to the `new_entry` annotation in your schema file.
The two main components of any tabular parser schema are:

1) implementing the correct base section(s), and
2) providing a `data_file` quantity with the correct `m_annotations`.

Please bear in mind that schema files should 1) follow the NOMAD naming convention (i.e. `My_Name.archive.yaml`), and 2) be accompanied by your data file in order for NOMAD to parse them.
In the examples provided below, an Excel file is assumed to contain all the data, as both NOMAD and Excel support multi-sheet data manipulation and imports. Note that the Excel file name in each schema should match the name of the Excel data file; when using a `.csv` data file instead, it can be replaced by the `.csv` file name.
`TableData` (and any other section inheriting from `TableData`) has a customizable checkbox quantity (i.e. `fill_archive_from_datafile`) to turn the tabular parser on or off. If you do not want the parser to run every time you make a change to your archive data, you can achieve this by unchecking the checkbox. It is customizable in the sense that, if you do not wish to see this checkbox at all, you can configure the `hide` parameter of the section's `m_annotations` to hide the checkbox. This in turn sets the parser to run every time you save your archive.
Be cautious though! Turning on the tabular parser (or checking the box) when saving your data will cause the parser to lose/overwrite your manually entered data!
Column mode
The following sample schema creates one quantity from an entire column of an Excel file (column mode). For example, suppose several rows in an Excel sheet contain information on a chemical product (e.g. `purity` in one column). In order to list all the purities under the column `purity` and import them into NOMAD, you can use the following schema by substituting `My_Quantity` with any name of your choice (e.g. `Purity`), `tabular-parser.data.xlsx` with the name of the csv/Excel file where the data lies, and `My_Sheet/My_Column` with the sheet_name/column_name of your targeted data. `Tabular_Parser` can also be changed to any arbitrary name of your choice.
Important notes:

- `shape: ['*']` under `My_Quantity` is essential to parse the entire column of the data file.
- The `data_file` quantity can have any arbitrary name (e.g. `xlsx_file`).
- `My_Quantity` can also be defined within another sub-section (see the next sample schema).
- Use `current_entry` and append `column_to_sections` to specify which sub-section(s) are to be filled in this mode. Leaving this field empty causes the parser to parse the entire schema under column mode.
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# column mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        - nomad.parsing.tabular.TableData
      quantities:
        data_file:
          type: str
          m_annotations:
            # The tabular_parser annotation will treat the values of this quantity as files.
            # It will try to interpret the files and fill quantities in this section
            # (and sub_sections) with the column data of .csv or .xlsx files.
            tabular_parser:
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
              # column_sections:  # Relative paths to the sub_sections that are supposed to be
              #                   # filled from the given excel/csv file. Leaving this empty
              #                   # causes the normalizer to parse the entire schema under
              #                   # column mode.
        My_Quantity:
          type: str
          shape: ['*']
          m_annotations:
            # The tabular annotation defines a mapping to column headers used in tabular data files.
            tabular:
              # `name` defines where the data for this quantity is taken from. If the data is in
              # an excel file, specify the sheet_name followed by a forward slash and the
              # column_name to target the desired quantity. If only a column name is provided,
              # the first sheet in the excel file (or the .csv file) is assumed to contain the
              # targeted data.
              name: My_Sheet/My_Column
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser.data.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using column mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see a new entry created and appended to the `Process data` section. Go to the entry page, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
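Conceptually, column mode maps one whole spreadsheet column onto one list-shaped quantity. The following pure-Python sketch is only an illustration of that mapping (it is not NOMAD code and the file content is made up for the example):

```python
import csv
import io

# A stand-in for a csv data file with a `purity` column.
csv_content = """# lines starting with '#' are skipped (the `comment` option)
product,purity
A,99.9
B,98.5
C,97.0
"""

# Drop comment lines, then read the remaining rows.
lines = [l for l in csv_content.splitlines() if not l.startswith('#')]
rows = list(csv.DictReader(io.StringIO('\n'.join(lines))))

# Column mode: the entire `purity` column becomes one list-valued
# quantity (hence `shape: ['*']` in the schema above).
purity = [row['purity'] for row in rows]
print(purity)  # ['99.9', '98.5', '97.0']
```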
Row mode
The sample schema provided below creates separate instances of a repeated section from each row of an Excel file (row mode). For example, suppose an Excel sheet holds information on a chemical product (e.g. `name` in one column), and each row contains one instance of that chemical product. Since each row is separate from the others, in order to create instances of the same product out of all rows and import them into NOMAD, you can use the following schema by substituting `My_Subsection`, `My_Section`, and `My_Quantity` with any appropriate names (e.g. `Substance`, `Chemical_product`, and `Name`, respectively).
Important notes:

- This schema demonstrates how to import data within a sub-section of another sub-section, meaning the targeted quantity does not necessarily have to go into the main `quantities`.
- Setting `row_to_sections` under `current_entry` signals that for each row in the sheet_name (provided in `My_Quantity`), one instance of the corresponding (sub-)section (in this example `My_Section`, as it has the `repeats` option set to true) will be appended. Please bear in mind that if this mode is selected, all other quantities in this sub-section should exist in the same sheet_name.
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# row mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        - nomad.parsing.tabular.TableData  # Here we specify that we need to acquire the data from a .xlsx or a .csv file
      quantities:
        data_file:
          type: str
          m_annotations:
            tabular_parser:
              current_entry:
                row_to_sections:  # Reference to where the targeted (sub-)section lies within this example schema file
                  - My_Subsection/My_Section
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
      sub_sections:
        My_Subsection:
          section:
            sub_sections:
              My_Section:
                repeats: true  # The repeats option set to true means there can be multiple instances of this section
                section:
                  quantities:
                    My_Quantity:
                      type: str
                      m_annotations:
                        # The tabular annotation defines a mapping to column headers used in tabular data files.
                        tabular:
                          name: My_Sheet/My_Column  # sheet name and column name of the targeted data in the csv/xlsx file
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser.data.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using row mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see as many new sub-sections created and appended to the repeating section as there are rows in your Excel/csv file. Go to the entry page of the new entries, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
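In contrast to column mode, row mode creates one section instance per spreadsheet row. A minimal pure-Python sketch of that mapping (again only an illustration, not NOMAD internals, with made-up data):

```python
import csv
import io

csv_content = """name,purity
A,99.9
B,98.5
"""

rows = list(csv.DictReader(io.StringIO(csv_content)))

# Row mode: each row becomes one instance of a repeating sub-section,
# with scalar quantities filled from that row's cells.
sections = [{'My_Quantity': row['name']} for row in rows]
print(sections)  # [{'My_Quantity': 'A'}, {'My_Quantity': 'B'}]
```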
Entry mode
The following sample schema creates one entry for each row of an Excel file (entry mode). For example, suppose an Excel sheet holds information on a chemical product (e.g. `name` in one column), and each row contains one instance of that chemical product. Since each row is separate from the others, in order to create multiple archives of the same product out of all rows and import them into NOMAD, you can use the following schema by substituting `My_Quantity` with any appropriate name (e.g. `Name`).
Important notes:

- To create new entries based on your entire schema, set `row_to_entries` to `- root`. Otherwise, you can provide the relative paths of specific sub-section(s) in your schema from which to create new entries.
- Leaving `row_to_entries` empty causes the parser to parse the entire schema under column mode!
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# entry mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        # To create entries from each row in the excel file, the base section should
        # inherit from `nomad.parsing.tabular.TableData`.
        - nomad.parsing.tabular.TableData
      quantities:
        data_file:
          type: str
          m_annotations:
            tabular_parser:
              new_entry:
                - row_to_entries:  # Reference to where the targeted (sub-)section lies within this example schema file
                    - root
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
        My_Quantity:
          type: str
          m_annotations:
            tabular:
              name: My_Sheet/My_Column
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser-entry-mode.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using entry mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see as many new entries created and appended to the `Process data` section as there are rows in your Excel/csv file. Go to the entry page of the new entries, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
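Entry mode follows the same row-wise split as row mode, but each row ends up in its own archive (entry) rather than in a repeating sub-section of the current entry. An illustrative pure-Python sketch with made-up data:

```python
import csv
import io

csv_content = """name,purity
A,99.9
B,98.5
"""

rows = list(csv.DictReader(io.StringIO(csv_content)))

# Entry mode: one self-contained archive-like dict per row; in NOMAD each
# of these would become a separate entry under `Process data`.
entries = [{'data': {'My_Quantity': row['name']}} for row in rows]
print(len(entries))  # 2
```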
Advanced options to use/set in the tabular parser:

- If you want to populate your schema from multiple Excel/csv files, you can define multiple `data_file` quantities annotated with `tabular_parser` at the root level of your schema (the root level of your schema is where you inherit from the `TableData` class under `base_sections`). Each individual `data_file` quantity can contain a list of sub-sections that are expected to be filled using one or all of the modes mentioned above. Check the `MyOverallSchema` section in the Complex Schema example below. It contains two `data_file` quantities, each of which contains separate instructions to populate different parts of the schema. `data_file_1` is responsible for filling `MyColSubsection`, while `data_file_2` fills all sub-sections listed in `row_to_sections` under `current_entry` and in `row_to_entries` under `new_entry`.
- When using entry mode, you can create a custom quantity to hold a reference to each new entry generated by the parser. Check the `MyEntrySubsection` section in the Complex Schema example below. The `refs_quantity` is a `ReferenceEditQuantity` with type `#/MyEntry`, which tells the parser to populate this quantity with a reference to the fresh entry of type `MyEntry`. You may also use the `tabular_pattern` annotation to explicitly set the names of the fresh entries.
- If you have multiple columns with exactly the same name in your Excel/csv file, you can parse them using row mode. For this, define a repeating sub-section that handles your data in different rows, and inside each row define another repeating sub-section that contains your repeating columns. Check the `MySpecialRowSubsection` section in the Complex Schema example below. `data_file_2` contains a repeating column called `row_quantity_2`, and we want to create a section out of each row and each column. This is done by creating one row of type `MySpecialRowSubsection`, populating the `MyRowQuantity3` quantity from the `row_quantity_3` column in the csv file, and appending each column of `row_quantity_2` to `MyRowQuantity2`.
```yaml
definitions:
  name: Complex Schema
  sections:
    MyEntry:  # MyEntry section has only one quantity `MyEntryQuantity`
      quantities:
        MyEntryQuantity:
          type: str
          m_annotations:
            tabular:
              name: entry_quantity
    MyColumn:  # MyColumn section has only one quantity `MyColumnQuantity`
      quantities:
        MyColumnQuantity:
          type: np.float64
          shape: ['*']
          m_annotations:
            tabular:
              name: column_quantity
    MyRow:  # MyRow section contains a repeating sub-section with one quantity `MyRowQuantity`
      sub_sections:
        MyRowCollection:
          repeats: true
          section:
            quantities:
              MyRowQuantity:
                type: str
                m_annotations:
                  tabular:
                    name: row_quantity
    MyOverallSchema:  # root level of the schema (inheriting from the `TableData` class)
      base_sections:
        - nomad.parsing.tabular.TableData
      m_annotations:
        eln:
      quantities:
        data_file_1:  # This data file quantity is responsible for filling the `MyColSubsection`
                      # sub-section, as denoted in column_sections.
          type: str
          m_annotations:
            tabular_parser:
              sep: ','
              comment: '#'
              column_sections:  # list of sub-sections to be parsed by data_file_1 in column mode
                - MyColSubsection
        data_file_2:  # This data file quantity is responsible for filling the `MyRowSubsection`,
                      # `MySpecialRowSubsection`, and `MyEntrySubsection` sub-sections, as
                      # denoted by the row and entry options below.
          type: str
          m_annotations:
            tabular_parser:
              current_entry:
                row_to_sections:
                  - MyRowSubsection/MyRowCollection
                  - MySpecialRowSubsection
              new_entry:
                - row_to_entries:
                    - MyEntrySubsection
              sep: ','
              comment: '#'
              target_sub_section:  # (deprecated) list of sub-sections to be parsed by data_file_2 in row mode
                - MyRowSubsection/MyRowCollection
                - MySpecialRowSubsection
              entry_sections:  # list of sub-sections to be parsed by data_file_2 in entry mode
                - MyEntrySubsection
        MyRootQuantity:  # This quantity lives at the root level and is parsed in column mode
          type: str
          shape: ['*']
          m_annotations:
            tabular:
              name: root_quantity
      sub_sections:
        MyColSubsection:
          section: '#/MyColumn'
        MyRowSubsection:
          section: '#/MyRow'
        MyEntrySubsection:
          repeats: true
          section:
            quantities:  # a quantity for the entry section that holds a reference to the
                         # entries created by the parser
              refs_quantity:
                type: '#/MyEntry'
                m_annotations:
                  eln:
                    component: ReferenceEditQuantity
                    entry_name: '#/MyEntry/MyEntryQuantity'
                  tabular_pattern:  # use this option to define the names of the new entries
                                    # created by the parser
                    name: my_entry
        MySpecialRowSubsection:
          repeats: true
          section:
            quantities:
              MyRowQuantity3:
                type: str
                m_annotations:
                  tabular:
                    name: row_quantity_3
            sub_sections:
              MyRowCollection2:
                repeats: true
                section:
                  quantities:
                    MyRowQuantity2:
                      type: str
                      m_annotations:
                        tabular:
                          name: row_quantity_2
data:
  m_def: MyOverallSchema  # instantiating the root level of the schema
  data_file_1: data_file_1.csv
  data_file_2: data_file_2.csv
```
Here are all parameters for the two annotations, `tabular_parser` and `tabular`.
Tabular Parser

Instructs NOMAD to treat a string valued scalar quantity as a file path and interprets the contents of this file as tabular data. Supports both `.csv` and Excel files.

name | type | description
---|---|---
`comment` | `str` | The character denoting commented lines in `.csv` files. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`sep` | `str` | The character used to separate cells in a `.csv` file. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`skiprows` | `int` | Number of `.csv` file rows that are skipped. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`separator` | `str` | An alias for `sep`.
`target_sub_section` | `List[str]` | Deprecated; will be removed in a future release. Use `row_sections` instead. A list of paths to the repeating sub-sections where the tabular quantities are to be filled from individual rows of the Excel/csv file (i.e. in row mode). Each path is a `/`-separated list of nested sub-sections. The targeted sub-sections will be considered when mapping table rows to quantities. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file. Default: `[]`.
`mode` | `str` | Optional; will be removed in a future release. Either `column`, `row`, or `entry`. With `column`, the whole column is mapped into a quantity (which needs to be a list). With `row`, each row (and its cells) is mapped into an instance of a repeating sub-section, where each section represents a row (quantities need to be scalars). With `entry`, a new entry is created and populated from each row (and its cells), where all quantities should remain scalars. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file. Default: `column`. Options: `row`, `column`, `root`, `entry`.
`current_entry` | `CurrentEntryOptions` | Append a list of `row_sections` and `column_sections` here to parse the tabular data into the same NOMAD entry. Default: `[]`.
`new_entry` | `List[NewEntryOptions]` | Append a list of `row_sections`, `column_sections`, and `row_to_entries` here to parse the tabular data into new entries. Default: `[]`.
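`comment`, `sep`, and `skiprows` correspond to the `pandas.read_csv` keyword arguments of the same names. The stdlib sketch below roughly emulates their combined effect on a small made-up csv (pandas' actual `comment` handling also truncates partial-line comments), so you can see which lines survive:

```python
import csv
import io

raw = """# a commented line
col_a,col_b
1,2
# another comment
3,4
"""

comment, skiprows = '#', 0

# Emulate comment='#' (drop commented lines) and skiprows (drop leading rows),
# as pandas.read_csv would do before splitting cells with sep=','.
lines = [l for l in raw.splitlines() if not l.startswith(comment)][skiprows:]
table = list(csv.reader(io.StringIO('\n'.join(lines)), delimiter=','))
print(table)  # [['col_a', 'col_b'], ['1', '2'], ['3', '4']]
```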
Tabular

Allows to map a quantity to a row of a tabular data file. Should only be used in conjunction with `tabular_parser`.

name | type | description
---|---|---
`name` | `str` | The column name that should be mapped to the annotated quantity. Has to be the same string that is used in the header, i.e. the first `.csv` line or the first Excel file row. For Excel files with multiple sheets, the name can have the form `<sheet name>/<column name>`; otherwise, only the first sheet is used. Has to be applied to the quantity that a column should be mapped to.
`unit` | `str` | The unit of the value in the file. Has to be compatible with the annotated quantity's unit and will be used to automatically convert the value. If this is not defined, the values will not be converted. Has to be applied to the quantity that a column should be mapped to.
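The `<sheet name>/<column name>` convention can be resolved with a simple split on `/`; if no separator is present, the first sheet is assumed. A small illustrative helper (not part of NOMAD's API; the default sheet name is an assumption for the example):

```python
def resolve_target(name, default_sheet='Sheet1'):
    """Split a `tabular` name into (sheet, column), defaulting to the first sheet."""
    if '/' in name:
        sheet, column = name.split('/', 1)
        return sheet, column
    return default_sheet, name

print(resolve_target('My_Sheet/My_Column'))  # ('My_Sheet', 'My_Column')
print(resolve_target('My_Column'))           # ('Sheet1', 'My_Column')
```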
Plot Annotation
This annotation can be used to add a plot to a section or quantity. Example:

```python
class Evaporation(MSection):
    m_def = Section(a_plot={
        'label': 'Temperature and Pressure',
        'x': 'process_time',
        'y': ['./substrate_temperature', './chamber_pressure'],
        'config': {
            'editable': True,
            'scrollZoom': False
        }
    })
    process_time = Quantity(type=float, shape=['*'], unit='s')
    substrate_temperature = Quantity(type=float, shape=['*'], unit='K')
    chamber_pressure = Quantity(type=float, shape=['*'], unit='Pa')
```
You can create multi-line plots by using lists for the properties `y` (and `x`). You can either have multiple sets of `y`-values over a single set of `x`-values, or pairs of `x` and `y` values. For this purpose, the annotation properties `x` and `y` can reference a single quantity or a list of quantities.

For repeating sub-sections, the section instance can be selected with an index, e.g. `sub_section_name/2/parameter_name`, or with a slice notation `start:stop`, where negative values index from the end of the array, e.g. `sub_section_name/1:-5/parameter_name`.
The interactive examples of the plot annotations can be found here.
name | type | description
---|---|---
`label` | `str` | Is passed to plotly to define the label of the plot.
`x` | `Union[List[str], str]` | A path or list of paths to the x-axis values. Each path is a `/`-separated list of sub-section and quantity names that leads from the annotation section to the quantity. Repeating sub-sections are indexed between two `/`s with an integer or a slice `start:stop`.
`y` | `Union[List[str], str]` | A path or list of paths to the y-axis values. Each path is a `/`-separated list of sub-section and quantity names that leads from the annotation section to the quantity. Repeating sub-sections are indexed between two `/`s with an integer or a slice `start:stop`.
`lines` | `List[dict]` | A list of dicts passed as traces to plotly to configure the lines of the plot. See https://plotly.com/javascript/reference/scatter/ for details.
`layout` | `dict` | A dict passed as layout to plotly to configure the plot layout. See https://plotly.com/javascript/reference/layout/ for details.
`config` | `dict` | A dict passed as config to plotly to configure the plot functionality. See https://plotly.com/javascript/configuration-options/ for details.
Built-in base sections for ELNs
Coming soon ...
Custom normalizers
For custom schemas, you might want to add custom normalizers. All files are parsed and normalized when they are uploaded or changed. The NOMAD metainfo Python interface allows you to add functions that are called when your data is normalized.
Here is an example:
```python
from nomad.datamodel import EntryData, ArchiveSection
from nomad.metainfo.metainfo import Quantity, Datetime, SubSection


class Sample(ArchiveSection):
    added_date = Quantity(type=Datetime)
    formula = Quantity(type=str)
    sample_id = Quantity(type=str)

    def normalize(self, archive, logger):
        super(Sample, self).normalize(archive, logger)
        if self.sample_id is None:
            self.sample_id = f'{self.added_date}--{self.formula}'


class SampleDatabase(EntryData):
    samples = SubSection(section=Sample, repeats=True)
```
To add a `normalize` function, your section has to inherit from `ArchiveSection`, which provides the base for this functionality. Now you can override the `normalize` function and add your own behavior. Make sure to call the `super` implementation properly to support schemas with multiple inheritance.
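The reason the `super` call matters is Python's method resolution order (MRO): with cooperative `super()` calls, every base class's `normalize` runs exactly once, even under multiple inheritance. A plain-Python sketch of that pattern (the class names here are stand-ins, not NOMAD classes):

```python
class ArchiveSectionLike:
    """Stand-in for ArchiveSection: the end of the normalize chain."""
    def normalize(self, archive, logger):
        pass

class WithId(ArchiveSectionLike):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        archive.append('WithId')

class WithDate(ArchiveSectionLike):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        archive.append('WithDate')

class Sample(WithId, WithDate):
    def normalize(self, archive, logger):
        # Calling super() once walks the whole MRO: Sample -> WithId -> WithDate.
        super().normalize(archive, logger)
        archive.append('Sample')

calls = []
Sample().normalize(calls, None)
print(calls)  # ['WithDate', 'WithId', 'Sample']
```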
If we parse an archive like this:
```yaml
data:
  m_def: 'examples.archive.custom_schema.SampleDatabase'
  samples:
    - formula: NaCl
      added_date: '2022-06-18'
```
we will get a final normalized archive that contains our data like this:
```json
{
  "data": {
    "m_def": "examples.archive.custom_schema.SampleDatabase",
    "samples": [
      {
        "added_date": "2022-06-18T00:00:00+00:00",
        "formula": "NaCl",
        "sample_id": "2022-06-18 00:00:00+00:00--NaCl"
      }
    ]
  }
}
```
Third-party integration

NOMAD offers integration with third-party ELN providers, simplifying the process of connecting and interacting with external platforms. The three main external ELN solutions integrated into NOMAD are elabFTW, Labfolder, and Chemotion. The process of data retrieval and data mapping onto NOMAD's schema varies for each of these third-party ELN providers, as each inherently allows for certain ways of communicating with its database. Below you can find a how-to guide on importing your data from each of these external repositories.
elabFTW integration

elabFTW is part of the ELN Consortium and supports exporting experimental data in the ELN file format. The ELNFileFormat is a zipped file that contains the metadata of your elabFTW project along with all other data associated with your experiments.
How to import elabFTW data into NOMAD:

Go to your elabFTW experiment and export your project as an `ELN Archive`. Save the file to your filesystem under your preferred name and location (keep the `.eln` extension intact). To parse your elabFTW data into NOMAD, go to the upload page of NOMAD and create a new upload. On the overview page, upload your exported file (either by drag-and-dropping it into the "click or drop files" box or by navigating to the path where you stored the file). This triggers NOMAD's parser to create as many new entries in this upload as there are experiments in your elabFTW project.

You can inspect the parsed data of each of your entries (experiments) by going to the `DATA` tab of each entry page. Under the `Entry` column, click on the `data` section. A new lane titled `ElabFTW Project Import` should now be visible. Under this section, (some of) the metadata of your project is listed. There are two sub-sections: 1) `experiment_data`, and 2) `experiment_files`. The `experiment_data` section contains detailed information about the given elabFTW experiment, such as links to external resources and extra fields. The `experiment_files` section is a list of sub-sections containing metadata and additional information about the files associated with the experiment.
Labfolder integration

Labfolder provides API endpoints to interact with your ELN data. NOMAD makes API calls to retrieve, parse, and map the data from your Labfolder instance/database to a NOMAD schema. To do so, the necessary information is listed below:

- `project_url`: The URL address of the Labfolder project. It should follow this pattern: `https://your-labfolder-server/eln/notebook#?projectIds=your-project-id`. This is used to set up the server and initialize the NOMAD schema.
- `labfolder_email`: The email (user credential) to authenticate and log in the user. Important note: this information is discarded once the authentication process is finished.
- `password`: The password (user credential) to authenticate and log in the user. Important note: this information is discarded once the authentication process is finished.
How to import Labfolder data into NOMAD:

To get your data transferred to NOMAD, first go to NOMAD's upload page and create a new upload. Then click on the `CREATE ENTRY` button. Select a name for your entry and pick `Labfolder Project Import` from the `Built-in schema` dropdown menu. Then click on `CREATE`. This creates an entry where you can insert your user information. Fill in the `Project url`, `Labfolder email`, and `password` fields. Once completed, click on the save icon in the top-right corner of the screen. This triggers NOMAD's parser to populate the schema of the current ELN. Now the metadata and all files of your Labfolder project should be populated in this entry.
The `elements` section lists all the data and files in your project. There are six main data types returned by Labfolder's API: `DATA`, `FILE`, `IMAGE`, `TABLE`, `TEXT`, and `WELLPLATE`. A `DATA` element is a special Labfolder element where the data is structured in JSON format. Every `DATA` element in NOMAD has a special quantity called `labfolder_data`, which is a flattened and aggregated version of the data content. An `IMAGE` element contains information on any image stored in your Labfolder project. A `TEXT` element contains the data of any text field in your Labfolder project.
Chemotion integration
Coming soon