How to define tabular data
To import your data from a `.csv` or Excel file, NOMAD provides three distinct (and separate) ways, each of which comes with unique options for importing and interacting with your data. To better understand how to use NOMAD's tabular parser to import your data, follow the three sections below. In each section you can find a commented sample schema with a step-by-step guide on how to import your tabular data.

By default (implicitly), the tabular parser parses the data into the same NOMAD entry where the data file is uploaded. This can also be defined explicitly by putting the corresponding annotations under `current_entry` (check the examples below). In addition, the tabular parser can be set to parse the data into one or more new entries. For this, the proper annotations should be appended to the `new_entry` annotation in your schema file.
The two main components of any tabular parser schema are:

1) implementing the correct base section(s), and
2) providing a `data_file` quantity with the correct `m_annotations`.

Please bear in mind that schema files should 1) follow the NOMAD naming convention (i.e. `My_Name.archive.yaml`), and 2) be accompanied by your data file in order for NOMAD to parse them.
In the examples provided below, an Excel file is assumed to contain all the data, as both NOMAD and Excel support multi-sheet data manipulation and imports. Note that the Excel file name in each schema should match the name of the Excel data file; when using a `.csv` data file instead, it can be replaced by the `.csv` file name.
`TableData` (and any other section inheriting from `TableData`) has a customizable checkbox quantity (i.e. `fill_archive_from_datafile`) to turn the tabular parser on or off. If you do not want the parser to run every time you make a change to your archive data, you can achieve this by unchecking the checkbox. It is customizable in the sense that, if you do not wish to see this checkbox at all, you can configure the `hide` parameter of the section's `m_annotations` to hide the checkbox. This in turn sets the parser to run every time you save your archive.
Be cautious though! Turning on the tabular parser (or checking the box) when saving your data will cause the parser to lose/overwrite your manually entered data!
Column mode
The following sample schema creates one quantity from an entire column of an Excel file (column mode). For example, suppose several rows in an Excel sheet contain information on a chemical product (e.g. `purity` in one column). In order to list all the purities under the column `purity` and import them into NOMAD, you can use the following schema by substituting `My_Quantity` with any name of your choice (e.g. `Purity`), `tabular-parser.data.xlsx` with the name of the csv/Excel file where the data lies, and `My_Sheet/My_Column` with the sheet_name/column_name of your targeted data. `Tabular_Parser` can also be changed to any arbitrary name of your choice.
Important notes:

- `shape: ['*']` under `My_Quantity` is essential to parse the entire column of the data file.
- The `data_file` quantity can have any arbitrary name (e.g. `xlsx_file`).
- `My_Quantity` can also be defined within another sub-section (see the next sample schema).
- Use `current_entry` and append `column_to_sections` to specify which sub-section(s) are to be filled in this mode. Leaving this field empty causes the parser to parse the entire schema under column mode.
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# column mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        - nomad.parsing.tabular.TableData
      quantities:
        data_file:
          type: str
          m_annotations:
            # The tabular_parser annotation will treat the values of this quantity as files.
            # It will try to interpret the files and fill quantities in this section
            # (and sub_sections) with the column data of .csv or .xlsx files.
            tabular_parser:
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
              # column_sections:  # Relative paths to the sub_sections that are supposed to be
              #                   # filled from the given excel/csv file. Leaving this empty
              #                   # causes the normalizer to parse the entire schema under
              #                   # column mode.
        My_Quantity:
          type: str
          shape: ['*']
          m_annotations:
            # The tabular annotation defines a mapping to column headers used in tabular data files.
            tabular:
              # `name` defines where the data for this quantity is taken from. If the data is in
              # an excel file, specify the sheet_name followed by a forward slash and the
              # column_name to target the desired quantity. If only a column name is provided,
              # the first sheet in the excel file (or the .csv file) is assumed to contain the
              # targeted data.
              name: My_Sheet/My_Column
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser.data.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using column mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see a new entry created and appended to the `Process data` section. Go to the entry page, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
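Conceptually, column mode maps one whole spreadsheet column onto one list-shaped quantity. The following pure-Python sketch is only an illustration of that mapping (it is not NOMAD code and the file content is made up for the example):

```python
import csv
import io

# A stand-in for a csv data file with a `purity` column.
csv_content = """# lines starting with '#' are skipped (the `comment` option)
product,purity
A,99.9
B,98.5
C,97.0
"""

# Drop comment lines, then read the remaining rows.
lines = [l for l in csv_content.splitlines() if not l.startswith('#')]
rows = list(csv.DictReader(io.StringIO('\n'.join(lines))))

# Column mode: the entire `purity` column becomes one list-valued
# quantity (hence `shape: ['*']` in the schema above).
purity = [row['purity'] for row in rows]
print(purity)  # ['99.9', '98.5', '97.0']
```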
Row mode
The sample schema provided below creates separate instances of a repeated section from each row of an Excel file (row mode). For example, suppose an Excel sheet holds information on a chemical product (e.g. `name` in one column), and each row contains one instance of that chemical product. Since each row is separate from the others, in order to create instances of the same product out of all rows and import them into NOMAD, you can use the following schema by substituting `My_Subsection`, `My_Section`, and `My_Quantity` with any appropriate names (e.g. `Substance`, `Chemical_product`, and `Name`, respectively).
Important notes:

- This schema demonstrates how to import data within a sub-section of another sub-section, meaning the targeted quantity does not necessarily have to go into the main `quantities`.
- Setting `row_to_sections` under `current_entry` signals that for each row in the sheet_name (provided in `My_Quantity`), one instance of the corresponding (sub-)section (in this example `My_Section`, as it has the `repeats` option set to true) will be appended. Please bear in mind that if this mode is selected, all other quantities in this sub-section should exist in the same sheet_name.
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# row mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        - nomad.parsing.tabular.TableData  # Here we specify that we need to acquire the data from a .xlsx or a .csv file
      quantities:
        data_file:
          type: str
          m_annotations:
            tabular_parser:
              current_entry:
                row_to_sections:  # Reference to where the targeted (sub-)section lies within this example schema file
                  - My_Subsection/My_Section
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
      sub_sections:
        My_Subsection:
          section:
            sub_sections:
              My_Section:
                repeats: true  # The repeats option set to true means there can be multiple instances of this section
                section:
                  quantities:
                    My_Quantity:
                      type: str
                      m_annotations:
                        # The tabular annotation defines a mapping to column headers used in tabular data files.
                        tabular:
                          name: My_Sheet/My_Column  # sheet name and column name of the targeted data in the csv/xlsx file
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser.data.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using row mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see as many new sub-sections created and appended to the repeating section as there are rows in your Excel/csv file. Go to the entry page of the new entries, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
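In contrast to column mode, row mode creates one section instance per spreadsheet row. A minimal pure-Python sketch of that mapping (again only an illustration, not NOMAD internals, with made-up data):

```python
import csv
import io

csv_content = """name,purity
A,99.9
B,98.5
"""

rows = list(csv.DictReader(io.StringIO(csv_content)))

# Row mode: each row becomes one instance of a repeating sub-section,
# with scalar quantities filled from that row's cells.
sections = [{'My_Quantity': row['name']} for row in rows]
print(sections)  # [{'My_Quantity': 'A'}, {'My_Quantity': 'B'}]
```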
Entry mode
The following sample schema creates one entry for each row of an Excel file (entry mode). For example, suppose an Excel sheet holds information on a chemical product (e.g. `name` in one column), and each row contains one instance of that chemical product. Since each row is separate from the others, in order to create multiple archives of the same product out of all rows and import them into NOMAD, you can use the following schema by substituting `My_Quantity` with any appropriate name (e.g. `Name`).
Important notes:

- To create new entries based on your entire schema, set `row_to_entries` to `- root`. Otherwise, you can provide the relative paths of specific sub-section(s) in your schema from which to create new entries.
- Leaving `row_to_entries` empty causes the parser to parse the entire schema under column mode!
```yaml
# This schema is specially made for demonstration of implementing a tabular parser with
# entry mode.
definitions:
  name: 'Tabular Parser example schema'
  sections:
    Tabular_Parser:  # The main section that contains the quantities to be read from an excel file.
                     # This name can be changed freely.
      base_sections:
        # To create entries from each row in the excel file, the base section should
        # inherit from `nomad.parsing.tabular.TableData`.
        - nomad.parsing.tabular.TableData
      quantities:
        data_file:
          type: str
          m_annotations:
            tabular_parser:
              new_entry:
                - row_to_entries:  # Reference to where the targeted (sub-)section lies within this example schema file
                    - root
              comment: '#'  # Skip lines in the csv or excel file that start with the sign `#`
        My_Quantity:
          type: str
          m_annotations:
            tabular:
              name: My_Sheet/My_Column
data:
  m_def: Tabular_Parser  # this is a reference to the section definition above
  data_file: tabular-parser-entry-mode.xlsx  # name of the excel/csv file to be uploaded along with this schema yaml file
```
Step-by-step guide to import your data using entry mode:

After writing your schema file, you can create a new upload in NOMAD (or use an existing upload) and upload both your schema file and the Excel/csv file together (or zipped) to your NOMAD project. On the `Overview` page of your NOMAD upload, you should be able to see as many new entries created and appended to the `Process data` section as there are rows in your Excel/csv file. Go to the entry page of the new entries, click on the `DATA` tab (at the top of the screen), and in the `Entry` lane your data is populated under the `data` sub-section.
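Entry mode follows the same row-wise split as row mode, but each row ends up in its own archive (entry) rather than in a repeating sub-section of the current entry. An illustrative pure-Python sketch with made-up data:

```python
import csv
import io

csv_content = """name,purity
A,99.9
B,98.5
"""

rows = list(csv.DictReader(io.StringIO(csv_content)))

# Entry mode: one self-contained archive-like dict per row; in NOMAD each
# of these would become a separate entry under `Process data`.
entries = [{'data': {'My_Quantity': row['name']}} for row in rows]
print(len(entries))  # 2
```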
Advanced options to use/set in the tabular parser:

- If you want to populate your schema from multiple Excel/csv files, you can define multiple `data_file` quantities annotated with `tabular_parser` at the root level of your schema (the root level of your schema is where you inherit from the `TableData` class under `base_sections`). Each individual `data_file` quantity can contain a list of sub-sections that are expected to be filled using one or all of the modes mentioned above. Check the `MyOverallSchema` section in the Complex Schema example below. It contains two `data_file` quantities, each of which contains separate instructions to populate different parts of the schema. `data_file_1` is responsible for filling `MyColSubsection`, while `data_file_2` fills all sub-sections listed in `row_to_sections` under `current_entry` and in `row_to_entries` under `new_entry`.
- When using entry mode, you can create a custom quantity to hold a reference to each new entry generated by the parser. Check the `MyEntrySubsection` section in the Complex Schema example below. The `refs_quantity` is a `ReferenceEditQuantity` with type `#/MyEntry`, which tells the parser to populate this quantity with a reference to the fresh entry of type `MyEntry`. You may also use the `tabular_pattern` annotation to explicitly set the names of the fresh entries.
- If you have multiple columns with exactly the same name in your Excel/csv file, you can parse them using row mode. For this, define a repeating sub-section that handles your data in different rows, and inside each row define another repeating sub-section that contains your repeating columns. Check the `MySpecialRowSubsection` section in the Complex Schema example below. `data_file_2` contains a repeating column called `row_quantity_2`, and we want to create a section out of each row and each column. This is done by creating one row of type `MySpecialRowSubsection`, populating the `MyRowQuantity3` quantity from the `row_quantity_3` column in the csv file, and appending each column of `row_quantity_2` to `MyRowQuantity2`.
```yaml
definitions:
  name: Complex Schema
  sections:
    MyEntry:  # MyEntry section has only one quantity `MyEntryQuantity`
      quantities:
        MyEntryQuantity:
          type: str
          m_annotations:
            tabular:
              name: entry_quantity
    MyColumn:  # MyColumn section has only one quantity `MyColumnQuantity`
      quantities:
        MyColumnQuantity:
          type: np.float64
          shape: ['*']
          m_annotations:
            tabular:
              name: column_quantity
    MyRow:  # MyRow section contains a repeating sub-section with one quantity `MyRowQuantity`
      sub_sections:
        MyRowCollection:
          repeats: true
          section:
            quantities:
              MyRowQuantity:
                type: str
                m_annotations:
                  tabular:
                    name: row_quantity
    MyOverallSchema:  # root level of the schema (inheriting from the `TableData` class)
      base_sections:
        - nomad.parsing.tabular.TableData
      m_annotations:
        eln:
      quantities:
        data_file_1:  # This data file quantity is responsible for filling the `MyColSubsection`
                      # sub-section, as denoted in column_sections.
          type: str
          m_annotations:
            tabular_parser:
              sep: ','
              comment: '#'
              column_sections:  # list of sub-sections to be parsed by data_file_1 in column mode
                - MyColSubsection
        data_file_2:  # This data file quantity is responsible for filling the `MyRowSubsection`,
                      # `MySpecialRowSubsection`, and `MyEntrySubsection` sub-sections, as
                      # denoted by the row and entry options below.
          type: str
          m_annotations:
            tabular_parser:
              current_entry:
                row_to_sections:
                  - MyRowSubsection/MyRowCollection
                  - MySpecialRowSubsection
              new_entry:
                - row_to_entries:
                    - MyEntrySubsection
              sep: ','
              comment: '#'
              target_sub_section:  # (deprecated) list of sub-sections to be parsed by data_file_2 in row mode
                - MyRowSubsection/MyRowCollection
                - MySpecialRowSubsection
              entry_sections:  # list of sub-sections to be parsed by data_file_2 in entry mode
                - MyEntrySubsection
        MyRootQuantity:  # This quantity lives at the root level and is parsed in column mode
          type: str
          shape: ['*']
          m_annotations:
            tabular:
              name: root_quantity
      sub_sections:
        MyColSubsection:
          section: '#/MyColumn'
        MyRowSubsection:
          section: '#/MyRow'
        MyEntrySubsection:
          repeats: true
          section:
            quantities:  # a quantity for the entry section that holds a reference to the
                         # entries created by the parser
              refs_quantity:
                type: '#/MyEntry'
                m_annotations:
                  eln:
                    component: ReferenceEditQuantity
                    entry_name: '#/MyEntry/MyEntryQuantity'
                  tabular_pattern:  # use this option to define the names of the new entries
                                    # created by the parser
                    name: my_entry
        MySpecialRowSubsection:
          repeats: true
          section:
            quantities:
              MyRowQuantity3:
                type: str
                m_annotations:
                  tabular:
                    name: row_quantity_3
            sub_sections:
              MyRowCollection2:
                repeats: true
                section:
                  quantities:
                    MyRowQuantity2:
                      type: str
                      m_annotations:
                        tabular:
                          name: row_quantity_2
data:
  m_def: MyOverallSchema  # instantiating the root level of the schema
  data_file_1: data_file_1.csv
  data_file_2: data_file_2.csv
```
Here are all parameters for the two annotations, `tabular_parser` and `tabular`.
Tabular Parser

Instructs NOMAD to treat a string valued scalar quantity as a file path and interprets the contents of this file as tabular data. Supports both `.csv` and Excel files.

name | type | description
---|---|---
`comment` | `str` | The character denoting commented lines in `.csv` files. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`sep` | `str` | The character used to separate cells in a `.csv` file. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`skiprows` | `int` | Number of `.csv` file rows that are skipped. This is passed to pandas to parse the file. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file.
`separator` | `str` | An alias for `sep`.
`target_sub_section` | `List[str]` | Deprecated; will be removed in a future release. Use `row_sections` instead. A list of paths to the repeating sub-sections where the tabular quantities are to be filled from individual rows of the Excel/csv file (i.e. in row mode). Each path is a `/`-separated list of nested sub-sections. The targeted sub-sections will be considered when mapping table rows to quantities. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file. Default: `[]`.
`mode` | `str` | Optional; will be removed in a future release. Either `column`, `row`, or `entry`. With `column`, the whole column is mapped into a quantity (which needs to be a list). With `row`, each row (and its cells) is mapped into an instance of a repeating sub-section, where each section represents a row (quantities need to be scalars). With `entry`, a new entry is created and populated from each row (and its cells), where all quantities should remain scalars. Has to be used to annotate the quantity that holds the path to the `.csv` or Excel file. Default: `column`. Options: `row`, `column`, `root`, `entry`.
`current_entry` | `CurrentEntryOptions` | Append a list of `row_sections` and `column_sections` here to parse the tabular data into the same NOMAD entry. Default: `[]`.
`new_entry` | `List[NewEntryOptions]` | Append a list of `row_sections`, `column_sections`, and `row_to_entries` here to parse the tabular data into new entries. Default: `[]`.
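`comment`, `sep`, and `skiprows` correspond to the `pandas.read_csv` keyword arguments of the same names. The stdlib sketch below roughly emulates their combined effect on a small made-up csv (pandas' actual `comment` handling also truncates partial-line comments), so you can see which lines survive:

```python
import csv
import io

raw = """# a commented line
col_a,col_b
1,2
# another comment
3,4
"""

comment, skiprows = '#', 0

# Emulate comment='#' (drop commented lines) and skiprows (drop leading rows),
# as pandas.read_csv would do before splitting cells with sep=','.
lines = [l for l in raw.splitlines() if not l.startswith(comment)][skiprows:]
table = list(csv.reader(io.StringIO('\n'.join(lines)), delimiter=','))
print(table)  # [['col_a', 'col_b'], ['1', '2'], ['3', '4']]
```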
Tabular

Allows to map a quantity to a row of a tabular data file. Should only be used in conjunction with `tabular_parser`.

name | type | description
---|---|---
`name` | `str` | The column name that should be mapped to the annotated quantity. Has to be the same string that is used in the header, i.e. the first `.csv` line or the first Excel file row. For Excel files with multiple sheets, the name can have the form `<sheet name>/<column name>`; otherwise, only the first sheet is used. Has to be applied to the quantity that a column should be mapped to.
`unit` | `str` | The unit of the value in the file. Has to be compatible with the annotated quantity's unit and will be used to automatically convert the value. If this is not defined, the values will not be converted. Has to be applied to the quantity that a column should be mapped to.
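The `<sheet name>/<column name>` convention can be resolved with a simple split on `/`; if no separator is present, the first sheet is assumed. A small illustrative helper (not part of NOMAD's API; the default sheet name is an assumption for the example):

```python
def resolve_target(name, default_sheet='Sheet1'):
    """Split a `tabular` name into (sheet, column), defaulting to the first sheet."""
    if '/' in name:
        sheet, column = name.split('/', 1)
        return sheet, column
    return default_sheet, name

print(resolve_target('My_Sheet/My_Column'))  # ('My_Sheet', 'My_Column')
print(resolve_target('My_Column'))           # ('Sheet1', 'My_Column')
```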
Plot Annotation
This annotation can be used to add a plot to a section or quantity. Example:

```python
class Evaporation(MSection):
    m_def = Section(a_plot={
        'label': 'Temperature and Pressure',
        'x': 'process_time',
        'y': ['./substrate_temperature', './chamber_pressure'],
        'config': {
            'editable': True,
            'scrollZoom': False
        }
    })
    process_time = Quantity(type=float, shape=['*'], unit='s')
    substrate_temperature = Quantity(type=float, shape=['*'], unit='K')
    chamber_pressure = Quantity(type=float, shape=['*'], unit='Pa')
```
You can create multi-line plots by using lists for the properties `y` (and `x`). You can either have multiple sets of `y`-values over a single set of `x`-values, or pairs of `x` and `y` values. For this purpose, the annotation properties `x` and `y` can reference a single quantity or a list of quantities.

For repeating sub-sections, the section instance can be selected with an index, e.g. `sub_section_name/2/parameter_name`, or with a slice notation `start:stop`, where negative values index from the end of the array, e.g. `sub_section_name/1:-5/parameter_name`.
The interactive examples of the plot annotations can be found here.
name | type | description
---|---|---
`label` | `str` | Is passed to plotly to define the label of the plot.
`x` | `Union[List[str], str]` | A path or list of paths to the x-axis values. Each path is a `/`-separated list of sub-section and quantity names that leads from the annotation section to the quantity. Repeating sub-sections are indexed between two `/`s with an integer or a slice `start:stop`.
`y` | `Union[List[str], str]` | A path or list of paths to the y-axis values. Each path is a `/`-separated list of sub-section and quantity names that leads from the annotation section to the quantity. Repeating sub-sections are indexed between two `/`s with an integer or a slice `start:stop`.
`lines` | `List[dict]` | A list of dicts passed as traces to plotly to configure the lines of the plot. See https://plotly.com/javascript/reference/scatter/ for details.
`layout` | `dict` | A dict passed as layout to plotly to configure the plot layout. See https://plotly.com/javascript/reference/layout/ for details.
`config` | `dict` | A dict passed as config to plotly to configure the plot functionality. See https://plotly.com/javascript/configuration-options/ for details.
Built-in base sections for ELNs
Coming soon ...
Custom normalizers
For custom schemas, you might want to add custom normalizers. All files are parsed and normalized when they are uploaded or changed. The NOMAD metainfo Python interface allows you to add functions that are called when your data is normalized.
Here is an example:
```python
from nomad.datamodel import EntryData, ArchiveSection
from nomad.metainfo.metainfo import Quantity, Datetime, SubSection


class Sample(ArchiveSection):
    added_date = Quantity(type=Datetime)
    formula = Quantity(type=str)
    sample_id = Quantity(type=str)

    def normalize(self, archive, logger):
        super(Sample, self).normalize(archive, logger)
        if self.sample_id is None:
            self.sample_id = f'{self.added_date}--{self.formula}'


class SampleDatabase(EntryData):
    samples = SubSection(section=Sample, repeats=True)
```
To add a `normalize` function, your section has to inherit from `ArchiveSection`, which provides the base for this functionality. Now you can override the `normalize` function and add your own behavior. Make sure to call the `super` implementation properly to support schemas with multiple inheritance.
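The reason the `super` call matters is Python's method resolution order (MRO): with cooperative `super()` calls, every base class's `normalize` runs exactly once, even under multiple inheritance. A plain-Python sketch of that pattern (the class names here are stand-ins, not NOMAD classes):

```python
class ArchiveSectionLike:
    """Stand-in for ArchiveSection: the end of the normalize chain."""
    def normalize(self, archive, logger):
        pass

class WithId(ArchiveSectionLike):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        archive.append('WithId')

class WithDate(ArchiveSectionLike):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        archive.append('WithDate')

class Sample(WithId, WithDate):
    def normalize(self, archive, logger):
        # Calling super() once walks the whole MRO: Sample -> WithId -> WithDate.
        super().normalize(archive, logger)
        archive.append('Sample')

calls = []
Sample().normalize(calls, None)
print(calls)  # ['WithDate', 'WithId', 'Sample']
```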
If we parse an archive like this:
```yaml
data:
  m_def: 'examples.archive.custom_schema.SampleDatabase'
  samples:
    - formula: NaCl
      added_date: '2022-06-18'
```
we will get a final normalized archive that contains our data like this:
```json
{
  "data": {
    "m_def": "examples.archive.custom_schema.SampleDatabase",
    "samples": [
      {
        "added_date": "2022-06-18T00:00:00+00:00",
        "formula": "NaCl",
        "sample_id": "2022-06-18 00:00:00+00:00--NaCl"
      }
    ]
  }
}
```
Third-party integration

NOMAD offers integration with third-party ELN providers, simplifying the process of connecting and interacting with external platforms. The three main external ELN solutions integrated into NOMAD are elabFTW, Labfolder, and Chemotion. The process of data retrieval and data mapping onto NOMAD's schema varies for each of these third-party ELN providers, as each inherently allows for certain ways of communicating with its database. Below you can find a how-to guide on importing your data from each of these external repositories.
elabFTW integration

elabFTW is part of the ELN Consortium and supports exporting experimental data in the ELN file format. The ELNFileFormat is a zipped file that contains the metadata of your elabFTW project along with all other data associated with your experiments.
How to import elabFTW data into NOMAD:

Go to your elabFTW experiment and export your project as an `ELN Archive`. Save the file to your filesystem under your preferred name and location (keep the `.eln` extension intact). To parse your elabFTW data into NOMAD, go to the upload page of NOMAD and create a new upload. On the overview page, upload your exported file (either by drag-and-dropping it into the "click or drop files" box or by navigating to the path where you stored the file). This triggers NOMAD's parser to create as many new entries in this upload as there are experiments in your elabFTW project.

You can inspect the parsed data of each of your entries (experiments) by going to the `DATA` tab of each entry page. Under the `Entry` column, click on the `data` section. A new lane titled `ElabFTW Project Import` should now be visible. Under this section, (some of) the metadata of your project is listed. There are two sub-sections: 1) `experiment_data`, and 2) `experiment_files`. The `experiment_data` section contains detailed information about the given elabFTW experiment, such as links to external resources and extra fields. The `experiment_files` section is a list of sub-sections containing metadata and additional information about the files associated with the experiment.
Labfolder integration

Labfolder provides API endpoints to interact with your ELN data. NOMAD makes API calls to retrieve, parse, and map the data from your Labfolder instance/database to a NOMAD schema. To do so, the necessary information is listed below:

- `project_url`: The URL address of the Labfolder project. It should follow this pattern: `https://your-labfolder-server/eln/notebook#?projectIds=your-project-id`. This is used to set up the server and initialize the NOMAD schema.
- `labfolder_email`: The email (user credential) to authenticate and log in the user. Important note: this information is discarded once the authentication process is finished.
- `password`: The password (user credential) to authenticate and log in the user. Important note: this information is discarded once the authentication process is finished.
How to import Labfolder data into NOMAD:

To get your data transferred to NOMAD, first go to NOMAD's upload page and create a new upload. Then click on the `CREATE ENTRY` button. Select a name for your entry and pick `Labfolder Project Import` from the `Built-in schema` dropdown menu. Then click on `CREATE`. This creates an entry where you can insert your user information. Fill in the `Project url`, `Labfolder email`, and `password` fields. Once completed, click on the save icon in the top-right corner of the screen. This triggers NOMAD's parser to populate the schema of the current ELN. Now the metadata and all files of your Labfolder project should be populated in this entry.
The `elements` section lists all the data and files in your project. There are six main data types returned by Labfolder's API: `DATA`, `FILE`, `IMAGE`, `TABLE`, `TEXT`, and `WELLPLATE`. A `DATA` element is a special Labfolder element where the data is structured in JSON format. Every `DATA` element in NOMAD has a special quantity called `labfolder_data`, which is a flattened and aggregated version of the data content. An `IMAGE` element contains information on any image stored in your Labfolder project. A `TEXT` element contains the data of any text field in your Labfolder project.
Chemotion integration
Coming soon