# Manifest-based Strands

Frequently, twins operate on files containing some kind of data. These files need to be made accessible to the code running in the twin so that their contents can be read and processed. Conversely, a twin might produce an output dataset whose contents must be understandable to users.

The `configuration_manifest`, `input_manifest` and `output_manifest` strands describe what kinds of datasets (and associated files) are required or produced.

Note

Files are always contained in datasets, even if there’s only one file. This keeps nitty-gritty file metadata separate from the more meaningful, higher-level metadata, like what a dataset is for.

The `configuration_manifest` strand describes datasets/files that are required at startup of the twin or service. They typically contain a resource that the twin might use across many analyses.

For example, a twin might predict failure for a particular component, given an image. It will require a trained ML model (saved as a *.pickle or *.json file). While many thousands of predictions might be made over the period that the twin is deployed, all of them use this version of the model, so the model file is supplied at startup.
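
At startup, the twin's code then only needs to locate and unpickle the supplied model file. A minimal sketch (the directory layout and helper name are illustrative, not part of twined):

```python
import pickle
from pathlib import Path


def load_trained_model(dataset_dir):
    """Load the single trained model file supplied in the configuration dataset.

    Assumes the dataset directory contains exactly one *.pickle file, as
    described by the configuration_manifest strand.
    """
    candidates = sorted(Path(dataset_dir).glob("*.pickle"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"Expected one *.pickle model file, found {len(candidates)}"
        )
    with open(candidates[0], "rb") as f:
        return pickle.load(f)
```

The loaded model would then be reused, unchanged, for every prediction the twin makes while deployed.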

The `input_manifest` strand describes files that are made available to the twin to run a particular analysis. Each analysis will likely have different input datasets.

For example, a twin might be passed a dataset of LiDAR *.scn files and be expected to compute atmospheric flow properties as a timeseries (which might be returned in the output values for onward processing and storage).

The `output_manifest` strand describes files created by the twin during an analysis, which are tagged and stored as datasets for some onward purpose. This strand is not used for sourcing data; it enables users or other services to understand the appropriate search terms to retrieve the datasets produced.

## Describing Manifests

Manifest-based strands are a description of what files are needed, NOT a list of specific files or datasets. This is a tricky concept, but important, since services should be reusable and applicable to a range of similar datasets.

The purpose of the manifest strands is to help a wider system supply appropriate datafiles to digital twins.

The manifest strands therefore use tagging: they contain a `filters` field, which should contain valid Apache Lucene search syntax. This is a powerful syntax, whose tagging features allow us to specify incredibly broad, or extremely narrow, searches (even down to a known unique result). See the examples below.

Note

Tagging syntax is extremely powerful. Below, you’ll see how this enables a digital twin to specify things like:

“OK, I need this digital twin to always have access to a model file for a particular system, containing trained model data”

“Uh, so I need an ordered sequence of files, that are CSV files from a meteorological mast.”

This allows twined to check that the input files contain what is needed, enables quick and easy extraction of subgroups or particular sequences of files within a dataset, and enables management systems to map candidate datasets to twins that might be used to process them.
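
For instance, once a matching dataset has been delivered, pulling out an ordered subgroup of its files is straightforward. A plain-Python sketch over the manifest structure used in the examples here (not a twined API):

```python
def ordered_files(dataset, extension):
    """Return the files in a dataset with the given extension, ordered by sequence number."""
    matching = [f for f in dataset["files"] if f["extension"] == extension]
    return sorted(matching, key=lambda f: f["sequence"])
```

This is the kind of extraction the `sequence:>=0` filters below are designed to enable.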

Here we construct an extremely tight filter, which connects this digital twin to datasets from a specific system.

Twine containing this strand:

```js
{
    // Manifest strands contain lists, with one entry for each required dataset
    "configuration_manifest": [
        {
            // Once the inputs are validated, your analysis program can use this key to access the dataset
            "key": "trained_model",
            // General notes, which are helpful as a reminder to users of the service
            "purpose": "The trained classifier",
            // Issues a strict search for data provided by megacorp, containing *.mdl files tagged as
            // classifiers for blade damage on system abc123
            "filters": "organisation:megacorp AND tags:(classifier AND damage AND system:abc123) AND files:(extension:mdl)"
        }
    ]
}
```


A matching file manifest:

```js
{
    "datasets": [
        {
            "name": "training data for system abc123",
            "organisation": "megacorp",
            "tags": "classifier, damage, system:abc123",
            "files": [
                {
                    "cluster": 0,
                    "sequence": 0,
                    "extension": "mdl",
                    "tags": "",
                    "posix_timestamp": 0,
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "last_modified": "2019-02-28T22:40:30.533005Z",
                    "size_bytes": 59684813,
                    "sha-512/256": "somesha"
                }
            ]
        }
    ]
}
```
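
To see why this manifest satisfies the filter above, consider a much-simplified tag check. Real Lucene parsing is far richer; this sketch only handles a conjunction of required tags against the comma-separated `tags` string:

```python
def has_required_tags(dataset, required):
    """Check that every required tag appears in the dataset's comma-separated tags string."""
    tags = {t.strip() for t in dataset["tags"].split(",")}
    return all(tag in tags for tag in required)
```

A management system doing this kind of matching at scale would use a proper search index rather than scanning manifests directly.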


Here we specify that two datasets (and all or some of the files associated with them) are required for a service that cross-checks meteorological mast data and power output data for a wind farm.

Twine containing this strand:

```js
{
    // Manifest strands contain lists, with one entry for each required dataset
    "input_manifest": [
        {
            // Once the inputs are validated, your analysis program can use this key to access the dataset
            "key": "met_mast_data",
            // General notes, which are helpful as a reminder to users of the service
            "purpose": "A dataset containing meteorological mast data",
            // Searches datasets which are tagged "met*" (allowing for "met" and "meteorological"), whose
            // files are CSVs in a numbered sequence, and which occur at a particular location
            "filters": "tags:(met* AND mast) AND files:(extension:csv AND sequence:>=0) AND location:108346"
        },
        {
            "key": "scada_data",
            "purpose": "A dataset containing scada data",
            // The organisation: filter refines the search to datasets owned by a particular organisation handle
            "filters": "organisation:megacorp AND tags:(scada AND mast) AND files:(extension:csv AND sequence:>=0)"
        }
    ]
}
```


A matching file manifest:

```js
{
    "datasets": [
        {
            "name": "meteorological mast dataset",
            "tags": "met, mast, wind, location:108346",
            "files": [
                {
                    "cluster": 0,
                    "sequence": 0,
                    "extension": "csv",
                    "tags": "",
                    "posix_timestamp": 1551393630,
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "last_modified": "2019-02-28T22:40:30.533005Z",
                    "name": "mast_1.csv",
                    "size_bytes": 59684813,
                    "sha-512/256": "somesha"
                },
                {
                    "cluster": 0,
                    "sequence": 1,
                    "extension": "csv",
                    "tags": "",
                    "posix_timestamp": 1551394230,
                    "id": "bbff07bc-7c19-4ed5-be6d-a6546eae8e45",
                    "last_modified": "2019-02-28T22:50:40.633001Z",
                    "name": "mast_2.csv",
                    "size_bytes": 59684813,
                    "sha-512/256": "someothersha"
                }
            ]
        },
        {
            "id": "5cf9e445-c288-4567-9072-edc31003b022",
            "tags": "wind, turbine, scada, system:ab32, location:108346",
            "files": [
                {
                    "cluster": 0,
                    "sequence": 0,
                    "extension": "csv",
                    "tags": "",
                    "posix_timestamp": 1551393600,
                    "id": "78fa511f-3e28-4bc2-aa28-7b6a2e8e6ef9",
                    "last_modified": "2019-02-28T22:40:00.000000Z",
                    "name": "export_1.csv",
                    "size_bytes": 88684813,
                    "sha-512/256": "somesha"
                },
                {
                    "cluster": 0,
                    "sequence": 1,
                    "extension": "csv",
                    "tags": "",
                    "posix_timestamp": 1551394200,
                    "id": "204d7316-7ae6-45e3-8f90-443225b21226",
                    "last_modified": "2019-02-28T22:50:00.000000Z",
                    "name": "export_2.csv",
                    "size_bytes": 88684813,
                    "sha-512/256": "someothersha"
                }
            ]
        }
    ]
}
```
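
A cross-checking service like this one typically wants to walk the two datasets in step. Pairing files by their sequence number might look like this (a plain-Python sketch over the manifest structure above, not a twined API):

```python
def pair_by_sequence(dataset_a, dataset_b):
    """Pair files from two datasets that share the same sequence number."""
    by_seq = {f["sequence"]: f for f in dataset_b["files"]}
    return [
        (f, by_seq[f["sequence"]])
        for f in sorted(dataset_a["files"], key=lambda f: f["sequence"])
        if f["sequence"] in by_seq
    ]
```

Each pair then holds the met mast file and the scada export covering the same period, ready for correlation.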


Here the twin outputs figure files (with *.fig extension), enabling a visual check of the correlation between met mast and scada data.

Twine containing this strand:

```js
{
    "output_manifest": [
        {
            // Twined will prepare a manifest with this key, which you can add to during the analysis or once it's complete
            "key": "correlation_figures",
            // General notes, which are helpful as a reminder to users of the service
            "purpose": "A dataset containing figures showing correlations between mast and scada data",
            // Twined will check that the output file manifest has tags appropriate to the filters
            "filters": "tags:(met* AND scada AND correlation) AND files:(extension:fig) AND location:*"
        }
    ]
}
```


A matching file manifest:

```js
{
    "datasets": [
        {
            "name": "visual cross check data",
            "organisation": "megacorp",
            "tags": "figure, met, mast, scada, check, location:108346",
            "files": [
                {
                    "cluster": 0,
                    "sequence": 0,
                    "extension": "fig",
                    "tags": "",
                    "posix_timestamp": 1551394800,
                    "id": "38f77fe2-c8c0-49d1-a08c-0928d53a742f",
                    "last_modified": "2019-02-28T23:00:00.000000Z",
                    "name": "cross_check.fig",
                    "size_bytes": 59684813,
                    "sha-512/256": "somesha"
                }
            ]
        }
    ]
}
```
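
During the analysis, the twin appends entries like the one above to its output manifest. A minimal sketch of building such an entry (field names follow the manifest examples in this section; the helper itself is illustrative, not a twined API):

```python
import uuid
from datetime import datetime, timezone


def add_output_file(dataset, name, extension, size_bytes, sequence=0, cluster=0):
    """Append a file entry, in the manifest format shown above, to a dataset's file list."""
    now = datetime.now(timezone.utc)
    entry = {
        "cluster": cluster,
        "sequence": sequence,
        "extension": extension,
        "tags": "",
        "posix_timestamp": int(now.timestamp()),
        "id": str(uuid.uuid4()),  # each file gets a fresh UUID
        "last_modified": now.isoformat(),
        "name": name,
        "size_bytes": size_bytes,
    }
    dataset.setdefault("files", []).append(entry)
    return entry
```

The dataset's own `tags` must then satisfy the `output_manifest` filters for twined's validation to pass.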



It’s the job of twined to make sure of two things:

1. that the twine file itself is valid,
2. that data supplied to (or produced by) the twin is valid against it.

### File data (input, output)

Files are not streamed directly to the digital twin (this would require extreme bandwidth in whatever system is orchestrating all the twins). Instead, files should be made available on the local storage system; i.e. a volume mounted to whatever container or VM the digital twin runs in.

Groups of files are described by a manifest, where a manifest is (in essence) a catalogue of files in a dataset.

A digital twin might receive multiple manifests, if it uses multiple datasets. For example, it could use a 3D point cloud LiDAR dataset, and a meteorological dataset.

```js
{
    "manifests": [
        {
            "type": "dataset",
            "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe",  // UUID of the manifest
            "files": [
                {
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",  // UUID of that file
                    "sha1": "askjnkdfoisdnfkjnkjsnd",  // for quality control to check correctness of file contents
                    "name": "Lidar - 4 to 10 Dec.csv",
                    "path": "local/file/path/to/folder/containing/it/",
                    "type": "csv",
                    "size_bytes": 59684813,
                    "tags": "lidar, helpful, information, like, sequence:1"  // Searchable, parsable and filterable
                },
                {
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "name": "Lidar - 11 to 18 Dec.csv",
                    "path": "local/file/path/to/folder/containing/it/",
                    "type": "csv",
                    "size_bytes": 59684813,
                    "tags": "lidar, helpful, information, like, sequence:2"  // Searchable, parsable and filterable
                },
                {
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "name": "Lidar report.pdf",
                    "path": "local/file/path/to/folder/containing/it/",
                    "type": "pdf",
```