Manifest-based Strands

Frequently, twins operate on files containing some kind of data. These files need to be made accessible to the code running in the twin, so that their contents can be read and processed. Conversely, a twin might produce an output dataset whose files must be understandable to users.

The configuration_manifest, input_manifest and output_manifest strands describe what kind of datasets (and associated files) are required / produced.

Note

Files are always contained in datasets, even if there’s only one file. This keeps nitty-gritty file metadata separate from the more meaningful, higher-level metadata, like what a dataset is for.

Configuration Manifest

The configuration_manifest strand describes datasets/files that are required at startup of the twin / service. They typically contain a resource that the twin might use across many analyses.

For example, a twin might predict failure for a particular component, given an image. It will require a trained ML model (saved in a *.pickle or *.json). While many thousands of predictions might be done over the period that the twin is deployed, all predictions are done using this version of the model - so the model file is supplied at startup.

Describing Manifests

Manifest-based strands describe what files are needed. Their purpose is to help a wider system supply datafiles to digital twins.

Show twine containing this strand

{
  // Manifest strands contain lists, with one entry for each required dataset
  "configuration_manifest": {
    "datasets": [
      {
        // Once the inputs are validated, your analysis program can use this key to access the dataset
        "key": "trained_model",
        // General notes, which are helpful as a reminder to users of the service
        "purpose": "The trained classifier"
      }
    ]
  }
}

Show a matching file manifest

{
  "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
  "datasets": [
    {
      "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
      "name": "training data for system abc123",
      "organisation": "megacorp",
      "tags": {"system": "abc123"},
      "labels": ["classifier", "damage"],
      "files": [
        {
          "path": "datasets/7ead7669/blade_damage.mdl",
          "cluster": 0,
          "sequence": 0,
          "extension": "mdl",
          "tags": {},
          "labels": [],
          "posix_timestamp": 0,
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
          "last_modified": "2019-02-28T22:40:30.533005Z",
          "name": "blade_damage.mdl",
          "size_bytes": 59684813,
          "sha-512/256": "somesha"
        }
      ]
    }
  ]
}
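To illustrate how a manifest pairs with its strand, here is a minimal sketch in plain Python. It is not the twined API, and pairing strand entries with manifest datasets by list position is an assumption made purely for illustration; it simply shows how validated inputs could be exposed to analysis code by key.

```python
import json

# The twine strand and matching manifest from the examples above
# (comments removed and files abbreviated so they parse as standard JSON).
strand = json.loads("""
{
  "datasets": [
    {"key": "trained_model", "purpose": "The trained classifier"}
  ]
}
""")

manifest = json.loads("""
{
  "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
  "datasets": [
    {
      "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
      "name": "training data for system abc123",
      "files": [
        {"name": "blade_damage.mdl", "path": "datasets/7ead7669/blade_damage.mdl"}
      ]
    }
  ]
}
""")

# Pair each strand entry with a manifest dataset (positional pairing is an
# illustrative assumption, not twined's actual matching rule), giving the
# analysis program access to each dataset by its key.
datasets_by_key = {
    entry["key"]: dataset
    for entry, dataset in zip(strand["datasets"], manifest["datasets"])
}

model_file = datasets_by_key["trained_model"]["files"][0]["path"]
```

With that mapping in place, the analysis program never needs to know dataset names or ids, only the keys declared in the twine.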

File tag templates

Datafiles can be tagged with key-value pairs of relevant metadata that can be used in analyses. Certain datasets might need one set of metadata on each file, while others might need a different set. The required (or optional) file tags can be specified in the twine in the file_tags_template property of each dataset of any manifest strand. Each file in the corresponding manifest strand is then validated against its dataset’s file tag template to ensure the required tags are present.

The example below is for an input manifest, but the format is the same for configuration and output manifests.

Show twine containing a manifest strand with a file tag template

{
  "input_manifest": {
    "datasets": [
      {
        "key": "met_mast_data",
        "purpose": "A dataset containing meteorological mast data",
        "file_tags_template": {
          "type": "object",
          "properties": {
            "manufacturer": {"type": "string"},
            "height": {"type": "number"},
            "is_recycled": {"type": "boolean"}
          },
          "required": ["manufacturer", "height", "is_recycled"]
        }
      }
    ]
  }
}

Show a matching file manifest

{
  "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
  "datasets": [
    {
      "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
      "name": "met_mast_data",
      "tags": {},
      "labels": ["met", "mast", "wind"],
      "files": [
        {
          "path": "input/datasets/7ead7669/file_1.csv",
          "cluster": 0,
          "sequence": 0,
          "extension": "csv",
          "labels": ["mykeyword1", "mykeyword2"],
          "tags": {
            "manufacturer": "vestas",
            "height": 500,
            "is_recycled": true
          },
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
          "name": "file_1.csv"
        },
        {
          "path": "input/datasets/7ead7669/file_2.csv",
          "cluster": 0,
          "sequence": 1,
          "extension": "csv",
          "labels": [],
          "tags": {
            "manufacturer": "vestas",
            "height": 500,
            "is_recycled": true
          },
          "id": "bcff07bc-7c19-4ed5-be6d-a6546eae8e99",
          "name": "file_2.csv"
        }
      ]
    }
  ]
}
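The file_tags_template is standard JSON Schema, so each file's tags can be checked with any JSON Schema validator (e.g. the jsonschema package). As a dependency-free sketch, the two checks that matter in the example above, required keys and primitive types, might look like this (the helper names are ours, not twined's):

```python
# Minimal sketch of validating one file's tags against a file_tags_template.
# Real validation should use a full JSON Schema library; this only handles
# the "required" list and the primitive "type" keywords used above.

TEMPLATE = {
    "type": "object",
    "properties": {
        "manufacturer": {"type": "string"},
        "height": {"type": "number"},
        "is_recycled": {"type": "boolean"},
    },
    "required": ["manufacturer", "height", "is_recycled"],
}


def check_type(value, json_type):
    """Return True if `value` matches the JSON Schema primitive `json_type`."""
    if json_type == "boolean":
        return isinstance(value, bool)
    if json_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if json_type == "string":
        return isinstance(value, str)
    raise ValueError(f"Unsupported type keyword: {json_type}")


def validate_tags(tags, template):
    """Collect template violations for one file's tags; an empty list means valid."""
    errors = [f"missing required tag: {key}" for key in template["required"] if key not in tags]
    for key, value in tags.items():
        expected = template["properties"].get(key, {}).get("type")
        if expected and not check_type(value, expected):
            errors.append(f"tag {key!r} should be of type {expected!r}")
    return errors


valid_tags = {"manufacturer": "vestas", "height": 500, "is_recycled": True}
invalid_tags = {"manufacturer": "vestas", "height": "tall"}
```

Running `validate_tags` over every file in a dataset mirrors the validation twined performs: any file whose tags are missing a required key, or carry the wrong type, fails the manifest.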

How Filtering Works

It’s the job of twined to make sure of two things:

  1. that the twine file itself is valid,

  2. that the data supplied to (or produced by) the twin is valid against that twine.

File data (input, output)

Files are not streamed directly to the digital twin (this would require extreme bandwidth in whatever system is orchestrating all the twins). Instead, files should be made available on the local storage system; i.e. a volume mounted to whatever container or VM the digital twin runs in.

Groups of files are described by a manifest, where a manifest is (in essence) a catalogue of files in a dataset.

A digital twin might receive multiple manifests if it uses multiple datasets. For example, it could use a 3D point cloud LiDAR dataset and a meteorological dataset.

{
  "manifests": [
    {
      "type": "dataset",
      "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe",  // UUID of the manifest
      "files": [
        {
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",  // UUID of that file
          "sha1": "askjnkdfoisdnfkjnkjsnd",  // for quality control to check correctness of file contents
          "name": "Lidar - 4 to 10 Dec.csv",
          "path": "local/file/path/to/folder/containing/it/",
          "type": "csv",
          "metadata": {},
          "size_bytes": 59684813,
          "tags": {"special_number": 1},
          "labels": ["lidar", "helpful", "information", "like"]  // Searchable, parsable and filterable
        },
        {
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e87",
          "name": "Lidar - 11 to 18 Dec.csv",
          "path": "local/file/path/to/folder/containing/it/",
          "type": "csv",
          "metadata": {},
          "size_bytes": 59684813,
          "tags": {"special_number": 2},
          "labels": ["lidar", "helpful", "information", "like"]  // Searchable, parsable and filterable
        },
        {
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e88",
          "name": "Lidar report.pdf",
          "path": "local/file/path/to/folder/containing/it/",
          "type": "pdf",
          "metadata": {},
          "size_bytes": 484813,
          "tags": {},
          "labels": ["report"]  // Searchable, parsable and filterable
        }
      ]
    },
    {
      // ... another dataset manifest ...
    }
  ]
}
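Since a manifest is just a catalogue, filtering its files down to what an analysis needs is a plain traversal. A sketch over a cut-down version of the data above (file entries abbreviated; the helper name is ours, not twined's):

```python
# Filter a manifest's file catalogue by label — a sketch of the filtering
# described above, not twined's API. File entries are abbreviated.
manifests = [
    {
        "type": "dataset",
        "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe",
        "files": [
            {"name": "Lidar - 4 to 10 Dec.csv", "size_bytes": 59684813, "labels": ["lidar"]},
            {"name": "Lidar - 11 to 18 Dec.csv", "size_bytes": 59684813, "labels": ["lidar"]},
            {"name": "Lidar report.pdf", "size_bytes": 484813, "labels": ["report"]},
        ],
    }
]


def files_with_label(manifests, label):
    """Yield every file, across all manifests, carrying the given label."""
    for manifest in manifests:
        for file in manifest["files"]:
            if label in file["labels"]:
                yield file


lidar_files = list(files_with_label(manifests, "lidar"))
total_bytes = sum(file["size_bytes"] for file in lidar_files)
```

The same traversal generalises to filtering on tags (key-value pairs) rather than labels, which is where the file tag templates described earlier come in.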