Standardized provenance recording

  • Status: proposed

  • Deciders: sdruskat, skernchen, notactuallyfinn

  • Date: 2025-10-17

Technical story:

  • https://github.com/softwarepub/hermes/pull/442

  • https://github.com/softwarepub/hermes/issues/363

Context and Problem Statement

To make metadata traceable, and to resolve conflicts such as duplicate values based on the source each value came from, we need to record the provenance of metadata values in a standardized way. To achieve this, we use the PROV-O ontology serialized as JSON-LD. Additionally, HERMES should record as much of the provenance as possible centrally, i.e., as part of the core codebase, so that plugin developers do not have to supply their own provenance solutions.

To do this, we need to specify which provenance information is recorded and how the recording can be implemented in HERMES so that it is easy to use.

Considered Options

  • Provide HERMES API-methods that also document themselves

Decision Outcome

Chosen option:

Pros and Cons of the Options

Provide HERMES API-methods that also document themselves

Provide API methods for loading, writing, making web requests, etc. that record their own provenance.
Each method also takes the function that should perform the actual task, and only defines the framework in which the provenance data is recorded.
Like so:

class HermesPlugin:
    def load(self, func, path: str, *args, **kwargs):
        # TODO: handle and record byte formats properly
        # func is the caller-supplied reader, called as func(file, ...)
        with open(path) as fi:
            data = func(fi, *args, **kwargs)
        # prov is the central provenance recorder, which is still to be specified
        prov.record("load", path, func.__name__, data)  # also module of func
        return data

    def write(self, func, path: str, data, *args, **kwargs):
        # TODO: handle and record byte formats properly
        # func is the caller-supplied writer, called as func(file, data, ...)
        with open(path, "w") as fo:
            func(fo, data, *args, **kwargs)
        prov.record("write", path, func.__name__, data)  # also module of func

  • Good, because it allows provenance information to be recorded for the plugins

  • Good, because it does not make plugin development harder

  • Bad, because the API methods may not cover all I/O functionality that Python provides

  • Bad, because it doesn’t cover merging, mapping, etc.
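
To make this option more concrete, the following is a minimal sketch of how a self-documenting load method could work together with a central recorder. The ProvenanceRecorder class, the signature of its record method, and the fields it stores are assumptions made for illustration only; this ADR does not specify them yet. Unlike the snippet above, the sketch passes the function object itself to the recorder so that the module can be stored alongside the function name.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class ProvenanceRecorder:
    """Hypothetical central recorder; not part of the current HERMES API."""

    def __init__(self):
        self.records = []

    def record(self, activity, path, func, data):
        # Store the module of the callable together with its name.
        self.records.append({
            "activity": activity,
            "path": path,
            "function": f"{func.__module__}.{func.__name__}",
            "data_type": type(data).__name__,
            "time": datetime.now(timezone.utc).isoformat(),
        })


prov = ProvenanceRecorder()


class HermesPlugin:
    def load(self, func, path, *args, **kwargs):
        # The plugin chooses the reader; the framework records the provenance.
        with open(path) as fi:
            data = func(fi, *args, **kwargs)
        prov.record("load", path, func, data)
        return data


# Usage: a plugin passes the reader it wants to use, here json.load.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "codemeta.json")
    with open(path, "w") as fo:
        json.dump({"name": "example"}, fo)
    metadata = HermesPlugin().load(json.load, path)
```

In this sketch the plugin author writes a single call and never touches the recorder directly, which is the property the "Good" points above rely on. Merging and mapping steps are still not covered and would need their own hooks.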

All provenance information should be recorded in the following format, where additional properties of agents, activities and entities are values from suitable vocabularies (Schema.org, CodeMeta and potentially other schemas):


source: hermes-prov.drawio
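
As a rough illustration of how a single load activity could be expressed with PROV-O terms in JSON-LD, the sketch below builds such a document in Python. It is not derived from the diagram and may differ from the final model. The urn:example: identifiers, the function name load_activity_as_jsonld, and the choice of schema:name for the additional properties are all assumptions.

```python
import json


def load_activity_as_jsonld(path, function_name, plugin_name):
    """Describe one load activity with PROV-O terms as a JSON-LD dict.

    All identifiers below are hypothetical placeholders.
    """
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "schema": "https://schema.org/",
        },
        "@graph": [
            {
                # The file that was read.
                "@id": "urn:example:entity/source-file",
                "@type": "prov:Entity",
                "schema:name": path,
            },
            {
                # The plugin that triggered the activity.
                "@id": "urn:example:agent/plugin",
                "@type": "prov:SoftwareAgent",
                "schema:name": plugin_name,
            },
            {
                # The load step itself, linked to its input and its agent.
                "@id": "urn:example:activity/load",
                "@type": "prov:Activity",
                "schema:name": function_name,
                "prov:used": {"@id": "urn:example:entity/source-file"},
                "prov:wasAssociatedWith": {"@id": "urn:example:agent/plugin"},
            },
            {
                # The metadata produced by the load step.
                "@id": "urn:example:entity/loaded-metadata",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": "urn:example:activity/load"},
            },
        ],
    }


document = load_activity_as_jsonld("codemeta.json", "json.load", "example-plugin")
serialized = json.dumps(document, indent=2)
```

The three PROV-O core classes (entity, activity, agent) carry the structure, while descriptive properties come from Schema.org, in line with the vocabulary split described above.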