Record provenance of metadata
Status: accepted
Deciders: sdruskat, jkelling, led02, poikilotherm, skernchen
Date: 2023-11-15
Technical story: https://github.com/hermes-hmc/hermes/pull/40
Context and Problem Statement
To enable traceability of metadata, and resolution based on metadata sources in case of duplicates and similar conflicts, we need to record the provenance of metadata values. This requires specifying how provenance is recorded alongside the values themselves.
Considered Options
Internal comment field
Dedicated metadata field
Use PROV standard
Separate internal provenance model
Create wrapped JSON-LD entities and add our metadata (json-ld/json-ld.org#744)
Create non-standard JSON-LD extension with custom keywords
Decision Outcome
Chosen option: “Create non-standard JSON-LD extension with custom keywords”, because it comes out best in the comparison below.
Positive Consequences
We have a unified data model and keep both provenance and actual data in the same place
A unified data model requires less complex handling in the implementation, as there is no need to recombine data
It is not possible to lose the provenance information, as it is attached to every attribute stored in the model
Negative Consequences
We lose the ability to reuse the serialized internal data model as output files (CodeMeta) without further processing
We have to document the added keywords very well for plugin developers
We need to define an ontology for this (internal) provenance metadata
Pros and Cons of the Options
Internal comment field
This would be a comment attached to single metadata fields to record provenance.
Bad, because it is a non-standard way to record this kind of information, i.e., non-reusable
Bad, because it requires extra documentation effort
Dedicated metadata field
This would be an extra metadata field attached to each field, e.g., with a URI (source: https://repo.org/user/project/codemeta.json or similar).
Good, because it is a very generic way to specify the source of information
Bad, because that same genericity leaves the recorded provenance unstructured
Bad, because Non-standard way to record provenance
Use PROV standard
This attaches provenance information following PROV-O to metadata fields.
Good, because Standardized for provenance information
Good, because Not much extra documentation needed
Bad, because it is a roundabout way to describe relatively constrained cases (we would probably only use a few entities and prov:wasInformedBy or similar)
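As an illustration, a PROV-O annotation in JSON-LD could look roughly like the following sketch. The entity IRI for the metadata field is hypothetical; only the prov: terms come from the PROV-O vocabulary.

```python
# Sketch: attaching PROV-O provenance to a metadata field in JSON-LD,
# represented here as a Python dictionary. The @id of the metadata
# entity is illustrative only.
prov_graph = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@graph": [
        {
            "@id": "https://example.org/metadata/name",  # hypothetical entity IRI
            "prov:wasDerivedFrom": {
                "@id": "https://repo.org/user/project/codemeta.json"
            },
        }
    ],
}
```

This illustrates the objection above: a full PROV graph is heavyweight machinery when, in practice, only a source reference per field is needed.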
Separate internal provenance model
This would create a valid JSON-LD file serializing our internal data model, plus an auxiliary file with the provenance data.
Good, because keeping things separate enables direct reuse and validation of the data model file
Good, because serialization of the provenance data is free form and simple to do
Bad, because we need to re-combine provenance and metadata
Bad, because we have more files in the output which might confuse people
Bad, because not easy to debug when recombination fails
Create wrapped JSON-LD entities and add our metadata
This would work around the limitation of RDF and JSON-LD that value objects are non-extensible
Good, because standard compliant, still validates using standard validators
Bad, because very noisy in the output files
Bad, because it still needs back references to the object when using @id in graph objects
Bad, because it would require our own ontology and repeating any field ever needed (when keeping the original fields and not using a graph object)
Bad, because would require our own objects to keep the type and value separated, requiring reparsing when writing output files
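A sketch of such a wrapped entity, assuming a hypothetical hermes: ontology (prefix, IRI, and term names are placeholders, not the actual implementation):

```python
# Sketch: wrapping a plain value into a standard-compliant node object
# so that extra metadata can be attached. All "hermes:" terms and the
# ontology IRI are hypothetical.
wrapped = {
    "@context": {
        "codemeta": "https://doi.org/10.5063/schema/codemeta-2.0",
        "hermes": "https://example.org/hermes#",  # hypothetical ontology IRI
    },
    "codemeta:name": {
        "@type": "hermes:AnnotatedValue",
        "hermes:value": "hermes",
        "hermes:source": "https://repo.org/user/project/codemeta.json",
    },
}

# Writing output files requires re-parsing: the value must be unwrapped again.
name = wrapped["codemeta:name"]["hermes:value"]
```

The unwrapping step at the end is exactly the reparsing cost named as a drawback above.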
Create non-standard JSON-LD extension with custom keywords
This would work around the limitation of RDF and JSON-LD that value objects are non-extensible
Good, because easy to implement in our custom handling of the graph as Python dictionaries
Good, because not very noisy
Good, because keywords are the JSON-LD way to provide metadata already
Good, because very light extension and not touching definitions from other ontologies
Good, because we can still make use of an ontology for the metadata objects to provide an open/closed principle compliant structure
Bad, because not standard compliant
Bad, because needs filtering when writing output files
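To make the chosen option concrete, the following sketch uses a hypothetical @provenance keyword (the actual keyword names are defined by the implementation) and shows the filtering step needed before writing standard-compliant output files:

```python
# Sketch: annotating a field with a hypothetical non-standard
# "@provenance" keyword, and filtering it out again before writing
# standard-compliant output files (e.g., CodeMeta).

def strip_custom_keywords(node, keyword="@provenance"):
    """Recursively remove the non-standard keyword from a JSON-LD-like dict."""
    if isinstance(node, dict):
        return {k: strip_custom_keywords(v, keyword)
                for k, v in node.items() if k != keyword}
    if isinstance(node, list):
        return [strip_custom_keywords(v, keyword) for v in node]
    return node

annotated = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "name": {
        "@value": "hermes",
        # Non-standard extension: value objects are not extensible in
        # standard JSON-LD, which is exactly what this works around.
        "@provenance": {"source": "https://repo.org/user/project/codemeta.json"},
    },
}

clean = strip_custom_keywords(annotated)
```

Because the custom keyword lives inside the same dictionary structure as the data, the Python-side handling stays simple, at the cost of the filtering pass shown above.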