Reproducibility¶
A corpus in lairs is reproducible: the same corpus at the same revision yields the same records, byte for byte. That guarantee is the reason the on-disk store is a version-control system rather than a directory of files, and it is what lets a dataset version pin exact record content and carry that provenance through to an export. This page explains how the guarantee is constructed and what its limits are.
The Repository as schema-aware version control¶
The on-disk source of truth is a didactic Repository, which is
content-addressed, versioned, and git-like, sitting over panproto's VCS.
lairs wraps it so that a corpus snapshot is a commit and a named dataset
version is a tag. This gives three properties at once: reproducibility (a
tag pins exact record content), provenance (which pull introduced or
changed a record), and cheap diffing (across Layers versions or across
re-pulls).
One fact about the upstream surface shapes the wrapper, and it is worth
stating plainly because it differs from the obvious mental model.
didactic's Repository is a schema VCS first. add stages a Model class
(or a panproto Schema) and records the structural schema. It also
versions record values. add_data stages a record's value as committed
data, keyed by AT-URI and associated with that schema. lairs uses both on
every save. It writes each record's value as JSON under records/, stages
that value as committed data under its AT-URI, and stages the record
type's Model schema alongside it. A commit captures both: the values, as
committed data, and their structure, in the schema history. The values
committed at a revision are read back through data_at under their AT-URI
keys, so a tag pins an exact, byte-reproducible set of record values.
didactic 0.9.0 exposes tag creation, the committed-data write, and the
committed-data read on the public Repository surface, all of which the
wrapper uses directly. A revision-to-revision diff reconstructs the value
set at each revision by folding data_at over the revision's commit
ancestry, keyed by AT-URI, then compares the two sets by content.
Structural diffs across two record-type schemas (a Layers version bump,
say) go through didactic's schema diff. The reproducibility the data needs
is backed by committed data, read back at any revision, rather than
reconstructed from loose files.
Content addressing¶
Reproducibility rests on content addressing, which lairs uses at two
levels. Record values are stored content-addressed in the working tree,
so identical values share storage and a changed value is a different
object. Blob bytes are cached content-addressed by their content
identifier (CID), under blobs/<cid>, shared across corpora and fetched
lazily. Because addresses are derived from content, a revision that
resolves to the same record values and the same blob CIDs is the same
corpus. There is no separate notion of equality to maintain. didactic's
own immutability and content-addressed hashing make this sound at the
model level: every value is frozen, so its address cannot shift under
it.
A snapshot is a commit, a version is a tag¶
The version-control vocabulary maps directly onto corpus operations. A
corpus snapshot is a single commit over the working tree. A named dataset
version (v2.1, say) is a tag pinning that commit, and resolving the
tag later yields the exact record content committed at it. This is what
makes "load the corpus at revision v2.1" a precise instruction rather
than an approximate one: the tag is an immutable pointer to a
content-addressed snapshot.
It is also what makes authoring a git-like round trip. pull ingests
existing PDS records into a Repository, an author commits and tags
locally, and publish diffs the target revision against what is already
on the PDS and emits only the writes needed to make the PDS match. The
revision is the unit of publication, so what reaches a PDS is always a
named, diffable state.
In-record reproducibility metadata¶
The store guarantee above is about record values: the same revision
yields the same bytes. A record may also carry reproducibility metadata
in its value, describing how the artifact it represents was produced.
That metadata is the ReproducibilityInfo def (code URI, commit hash,
command, environment, random seed). It is a shared def, carried by the
produce records that release a computational artifact (the corpus, the
annotation layer, the segmentation, the cluster set, the alignment, the
edge set, the experiment definition) as well as by the eprint data link,
rather than living only on eprints. The two are complementary: the store
pins what the records contain, and ReproducibilityInfo records how a
producer would regenerate the artifact those records describe.
Arrow views are rebuildable derivations¶
Fast ML access is served by materialized Arrow and Parquet views: an
expressions table, an exploded annotations table, and per-record-type
tables, with anchors flattened into typed columns. These are derived
from the Repository and are explicitly never the source of truth.
materialize writes them, and they can always be regenerated from the
committed records. Treating them as a cache rather than as canonical data
is what keeps the reproducibility guarantee intact: there is exactly one
authoritative copy of the data (the Repository), and the columnar views
are a rebuildable projection of it. A consumer can delete the views and
lose nothing but the time to rebuild them.
Provenance carried through to exports¶
Because a revision pins exact record content, it is also the unit of
provenance. The vendored-lexicon manifest records the source revision and
a content hash of the lexicon tree, each generated module embeds that
hash, and a corpus revision pins the record CIDs. An export carries this
bundle forward rather than copying data away from its source: an
experiment-tracking hook logs a Repository revision as an artifact, not
a copy, so a logged run pins exact record content, and a dataset pushed
to an external hub carries a provenance card naming the corpus AT-URI,
the Repository revision and tag, the lexicon manifest hash, the Layers
version, and a license identifier supplied by the caller from the corpus
record. (The card stores that license string verbatim; the structured
licensing to SPDX projection, expression else the first license's
slug, belongs to the discovery summary, not the export card.) The external
copy is a mirror, and the PDS
and the Repository stay canonical. Reproducibility therefore does not
stop at the store boundary. It travels with the data wherever an
adapter takes it.
For the operations (committing, tagging, diffing, and materializing) see the store guide. For how exports bind to the revision rather than to a copy, see integrations.