Store

The store holds records and their derived views: an in-memory pool with cross-reference resolution, a schematic version-control repository, the rebuildable Arrow views, and a content-addressed blob cache.

Pool

An in-memory pool keyed by AT-URI, resolving cross-refs and back-refs over the loaded record set.

lairs.store.pool

In-memory model pool with cross-reference resolution.

Maps AT-URIs to model instances and resolves cross-refs and back-refs, built on top of didactic's :class:didactic.api.ModelPool and back-ref machinery.

The pool is the default surface for "load a corpus and work with it now". It keeps every loaded record indexed by its AT-URI, resolves the AT-URI strings that records carry as cross-references back to model instances, and answers back-reference queries (which records point at a given target). Resolution degrades gracefully: when a referenced AT-URI is not present in the pool, the string is preserved and resolve returns None rather than raising.
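As a rough illustration of the lookup behaviour described above, the sketch below models a pool as a dict keyed by AT-URI. It is not the lairs implementation: TinyPool and the example AT-URIs are hypothetical names introduced here, and plain dicts stand in for model instances.

```python
from __future__ import annotations


class TinyPool:
    """Hypothetical stand-in for an AT-URI-keyed pool (illustration only)."""

    def __init__(self) -> None:
        # A dict keeps keys in insertion order.
        self._by_uri: dict[str, dict] = {}

    def add(self, uri: str, record: dict) -> None:
        # Adding the same AT-URI again replaces the earlier entry.
        self._by_uri[uri] = record

    def resolve(self, ref: str) -> dict | None:
        # A missing target yields None instead of raising.
        return self._by_uri.get(ref)


pool = TinyPool()
pool.add("at://did:example:alice/org.example.expression/1", {"text": "hello"})

loaded = pool.resolve("at://did:example:alice/org.example.expression/1")
missing = pool.resolve("at://did:example:alice/org.example.expression/2")
```

A caller that receives None can keep working with the AT-URI string it already holds.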

ModelPool

ModelPool()

An in-memory pool of records keyed by AT-URI.

Wraps a :class:didactic.api.ModelPool (which indexes instances by their concrete class) with an AT-URI index, so records can be looked up by their canonical address and their cross-references resolved.

ATTRIBUTE DESCRIPTION
inner

The underlying class-indexed pool that backs all_of-style queries.

TYPE: ModelPool

uris

uris() -> list[str]

Return the AT-URIs of every record in the pool.

RETURNS DESCRIPTION
list of str

The AT-URIs, in insertion order.

models

models() -> list[Model]

Return every model instance held in the pool.

RETURNS DESCRIPTION
list of didactic.api.Model

The models, in insertion order.

add

add(uri: str, model: Model) -> None

Add a model to the pool under its AT-URI.

Adding the same AT-URI twice replaces the earlier instance in the AT-URI index. The model is also registered with the underlying class-indexed didactic pool.

PARAMETER DESCRIPTION
uri

The AT-URI of the record.

TYPE: str

model

The decoded model instance.

TYPE: Model

get

get(uri: str) -> Model | None

Return the model stored under uri, or None if absent.

PARAMETER DESCRIPTION
uri

The AT-URI to look up.

TYPE: str

RETURNS DESCRIPTION
Model or None

The model, or None when uri is not in the pool.

resolve

resolve(ref: str) -> Model | None

Resolve a reference to its target model.

A reference is an AT-URI string (the same form held by dx.Ref fields and free AT-URI fields). Resolution degrades gracefully: when the target is not loaded, None is returned and the caller may keep using the AT-URI string.

PARAMETER DESCRIPTION
ref

The AT-URI to resolve.

TYPE: str

RETURNS DESCRIPTION
Model or None

The target model, or None when it is not present in the pool.

refs_of

refs_of(uri: str) -> list[str]

Return the AT-URIs that the record at uri points at.

Walks the dumped record generically, collecting every AT-URI-shaped string nested anywhere in its fields (including embedded objects, arrays, and union members). Duplicate targets are reported once, in first-seen order, and a record never lists itself.

PARAMETER DESCRIPTION
uri

The AT-URI of the referring record.

TYPE: str

RETURNS DESCRIPTION
list of str

The distinct AT-URIs referenced by the record, or an empty list when the record is absent.
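The generic walk behind refs_of can be pictured with the sketch below. It is an assumption-laden illustration rather than the library's code: collect_at_uris is a hypothetical helper, and "AT-URI-shaped" is approximated as "a string starting with at://".

```python
from __future__ import annotations


def collect_at_uris(value: object, *, exclude: str | None = None) -> list[str]:
    """Collect AT-URI-shaped strings nested anywhere in a JSON-like value."""
    seen: dict[str, None] = {}  # a dict used as an insertion-ordered set

    def walk(node: object) -> None:
        if isinstance(node, str):
            # Assumption: any string beginning with at:// counts as an AT-URI.
            if node.startswith("at://") and node != exclude:
                seen.setdefault(node, None)
        elif isinstance(node, dict):
            for child in node.values():
                walk(child)
        elif isinstance(node, (list, tuple)):
            for child in node:
                walk(child)

    walk(value)
    return list(seen)


own_uri = "at://did:example:a/org.example.layer/9"
record = {
    "source": "at://did:example:a/org.example.expression/1",
    "items": [
        {"target": "at://did:example:a/org.example.expression/2"},
        {"target": "at://did:example:a/org.example.expression/1"},  # duplicate
    ],
    "self": own_uri,  # excluded: a record never lists itself
}
refs = collect_at_uris(record, exclude=own_uri)
```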

backrefs

backrefs(target: str) -> list[Model]

List the models that reference a target.

Scans every record in the pool and returns those that hold the target AT-URI anywhere in their fields. The target itself need not be loaded: back-references to an AT-URI that is absent from the pool are still returned, so inbound links to a not-yet-loaded record are discoverable.

PARAMETER DESCRIPTION
target

The AT-URI of the referenced record.

TYPE: str

RETURNS DESCRIPTION
list of didactic.api.Model

The referring models, in insertion order.
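A minimal sketch of such a scan follows, assuming records are plain JSON-like dicts held in an insertion-ordered mapping. The helper names mentions and scan_backrefs are hypothetical and only illustrate the idea.

```python
from __future__ import annotations


def mentions(value: object, target: str) -> bool:
    """Return True if target appears as a string anywhere inside value."""
    if isinstance(value, str):
        return value == target
    if isinstance(value, dict):
        return any(mentions(child, target) for child in value.values())
    if isinstance(value, (list, tuple)):
        return any(mentions(child, target) for child in value)
    return False


def scan_backrefs(records: dict[str, dict], target: str) -> list[str]:
    """List the keys of records that hold target, in insertion order."""
    return [uri for uri, record in records.items() if mentions(record, target)]


# The target is never loaded here, yet its inbound link is still found.
records = {
    "at://did:example:a/org.example.layer/1": {
        "expression": "at://did:example:a/org.example.expression/7"
    },
    "at://did:example:a/org.example.layer/2": {"expression": None},
}
found = scan_backrefs(records, "at://did:example:a/org.example.expression/7")
```

Because every call scans the whole mapping, the cost grows with the number of loaded records.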

backref_uris

backref_uris(target: str) -> list[str]

List the AT-URIs of the records that reference a target.

PARAMETER DESCRIPTION
target

The AT-URI of the referenced record.

TYPE: str

RETURNS DESCRIPTION
list of str

The AT-URIs of the referring records, in insertion order.

Repository

A wrapper over didactic.api.Repository for Layers records: a corpus snapshot is a commit and a named dataset version is a tag.

lairs.store.repository

didactic Repository wrapper: the on-disk schematic version control.

Wraps :class:didactic.api.Repository (panproto-backed, content-addressed, git-like) so a corpus snapshot is a commit and a named dataset version is a tag, giving reproducibility, provenance, and cheap diffing.

On the didactic repository, add stages a Model class (or a panproto.Schema) and records the structural schema, while add_data stages a record's value as committed data keyed by AT-URI and associated with that schema. lairs writes each record's value as JSON under records/, stages it as committed data under its AT-URI, and stages the record type's Model schema alongside it, so one commit captures both. A corpus snapshot is a single commit, and the values committed at a revision are read back through data_at under their AT-URI keys. A tag therefore pins an exact, byte-reproducible set of values, and a revision-to-revision diff compares the values folded over each revision's commit ancestry.

Removal is representable: forget stages a tombstone under a record's AT-URI, which the ancestry fold drops, so a record present at a base revision can be absent at a descendant head and a diff reports it in removed.
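The ancestry fold can be sketched as follows. This is a simplified model under stated assumptions, not the wrapper's code: each commit is represented as a dict of staged data, commits are listed oldest first, and TOMBSTONE is a hypothetical sentinel standing in for the staged tombstone.

```python
from __future__ import annotations

# Hypothetical sentinel standing in for a staged tombstone.
TOMBSTONE = object()


def fold_ancestry(commits: list[dict[str, object]]) -> dict[str, object]:
    """Fold per-commit staged data (oldest first) into one value set."""
    state: dict[str, object] = {}
    for staged in commits:
        for uri, value in staged.items():
            if value is TOMBSTONE:
                state.pop(uri, None)  # a tombstone drops the record
            else:
                state[uri] = value  # a later commit overrides an earlier one
    return state


first = "at://did:example:a/org.example.expression/1"
second = "at://did:example:a/org.example.expression/2"
history = [
    {first: {"text": "one"}, second: {"text": "two"}},
    {second: TOMBSTONE},
]
base_state = fold_ancestry(history[:1])  # state at the base revision
head_state = fold_ancestry(history)      # state at the descendant head
```

Here the second record is present at the base and absent at the head, which is the situation a diff reports under removed.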

didactic 0.9.0 exposes tag creation (create_tag and friends), the committed-data write (add_data), and the committed-data read (data_at) on the public Repository surface, all of which this wrapper uses directly.

RecordDiff

Bases: Model

A structural diff of the record set between two revisions.

ATTRIBUTE DESCRIPTION
added

AT-URIs present at the head revision but not the base revision.

TYPE: tuple of str

removed

AT-URIs present at the base revision but not the head revision.

TYPE: tuple of str

changed

AT-URIs present in both revisions whose stored value differs.

TYPE: tuple of str

Repository

Repository(inner: Repository)

A wrapper over a didactic Repository for Layers records.

The repository is created on disk with :meth:init or reopened with :meth:open; the constructor takes an already-open inner handle and is mostly for internal use.

PARAMETER DESCRIPTION
inner

An already-open didactic repository handle.

TYPE: Repository

ATTRIBUTE DESCRIPTION
inner

The wrapped didactic repository.

TYPE: Repository

path

The repository working directory.

TYPE: Path

init classmethod

init(path: Path) -> Repository

Initialise a new repository at path.

PARAMETER DESCRIPTION
path

Directory in which to create the repository.

TYPE: Path

RETURNS DESCRIPTION
Repository

A handle to the newly initialised repository.

open classmethod

open(path: Path) -> Repository

Open an existing repository at path.

PARAMETER DESCRIPTION
path

Directory containing an existing repository.

TYPE: Path

RETURNS DESCRIPTION
Repository

A handle to the existing repository.

save

save(uri: str, model: Model) -> None

Stage a record value and its schema for the next commit.

The record value is written as JSON into the working tree, indexed by its AT-URI, and staged as committed data so a revision pins the exact value. The record type's Model schema is staged alongside it, so the commit captures both the data and its structure.

PARAMETER DESCRIPTION
uri

The AT-URI of the record.

TYPE: str

model

The record value to persist.

TYPE: Model

forget

forget(uri: str) -> None

Stage a record's removal for the next commit.

Removes the AT-URI from the working-tree index and its JSON file, and stages a tombstone as committed data under the same key. The tombstone is folded by :meth:_state_at, so the record is absent from the reconstructed value set at the committed revision and a revision-to-revision :meth:diff reports it in removed.

Staging a tombstone requires a schema to bind to, so the repository must already have at least one commit (the record's type schema was committed when it was first saved).

PARAMETER DESCRIPTION
uri

The AT-URI of the record to remove.

TYPE: str

RAISES DESCRIPTION
KeyError

If the AT-URI is not present in the working tree.

staged_uris

staged_uris() -> list[str]

Return the AT-URIs currently present in the working tree.

RETURNS DESCRIPTION
list of str

The AT-URIs of records written to the working tree, sorted.

load

load(uri: str, model_cls: type[Model]) -> Model | None

Load a record value from the working tree by AT-URI.

PARAMETER DESCRIPTION
uri

The AT-URI of the record to load.

TYPE: str

model_cls

The Model class to validate the stored JSON against.

TYPE: type of didactic.api.Model

RETURNS DESCRIPTION
Model or None

The validated model, or None when the AT-URI is not stored.

load_raw

load_raw(uri: str) -> JsonValue | None

Load a stored record value as raw JSON by AT-URI.

PARAMETER DESCRIPTION
uri

The AT-URI of the record to load.

TYPE: str

RETURNS DESCRIPTION
JsonValue or None

The decoded JSON value, or None when the AT-URI is not stored.

commit

commit(
    message: str, *, author: str = _DEFAULT_AUTHOR
) -> str

Commit the staged records as a corpus snapshot.

PARAMETER DESCRIPTION
message

The commit message.

TYPE: str

author

The commit author, in conventional "Name <email>" form.

TYPE: str DEFAULT: _DEFAULT_AUTHOR

RETURNS DESCRIPTION
str

The new revision identifier.

head

head() -> str | None

Return the current head revision, or None for an empty repository.

RETURNS DESCRIPTION
str or None

The head commit id, or None when there are no commits yet.

log

log() -> list[dict[str, JsonValue]]

Return the commit log, newest first.

RETURNS DESCRIPTION
list of dict

One commit-record dict per commit, newest first.

tag

tag(name: str, *, revision: str | None = None) -> None

Tag a revision as a named dataset version.

A tag pins the exact record values committed at the revision, giving a reproducible named version.

PARAMETER DESCRIPTION
name

The tag name.

TYPE: str

revision

The revision to tag; defaults to the current head.

TYPE: str or None DEFAULT: None

RAISES DESCRIPTION
ValueError

If no revision is given and the repository has no commits.

tags

tags() -> list[tuple[str, str]]

Return the list of tags.

RETURNS DESCRIPTION
list of (str, str)

One (name, target_revision) pair per tag.

resolve

resolve(ref: str) -> str

Resolve a ref expression to a commit id.

PARAMETER DESCRIPTION
ref

A branch name, tag name, or commit-id prefix.

TYPE: str

RETURNS DESCRIPTION
str

The full commit id.

diff

diff(base: str, head: str) -> RecordDiff

Diff the committed record values between two revisions.

The value set committed at each revision is reconstructed from the committed data read with data_at, keyed by AT-URI, and the two sets are compared by content.

PARAMETER DESCRIPTION
base

The base revision (ref expression).

TYPE: str

head

The head revision (ref expression).

TYPE: str

RETURNS DESCRIPTION
RecordDiff

The added, removed, and changed AT-URIs between the revisions.
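Comparing two reconstructed value sets by content might look like the sketch below. The function name diff_value_sets is hypothetical, the byte values are made-up inputs, and sorting the AT-URIs is an assumption made here for a stable result.

```python
from __future__ import annotations


def diff_value_sets(
    base: dict[str, bytes], head: dict[str, bytes]
) -> dict[str, tuple[str, ...]]:
    """Compare two AT-URI-keyed value sets by content."""
    added = tuple(sorted(head.keys() - base.keys()))
    removed = tuple(sorted(base.keys() - head.keys()))
    changed = tuple(
        sorted(uri for uri in base.keys() & head.keys() if base[uri] != head[uri])
    )
    return {"added": added, "removed": removed, "changed": changed}


base = {
    "at://did:example:a/org.example.expression/1": b'{"text": "one"}',
    "at://did:example:a/org.example.expression/2": b'{"text": "two"}',
}
head = {
    "at://did:example:a/org.example.expression/1": b'{"text": "uno"}',
    "at://did:example:a/org.example.expression/3": b'{"text": "three"}',
}
result = diff_value_sets(base, head)
```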

content_at

content_at(ref: str) -> dict[str, JsonValue]

Return the committed record values at a revision, keyed by AT-URI.

Each value is the JSON-decoded record content folded over the revision's commit ancestry, so a record removed by a tombstone at or before ref is absent. This is the decoded counterpart to the raw byte state behind :meth:diff, suitable for a field-level comparison of a record across two revisions.

PARAMETER DESCRIPTION
ref

The revision to read (ref expression).

TYPE: str

RETURNS DESCRIPTION
dict of str to JsonValue

The decoded record values at the revision, keyed by AT-URI.
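Given two such mappings, a field-level comparison of one record could be sketched as below. The dicts are hand-written stand-ins for what content_at might return at two revisions, and changed_fields is a hypothetical helper that only looks at top-level fields.

```python
from __future__ import annotations


def changed_fields(old: dict, new: dict) -> dict[str, tuple[object, object]]:
    """Map each differing top-level field to its (old, new) pair.

    A field missing on one side is reported as None on that side.
    """
    keys = sorted(old.keys() | new.keys())
    return {
        key: (old.get(key), new.get(key))
        for key in keys
        if old.get(key) != new.get(key)
    }


uri = "at://did:example:a/org.example.expression/1"
base_content = {uri: {"text": "hello", "lang": "en"}}
head_content = {uri: {"text": "hello!", "lang": "en", "note": "edited"}}

delta = changed_fields(base_content[uri], head_content[uri])
```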

schema_diff

schema_diff(
    old: type[Model], new: type[Model]
) -> dict[str, JsonValue]

Compute a structural diff between two record-type schemas.

This wraps :func:didactic.api.diff, which compares two Model classes (for example two generated record types across a Layers version bump), not two revisions of the same record's values.

PARAMETER DESCRIPTION
old

The base schema class.

TYPE: type of didactic.api.Model

new

The head schema class.

TYPE: type of didactic.api.Model

RETURNS DESCRIPTION
dict

The structural schema diff.

Workspace

Workspace(repository: Repository)

A record-type-aware grouping over a :class:Repository.

Indexes the AT-URIs in a repository by their collection NSID so that per-record-type listing and history are cheap, mirroring the way a corpus is a graph of many record types.

PARAMETER DESCRIPTION
repository

The repository to index.

TYPE: Repository

ATTRIBUTE DESCRIPTION
repository

The wrapped repository.

TYPE: Repository

by_nsid

by_nsid() -> dict[str, list[str]]

Group the working-tree AT-URIs by collection NSID.

RETURNS DESCRIPTION
dict of str to list of str

A mapping from collection NSID to the sorted AT-URIs of that type.
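The grouping relies on the AT-URI layout at://authority/collection/rkey, where the collection segment is the NSID. A minimal sketch, with hypothetical helper names and example NSIDs:

```python
from __future__ import annotations

from collections import defaultdict


def collection_of(uri: str) -> str:
    """Return the collection NSID segment of an AT-URI."""
    if not uri.startswith("at://"):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len("at://"):].split("/")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"AT-URI has no collection segment: {uri!r}")
    return parts[1]


def group_by_nsid(uris: list[str]) -> dict[str, list[str]]:
    """Group AT-URIs by collection NSID, with sorted members and keys."""
    groups: dict[str, list[str]] = defaultdict(list)
    for uri in uris:
        groups[collection_of(uri)].append(uri)
    return {nsid: sorted(members) for nsid, members in sorted(groups.items())}


grouped = group_by_nsid(
    [
        "at://did:example:a/org.example.layer/2",
        "at://did:example:a/org.example.expression/1",
        "at://did:example:a/org.example.layer/1",
    ]
)
```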

nsids

nsids() -> list[str]

Return the collection NSIDs present in the workspace.

RETURNS DESCRIPTION
list of str

The distinct collection NSIDs, sorted.

uris_of

uris_of(nsid: str) -> list[str]

Return the AT-URIs of records of a given collection NSID.

PARAMETER DESCRIPTION
nsid

The collection NSID to select.

TYPE: str

RETURNS DESCRIPTION
list of str

The sorted AT-URIs of that record type.

Arrow views

Derived, rebuildable columnar views with anchors flattened into typed columns. These are never the source of truth and can be regenerated from the record store with materialize.

lairs.store.arrow

Arrow/Parquet materialized views over the record store.

Derived, rebuildable columnar views for ML-speed access, with anchors flattened into typed columns. These views are never the source of truth: they are computed from the record store and can always be regenerated with :func:materialize.

The flattening is driven by generic model field access (the record's JSON dump, model_dump_json), so it works against the abstract :class:didactic.api.Model interface today and against the real generated record models once they land. Polymorphic anchors are resolved into a fixed set of typed columns (anchor_kind, byte_start, byte_end, token_id, token_index, token_indexes, t_start_ms, t_end_ms, bbox_x, bbox_y, bbox_w, bbox_h, page, ext_source) so a consumer can filter and scan without re-dispatching the union per row.

Every one of the seven generated :class:~lairs.records._generated.defs.Anchor variants (textSpan, tokenRef, tokenRefSequence, temporalSpan, spatioTemporalAnchor, pageAnchor, externalTarget) projects to a distinguishable anchor_kind so no variant collapses into an anchorless row.

The view set mirrors the appview's normalization: an expressions table (one row per expression), an annotations table (one row per (layer_uri, annotation_index) produced by exploding each layer's annotations array), plus segmentations, media, and edges tables. Only top-level scalar fields and the flattened anchor columns reach these views; non-anchor nested arrays and objects (for example features or per-keyframe data) are intentionally dropped by the flatten-to-typed-columns boundary (see :func:_scalar_columns).

RecordLike

Bases: Protocol

The minimal record shape the Arrow views consume.

Any :class:didactic.api.Model satisfies this protocol; the views never depend on a concrete generated type, only on the ability to dump a record to a JSON string. The JSON form is used (rather than the shallow model_dump) so nested models, tuples, and union members all normalise to plain JSON containers that the flattening can descend into uniformly.
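A structural protocol of this shape can be sketched with typing.Protocol. The names RecordLikeSketch, FakeRecord, and to_plain are hypothetical; the point is only that any object with a model_dump_json method qualifies, and that a JSON round trip turns tuples into plain lists.

```python
import json
from typing import Protocol


class RecordLikeSketch(Protocol):
    """Anything that can dump itself to a JSON string."""

    def model_dump_json(self) -> str: ...


class FakeRecord:
    """A stand-in record that is unrelated to any real model class."""

    def __init__(self, **fields: object) -> None:
        self._fields = fields

    def model_dump_json(self) -> str:
        return json.dumps(self._fields)


def to_plain(record: RecordLikeSketch) -> dict:
    # Decoding the JSON form yields only dicts, lists, and scalars.
    return json.loads(record.model_dump_json())


plain = to_plain(FakeRecord(text="hi", span=(0, 2)))
```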

model_dump_json

model_dump_json() -> str

Return the record's fields as a JSON string.

flatten_anchor

flatten_anchor(
    anchor: Mapping[str, JsonValue] | None,
) -> dict[str, JsonValue]

Flatten a polymorphic anchor mapping into typed columns.

Recognises every :class:~lairs.records._generated.defs.Anchor variant by the fields it carries and projects each into the fixed :data:ANCHOR_COLUMNS, leaving unrelated columns unset. Unknown or absent anchors yield an all-None row with an anchor_kind of None, so the resulting column set is uniform across rows regardless of which anchor variant each record uses.

The seven recognised variants map to these anchor_kind values:

  • textSpan -> "span" (byte_start, byte_end)
  • tokenRef -> "tokenRef" (token_id, token_index)
  • tokenRefSequence -> "tokenRefSequence" (token_id, token_indexes)
  • temporalSpan -> "temporalSpan" (t_start_ms, t_end_ms)
  • spatioTemporalAnchor -> "spatioTemporalAnchor" (t_start_ms, t_end_ms)
  • pageAnchor -> "pageAnchor" (page, nested byte_start / byte_end and bbox_*)
  • externalTarget -> "externalTarget" (ext_source)

PARAMETER DESCRIPTION
anchor

A dumped anchor value, or None when the record has no anchor.

TYPE: Mapping or None

RETURNS DESCRIPTION
dict

A mapping over :data:ANCHOR_COLUMNS with the recognised fields filled.
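The uniform-row idea can be sketched as below for two of the variants. The column names come from the module description, but the input keys (byteStart, byteEnd, source) are assumptions made for this illustration; the dumped field names of the real Anchor variants are not stated here, and flatten_anchor_sketch is a hypothetical name.

```python
from __future__ import annotations

ANCHOR_COLUMNS = (
    "anchor_kind", "byte_start", "byte_end", "token_id", "token_index",
    "token_indexes", "t_start_ms", "t_end_ms", "bbox_x", "bbox_y",
    "bbox_w", "bbox_h", "page", "ext_source",
)


def flatten_anchor_sketch(anchor: dict | None) -> dict[str, object]:
    """Project an anchor mapping onto a fixed, uniform set of columns."""
    row: dict[str, object] = dict.fromkeys(ANCHOR_COLUMNS)  # all None
    if not anchor:
        return row
    # Assumed input keys; the variant is recognised by the fields it carries.
    if "byteStart" in anchor and "byteEnd" in anchor:
        row.update(
            anchor_kind="span",
            byte_start=anchor["byteStart"],
            byte_end=anchor["byteEnd"],
        )
    elif "source" in anchor:
        row.update(anchor_kind="externalTarget", ext_source=anchor["source"])
    return row


span_row = flatten_anchor_sketch({"byteStart": 0, "byteEnd": 5})
empty_row = flatten_anchor_sketch(None)
```

Every row carries the same keys, so rows built from different variants can share one table schema.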

records_to_table

records_to_table(records: Iterable[RecordLike]) -> Table

Flatten a sequence of records into an Arrow table.

Each record's scalar fields become columns, and any anchor field is expanded into the typed :data:ANCHOR_COLUMNS. The column union across all records is used, with missing values filled as None, so heterogeneous records share one schema.

PARAMETER DESCRIPTION
records

The records to flatten; anchors become typed columns.

TYPE: collections.abc.Iterable of RecordLike

RETURNS DESCRIPTION
Table

The flattened columnar view.
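The column-union step can be shown without Arrow at all. The sketch below, using a hypothetical rows_to_columns helper, turns heterogeneous row dicts into a column mapping with None for missing values.

```python
from __future__ import annotations


def rows_to_columns(rows: list[dict[str, object]]) -> dict[str, list[object]]:
    """Build a column mapping over the union of keys, filling gaps with None."""
    names: dict[str, None] = {}  # ordered set of column names, first seen first
    for row in rows:
        for name in row:
            names.setdefault(name, None)
    return {name: [row.get(name) for row in rows] for name in names}


columns = rows_to_columns(
    [
        {"text": "a", "byte_start": 0},
        {"page": 3},
    ]
)
```

A mapping from column name to equal-length lists is the kind of input that pyarrow.Table.from_pydict accepts, which is one plausible way to finish the conversion.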

expressions_table

expressions_table(records: Iterable[RecordLike]) -> Table

Build the expressions view: one row per expression record.

PARAMETER DESCRIPTION
records

The expression records.

TYPE: collections.abc.Iterable of RecordLike

RETURNS DESCRIPTION
Table

One row per expression, anchors flattened into typed columns.

annotations_table

annotations_table(
    layers: Iterable[tuple[str, RecordLike]],
) -> Table

Build the annotations view by exploding each layer's annotations array.

Produces one row per (layer_uri, annotation_index), mirroring the appview's PG normalization. Each annotation's scalar fields become columns, its anchor is flattened into the typed columns, and layer_uri plus annotation_index identify the source.

PARAMETER DESCRIPTION
layers

Pairs of layer AT-URI and the layer record. The record is expected to carry an annotations array; layers without one contribute no rows.

TYPE: collections.abc.Iterable of (str, RecordLike)

RETURNS DESCRIPTION
Table

One row per exploded annotation.
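The explode step can be sketched over plain dicts as follows. This is an illustration with hypothetical names and example data; anchor flattening is left out, and only scalar annotation fields are kept.

```python
from __future__ import annotations


def explode_annotations(layers: list[tuple[str, dict]]) -> list[dict[str, object]]:
    """Produce one row per (layer_uri, annotation_index)."""
    rows: list[dict[str, object]] = []
    for layer_uri, layer in layers:
        # A layer without an annotations array contributes no rows.
        for index, annotation in enumerate(layer.get("annotations") or []):
            row: dict[str, object] = {
                "layer_uri": layer_uri,
                "annotation_index": index,
            }
            for key, value in annotation.items():
                # Keep scalars only; nested arrays and objects are dropped.
                if value is None or isinstance(value, (str, int, float, bool)):
                    row[key] = value
            rows.append(row)
    return rows


layer_one = "at://did:example:a/org.example.layer/1"
rows = explode_annotations(
    [
        (layer_one, {"annotations": [
            {"label": "NOUN", "features": {"number": "sg"}},
            {"label": "VERB"},
        ]}),
        ("at://did:example:a/org.example.layer/2", {"title": "no annotations"}),
    ]
)
```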

materialize

materialize(
    repo: Repository,
    out_dir: Path,
    *,
    views: Mapping[str, Table] | None = None,
) -> list[Path]

Materialize named Arrow views into Parquet files.

The views are derived, rebuildable outputs and never the source of truth. When views is omitted the repository's record store is read and grouped by collection NSID, with each NSID written as its own Parquet file; callers that have already built the normalized expressions / annotations / segmentations / media / edges tables can pass them explicitly.

Each view is written as a single <name>.parquet file with pyarrow's default write options. Compression, row-group sizing, and partitioning are not parameterized; a caller needing those should build the tables and write them with :func:pyarrow.parquet.write_table directly, because materialize returns only the written paths.

PARAMETER DESCRIPTION
repo

The repository whose record store is materialized when views is not supplied.

TYPE: Repository

out_dir

The output directory for the Parquet views; created if absent.

TYPE: Path

views

Pre-built named views to write. When None the views are derived from the repository working tree.

TYPE: collections.abc.Mapping of str to pyarrow.Table or None DEFAULT: None

RETURNS DESCRIPTION
list of pathlib.Path

The written Parquet files, in name order.

Blob cache

A content-addressed on-disk cache of blob bytes, keyed by CID.

lairs.store.blobcache

Content-addressed blob cache.

Caches blob bytes on disk under blobs/<cid>, populated lazily by the media layer and shared across corpora. The cache is content-addressed: a blob's CID is its file name, so identical content stored under the same CID is deduplicated and put is idempotent. When the key is a decodable CID, :meth:BlobCache.put verifies that it actually addresses the bytes (the multihash digest matches), so a wrong CID can never poison the cache. A key that is not a CID (for example an external URI used by the media layer to cache fetched bytes) is treated as an opaque, trusted cache key and is stored without a digest check. Writes are atomic so a crash mid-write cannot leave a truncated blob that :meth:BlobCache.exists would report as present.

BlobCacheError

Bases: ValueError

Raised when a blob operation violates the cache's content-addressing.

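Raised by :meth:BlobCache.path_for and :meth:BlobCache.put when the key is not a safe single path component (for example, it contains path separators that would escape the blobs directory), and by :meth:BlobCache.put when a key that is a decodable CID has a multihash digest that does not match the supplied bytes.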

BlobCache

BlobCache(root: Path)

A content-addressed on-disk cache of blob bytes.

PARAMETER DESCRIPTION
root

The cache root directory. Blobs are stored under root/blobs/<cid>.

TYPE: Path

ATTRIBUTE DESCRIPTION
root

The cache root directory.

TYPE: Path

path_for

path_for(cid: str) -> Path

Return the on-disk path a blob would occupy for a CID.

The path is returned whether or not the blob is present, so the media layer can stream bytes straight to it.

PARAMETER DESCRIPTION
cid

The content identifier.

TYPE: str

RETURNS DESCRIPTION
Path

The path root/blobs/<cid>.

RAISES DESCRIPTION
BlobCacheError

If the CID is not a safe single path component.
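One way to check that a key is a safe single path component is sketched below. The exact rules lairs applies are not stated here, so the checks, the blob_path helper, and the UnsafeKeyError class are all assumptions made for illustration.

```python
from pathlib import Path


class UnsafeKeyError(ValueError):
    """Hypothetical error for keys that are not a single path component."""


def blob_path(root: Path, key: str) -> Path:
    """Return root/blobs/<key>, rejecting keys that could escape the directory."""
    unsafe = (
        not key
        or key in {".", ".."}
        or "/" in key
        or "\\" in key
        or "\x00" in key
    )
    if unsafe:
        raise UnsafeKeyError(f"unsafe cache key: {key!r}")
    return root / "blobs" / key


safe = blob_path(Path("cache"), "bafyexamplekey")

try:
    blob_path(Path("cache"), "../escape")
    rejected = False
except UnsafeKeyError:
    rejected = True
```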

exists

exists(cid: str) -> bool

Return True if a blob for cid is cached.

PARAMETER DESCRIPTION
cid

The content identifier.

TYPE: str

RETURNS DESCRIPTION
bool

Whether the blob is present on disk.

get

get(cid: str) -> bytes | None

Return cached bytes for a CID, or None if absent.

PARAMETER DESCRIPTION
cid

The content identifier.

TYPE: str

RETURNS DESCRIPTION
bytes or None

The cached bytes, or None when the blob is not cached.

put

put(cid: str, data: bytes, *, verify: bool = True) -> Path

Store bytes under a cache key.

Storing the same key again overwrites the existing file with identical content, so put is idempotent for content-addressed input. When the key is a decodable CID it is verified against the bytes (its multihash digest must match) so a wrong CID cannot poison the cache; a non-CID key (for example an external URI) is stored as an opaque trusted key. The write is atomic (a temp file in the same directory is renamed onto the final path) so a crash mid-write cannot leave a truncated blob.

PARAMETER DESCRIPTION
cid

The cache key: a content identifier, or an opaque trusted key.

TYPE: str

data

The bytes to cache.

TYPE: bytes

verify

When True (the default), a CID key's multihash digest is checked against data before storing. Pass False to skip the check when the key has already been verified.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Path

The path the bytes were written to.

RAISES DESCRIPTION
BlobCacheError

If the key is not a safe path component, or (when verify) is a decodable CID that does not address data.
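The temp-file-and-rename pattern mentioned for put can be sketched with the standard library. The atomic_write helper is a hypothetical illustration of the pattern, not the cache's own code, and it leaves out CID verification.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> Path:
    """Write data to path via a temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "blobs" / "example-key"
    atomic_write(target, b"payload")
    atomic_write(target, b"payload")  # repeating the write is harmless
    stored = target.read_bytes()
    leftovers = sorted(entry.name for entry in target.parent.iterdir())
```

Keeping the temp file in the destination directory means the rename stays on one filesystem, so a reader sees either no file or the complete file.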