Dataset API

The corpus surface, the lazy dataset, and the feature description derived from the generated models. A Corpus joins Layers records by AT-URI and exposes them as Dataset views. Features reads a dataset's columnar schema off the model field specs. For usage see Guides > The dataset API.

Corpus

lairs.data.corpus

The corpus surface: a graph of records joined by AT-URIs.

A Corpus exposes dataset views (expressions, annotation layers) over the graph of Layers records, plus authoring and persistence entry points. The graph is held in a lairs.store.pool.ModelPool keyed by AT-URI, so cross-references (an annotation layer's expression, an expression's mediaRef, a segmentation's expression) resolve to model instances. The join helpers walk those references to group related records per expression.

Membership records (pub.layers.corpus.membership) tie an expression to a corpus via corpusRef and carry an optional split slug. When the pool holds membership records for this corpus, the expression views and joins are restricted to the expressions those memberships reference, so loading one corpus from an authority that hosts several does not pull in the other corpora's expressions. When no membership records are present (for example, in a freshly authored corpus built only through the add_* helpers), every pooled expression is treated as a member, which keeps direct authoring ergonomic.
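
The scoping rule above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the Membership stand-in, its expression_ref field name, and the member_expression_uris helper are hypothetical, and records are reduced to their AT-URIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Membership:
    """Hypothetical stand-in for a pub.layers.corpus.membership record."""

    corpus_ref: str
    expression_ref: str  # assumed field name for the expression AT-URI
    split: Optional[str] = None


def member_expression_uris(expression_uris, memberships, corpus_uri=None):
    """Return the member expression AT-URIs, in pool order."""
    # Keep only the memberships that point at this corpus (all of them
    # when the corpus has no AT-URI).
    own = [
        m for m in memberships
        if corpus_uri is None or m.corpus_ref == corpus_uri
    ]
    if not own:
        # No membership records: every pooled expression is a member.
        return list(expression_uris)
    wanted = {m.expression_ref for m in own}
    return [uri for uri in expression_uris if uri in wanted]
```

With memberships for two corpora in the same pool, asking for one corpus yields only its own expressions; with no memberships at all, every pooled expression is returned.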

Loading dispatches on a source. The pds source enumerates the relevant collections of an authority's repository through a PDS client; the appview source uses the appview query API; auto prefers the appview and falls back to the PDS. Only the pds source is usable at present (see load_corpus for the current limits on appview and auto). A client may be injected for testing without network access.

The pub.layers.corpus.Corpus record model is imported qualified as corpus_records.Corpus to avoid clashing with the dataset-surface Corpus class defined here.

ExpressionWithAnnotations

Bases: Model

An expression joined to its annotation layers.

ATTRIBUTE DESCRIPTION
expression

The expression record.

TYPE: Expression

uri

The AT-URI of the expression.

TYPE: str

annotation_layers

The annotation layers whose expression ref points at this one.

TYPE: tuple of pub.layers.annotation.AnnotationLayer

ExpressionWithMedia

Bases: Model

An expression joined to its media record.

ATTRIBUTE DESCRIPTION
expression

The expression record.

TYPE: Expression

uri

The AT-URI of the expression.

TYPE: str

media

The media record referenced by the expression's mediaRef, if loaded.

TYPE: Media or None

ExpressionWithSegmentation

Bases: Model

An expression joined to its segmentation records.

ATTRIBUTE DESCRIPTION
expression

The expression record.

TYPE: Expression

uri

The AT-URI of the expression.

TYPE: str

segmentations

The segmentations whose expression ref points at this one.

TYPE: tuple of pub.layers.segmentation.Segmentation

Corpus

Corpus(
    pool: ModelPool | None = None, *, uri: str | None = None
)

A graph of Layers records joined by AT-URI cross-references.

PARAMETER DESCRIPTION
pool

A pre-populated pool of records keyed by AT-URI. When omitted an empty pool is created and records may be added through the authoring helpers.

TYPE: ModelPool or None DEFAULT: None

uri

The AT-URI of the backing corpus record, when the corpus was loaded from one.

TYPE: str or None DEFAULT: None

ATTRIBUTE DESCRIPTION
pool

The AT-URI-keyed record graph.

TYPE: ModelPool

uri

The corpus record AT-URI, if any.

TYPE: str or None

expressions property

expressions: Dataset[Expression]

Return a dataset of the corpus member expressions.

When the pool holds membership records for this corpus only the expressions those memberships reference are returned; otherwise every pooled expression is returned.

RETURNS DESCRIPTION
Dataset

A dataset of expression models, in pool order.

corpus_record property

corpus_record: Corpus | None

Return the backing corpus record, if one is loaded.

The record is looked up in the pool at uri; it is None when the corpus has no AT-URI or when no corpus record was loaded for it.

RETURNS DESCRIPTION
Corpus or None

The backing corpus record, or None when absent.

new classmethod

new(uri: str | None = None) -> Corpus

Create an empty corpus for authoring.

PARAMETER DESCRIPTION
uri

An AT-URI to associate with the corpus record.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
Corpus

A new, empty corpus.

expression_uris

expression_uris() -> list[str]

Return the AT-URIs of the corpus member expressions.

RETURNS DESCRIPTION
list of str

The member expression AT-URIs, in pool order.

annotation_layers

annotation_layers(
    *, kind: str | None = None, subkind: str | None = None
) -> Dataset[AnnotationLayer]

Return a dataset of annotation layers, optionally filtered.

PARAMETER DESCRIPTION
kind

An annotation-layer kind filter (for example "token-tag").

TYPE: str or None DEFAULT: None

subkind

An annotation-layer subkind filter (for example "pos").

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

A dataset of annotation-layer models matching the filters.

segmentations

segmentations() -> Dataset[Segmentation]

Return a dataset of the corpus segmentations.

RETURNS DESCRIPTION
Dataset

A dataset of segmentation models, in pool order.

media

media() -> Dataset[Media]

Return a dataset of the corpus media records.

RETURNS DESCRIPTION
Dataset

A dataset of media models, in pool order.

memberships

memberships() -> Dataset[Membership]

Return a dataset of the corpus membership records.

Each membership ties an expression to this corpus via corpusRef and may carry a split slug and an ordinal. When uri is set, only the memberships whose corpusRef equals it are returned.

RETURNS DESCRIPTION
Dataset

A dataset of membership models, in pool order.

split

split(name: str) -> Dataset[Expression]

Return the corpus member expressions assigned to a named split.

Expressions are joined to their membership records by AT-URI and kept when a membership's split slug equals name (for example "train", "dev", "test", or "unlabeled"). An expression with several memberships is included when any of them carries the split.

PARAMETER DESCRIPTION
name

The split slug to select.

TYPE: str

RETURNS DESCRIPTION
Dataset

A dataset of the expression models in that split, in pool order.

splits

splits() -> tuple[str, ...]

Return the distinct split slugs present in the corpus memberships.

RETURNS DESCRIPTION
tuple of str

The split slugs, sorted, excluding memberships with no split.
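
The split and splits behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: memberships are reduced to hypothetical (expression AT-URI, split slug) pairs, and select_split and split_names are made-up helper names.

```python
def select_split(expression_uris, memberships, name):
    """Return the expression URIs assigned to split `name`, in pool order.

    `memberships` is an iterable of (expression_uri, split_slug_or_None)
    pairs standing in for membership records.
    """
    # An expression qualifies when any of its memberships carries the split.
    in_split = {uri for uri, slug in memberships if slug == name}
    return [uri for uri in expression_uris if uri in in_split]


def split_names(memberships):
    """Return the distinct split slugs, sorted, ignoring missing splits."""
    return tuple(sorted({slug for _, slug in memberships if slug is not None}))
```

An expression with one membership in "train" and another in "test" appears in both selections, while a membership without a split contributes nothing to split_names.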

add_membership

add_membership(uri: str, membership: Membership) -> None

Add a membership record to the corpus graph.

PARAMETER DESCRIPTION
uri

The AT-URI of the membership record.

TYPE: str

membership

The membership record binding an expression to a corpus.

TYPE: Membership

with_annotations

with_annotations() -> Dataset[ExpressionWithAnnotations]

Join each expression to the annotation layers that target it.

Annotation layers carry an expression AT-URI; this groups the layers by that ref and attaches them to the matching expression. Expressions with no layers still appear, with an empty group.

RETURNS DESCRIPTION
Dataset

A dataset of expression-and-annotations join rows.
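
The grouping step can be illustrated with a small stand-alone sketch. It assumes layers are given as hypothetical (layer AT-URI, expression AT-URI) pairs and uses a made-up join_annotations helper; the real join rows are ExpressionWithAnnotations models.

```python
from collections import defaultdict


def join_annotations(expression_uris, layers):
    """Group layer URIs by the expression they target.

    `layers` is an iterable of (layer_uri, expression_uri) pairs.
    Returns one (expression_uri, layer_uris) row per expression.
    """
    grouped = defaultdict(list)
    for layer_uri, expression_uri in layers:
        grouped[expression_uri].append(layer_uri)
    # Every expression gets a row, even when no layer points at it.
    return [(uri, tuple(grouped.get(uri, ()))) for uri in expression_uris]
```

Layers that target an expression outside the corpus are simply never attached to a row.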

with_media

with_media() -> Dataset[ExpressionWithMedia]

Join each expression to the media record it references.

An expression's mediaRef AT-URI is resolved through the pool; when the media record is not loaded the join row carries None.

RETURNS DESCRIPTION
Dataset

A dataset of expression-and-media join rows.
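
As a rough sketch of the resolution rule, assuming the pool behaves like a dictionary keyed by AT-URI and that expressions are reduced to hypothetical (uri, media_ref) pairs:

```python
def join_media(expressions, pool):
    """Resolve each expression's media reference through the pool.

    `expressions` is an iterable of (uri, media_ref_or_None) pairs and
    `pool` maps AT-URIs to loaded records.
    """
    rows = []
    for uri, media_ref in expressions:
        # A missing reference or an unloaded record both yield None.
        media = pool.get(media_ref) if media_ref is not None else None
        rows.append((uri, media))
    return rows
```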

with_segmentation

with_segmentation() -> Dataset[ExpressionWithSegmentation]

Join each expression to the segmentations that target it.

Segmentations carry an expression AT-URI; this groups them by that ref and attaches them to the matching expression.

RETURNS DESCRIPTION
Dataset

A dataset of expression-and-segmentation join rows.

add_expression

add_expression(uri: str, expression: Expression) -> None

Add an expression record to the corpus graph.

PARAMETER DESCRIPTION
uri

The AT-URI of the expression.

TYPE: str

expression

The expression record to add.

TYPE: Expression

add_annotation_layer

add_annotation_layer(
    uri: str, layer: AnnotationLayer
) -> None

Add an annotation layer record to the corpus graph.

PARAMETER DESCRIPTION
uri

The AT-URI of the annotation layer.

TYPE: str

layer

The annotation layer record to add.

TYPE: AnnotationLayer

add_record

add_record(uri: str, record: Model) -> None

Add any Layers record to the corpus graph by AT-URI.

PARAMETER DESCRIPTION
uri

The AT-URI of the record.

TYPE: str

record

The record to add (expression, layer, segmentation, media, etc.).

TYPE: Model

save_to_repo

save_to_repo(path: Path) -> str

Persist the corpus graph to a didactic Repository and commit.

Delegates to the store's lairs.store.repository.Repository, staging every record under its AT-URI and committing a single snapshot.

PARAMETER DESCRIPTION
path

The repository directory to initialise or reuse.

TYPE: Path

RETURNS DESCRIPTION
str

The new commit revision identifier.

materialize

materialize(out_dir: Path) -> list[Path]

Materialize the corpus to Parquet views.

Builds the normalized expressions and annotations Arrow views from the graph and delegates writing to the store's lairs.store.arrow.materialize function. The expressions view holds the corpus member expressions only (see the expressions property).

PARAMETER DESCRIPTION
out_dir

The output directory for the views.

TYPE: Path

RETURNS DESCRIPTION
list of pathlib.Path

The written view files, in name order.

load_corpus

load_corpus(
    uri: str,
    *,
    source: str = "auto",
    cache_dir: str | None = None,
    revision: str | None = None,
    pds_client: PdsClient | None = None,
) -> Corpus

Load a corpus by AT-URI from a PDS or the appview.

The loader enumerates the Layers record collections of the AT-URI's authority and builds the joined graph. The corpus's expression views and joins are then scoped to the expressions reachable through membership records whose corpusRef matches uri, so an authority that hosts several corpora yields only this corpus's members. The pds source reads directly from a PDS. The appview and auto sources are not yet implemented because they need an appview client, so loading currently requires the pds source with an injected client.

PARAMETER DESCRIPTION
uri

The corpus AT-URI (its authority is enumerated).

TYPE: str

source

The source to load from ("pds", "appview", or "auto").

TYPE: str DEFAULT: 'auto'

cache_dir

A local cache directory (reserved; not yet used).

TYPE: str or None DEFAULT: None

revision

A revision (Repository tag) to resolve (reserved; not yet used).

TYPE: str or None DEFAULT: None

pds_client

An injected PDS client. Required for the pds source; supplying it avoids network setup in tests.

TYPE: PdsClient or None DEFAULT: None

RETURNS DESCRIPTION
Corpus

The loaded corpus.

RAISES DESCRIPTION
ValueError

When source is not a recognised source value.

NotImplementedError

When the appview source is requested without an appview client, or the PDS source is requested without an injected client.
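
The error behaviour listed above can be sketched as a small validation function. This is one reading of the documented rules, not the loader's actual code: resolve_source is a hypothetical helper, and treating auto the same as appview is an assumption based on the statement that both currently require the pds source.

```python
VALID_SOURCES = ("pds", "appview", "auto")


def resolve_source(source="auto", pds_client=None):
    """Validate a source value the way the RAISES section describes."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unrecognised source: {source!r}")
    if source != "pds":
        # Assumption: neither appview nor auto can proceed without an
        # appview client, which is not available yet.
        raise NotImplementedError(f"{source!r} needs an appview client")
    if pds_client is None:
        raise NotImplementedError("the pds source needs an injected pds_client")
    return source
```

In tests, any object standing in for the client is enough to pass validation, for example resolve_source("pds", pds_client=object()).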

Dataset

lairs.data.dataset

HuggingFace-like dataset over generated record models.

A Dataset is a lazy, optionally streaming sequence of generated model instances, with map and materialization helpers. It is generic over the model type it yields so indexing and iteration are precisely typed.

The dataset is lazy by default: it holds a source that produces model instances on demand, plus an optional chain of per-record transforms applied as records flow through. Two source shapes are supported. An in-memory source wraps a concrete tuple of models and supports random access and len. A streaming source wraps a zero-argument factory that returns a fresh iterator of models (for example a PDS cursor or a repository scan); it has no length and no random access until it is drained.
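
The two source shapes can be illustrated with a toy class. LazyRecords is a hypothetical stand-in written for this sketch, not the Dataset class itself; in particular, the exception types it raises are assumptions.

```python
from typing import Callable, Generic, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


class LazyRecords(Generic[T]):
    """Toy model of an in-memory or streaming record source."""

    def __init__(
        self,
        records: Optional[Sequence[T]] = None,
        *,
        source: Optional[Callable[[], Iterator[T]]] = None,
    ) -> None:
        if records is not None and source is not None:
            raise ValueError("records and source are mutually exclusive")
        self._source = source
        # In-memory shape: a concrete tuple; streaming shape: no tuple.
        self._records = None if source is not None else tuple(records or ())

    @property
    def is_streaming(self) -> bool:
        return self._source is not None

    def __iter__(self) -> Iterator[T]:
        # A streaming source is re-invoked for a fresh iterator each time.
        if self._source is not None:
            return self._source()
        return iter(self._records)

    def __len__(self) -> int:
        if self._records is None:
            raise TypeError("a streaming source has no length")
        return len(self._records)

    def __getitem__(self, index: int) -> T:
        if self._records is None:
            raise TypeError("a streaming source has no random access")
        return self._records[index]
```

Because the streaming shape holds a factory rather than an iterator, iterating twice yields the records twice instead of exhausting a one-shot iterator.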

Dataset

Dataset(
    records: Sequence[ModelT] | None = None,
    *,
    model: type[ModelT] | None = None,
    source: Callable[[], Iterator[ModelT]] | None = None,
)

A lazy dataset of generated record models of one type.

The dataset is generic over ModelT, the model type it yields, so indexing and iteration are precisely typed rather than widened. A dataset is constructed either from an in-memory tuple of records (the default, and the form that random access and len require) or from a streaming factory.

PARAMETER DESCRIPTION
records

The in-memory records the dataset yields. Mutually exclusive with source; when both are omitted the dataset is empty.

TYPE: collections.abc.Sequence of ModelT or None DEFAULT: None

model

The model type the dataset yields. Required for an empty or streaming dataset so that features can be derived; inferred from the first record otherwise.

TYPE: type of ModelT or None DEFAULT: None

source

A zero-argument factory returning a fresh iterator of records for a streaming dataset. Mutually exclusive with records.

TYPE: Callable or None DEFAULT: None

is_streaming property

is_streaming: bool

Return whether the dataset is backed by a streaming source.

RETURNS DESCRIPTION
bool

True when the dataset pulls lazily and has no random access.

features property

features: Features

Return the dataset schema derived from the model.

RETURNS DESCRIPTION
Features

The feature description for the dataset's model type.

streaming classmethod

streaming(
    source: Callable[[], Iterator[ModelT]],
    *,
    model: type[ModelT],
) -> Dataset[ModelT]

Build a streaming dataset from an iterator factory.

A streaming dataset pulls records lazily from source and does not hold the whole collection in memory until a materializing call (for example to_arrow) drains it.

PARAMETER DESCRIPTION
source

A zero-argument factory returning a fresh iterator of records.

TYPE: Callable

model

The model type the stream yields, used to derive features.

TYPE: type of ModelT

RETURNS DESCRIPTION
Dataset

A streaming dataset over the source.

iter

iter(batch_size: int = 1) -> Iterator[tuple[ModelT, ...]]

Iterate over the dataset in batches.

PARAMETER DESCRIPTION
batch_size

The number of records per batch. The final batch may be smaller.

TYPE: int DEFAULT: 1

YIELDS DESCRIPTION
tuple of ModelT

Successive batches of records.

RAISES DESCRIPTION
ValueError

When batch_size is not positive.
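
The batching contract (fixed-size tuples, a possibly smaller final batch, and a ValueError for a non-positive size) can be sketched with itertools.islice. The iter_batches function below is a hypothetical stand-alone helper, not the method itself.

```python
from itertools import islice


def iter_batches(records, batch_size=1):
    """Yield successive tuples of at most `batch_size` records."""
    # Because this is a generator, the check runs on first iteration.
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For five records and a batch size of two, this yields two full batches followed by a final batch of one.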

map

map(
    fn: Callable[[ModelT], ModelT],
    *,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Apply a lazy per-record transform.

The transform is not applied eagerly: it is composed onto the dataset's source so it runs as records flow through a later iteration or materialization. The result preserves the source's laziness and streaming behaviour.

This is strictly per-record. There is no batched mode that hands the callable a batch, because the transform signature is fixed to one record in and one record out; group the records yourself with iter when a batch view is needed.

PARAMETER DESCRIPTION
fn

The per-record transform mapping a model to a model.

TYPE: Callable

model

The model type the transformed dataset yields. Defaults to this dataset's model type; supply it when the transform changes the feature shape and the new shape must be derivable.

TYPE: type of ModelT or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

A new lazy dataset with the transform applied.
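
Lazy composition onto a source can be sketched as wrapping one iterator factory in another. The lazy_map helper is hypothetical and operates on bare zero-argument factories rather than on Dataset objects.

```python
def lazy_map(source, fn):
    """Compose `fn` onto a zero-argument iterator factory.

    Nothing runs until the returned factory is called and iterated.
    """
    def mapped():
        # The generator expression applies fn one record at a time.
        return (fn(record) for record in source())

    return mapped
```

Calling lazy_map does no work by itself; fn runs only as a consumer pulls records from the iterator that the returned factory produces.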

map_batched

map_batched(
    fn: Callable[[Sequence[ModelT]], Iterable[ModelT]],
    *,
    batch_size: int = 1000,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Apply a lazy transform over batches of records.

Unlike map, the callable receives a batch (a sequence of records) and returns an iterable of records, so a transform can add, drop, or reshape records across a batch (the HuggingFace map(batched=True) affordance). The transform is composed lazily onto the source and preserves streaming behaviour; the output record count need not match the input count.

PARAMETER DESCRIPTION
fn

The batch transform mapping a sequence of records to an iterable of records.

TYPE: Callable

batch_size

The number of records handed to fn per call. The final batch may be smaller.

TYPE: int DEFAULT: 1000

model

The model type the transformed dataset yields. Defaults to this dataset's model type.

TYPE: type of ModelT or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

A new lazy dataset with the batch transform applied.

RAISES DESCRIPTION
ValueError

When batch_size is not positive.
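
A batch transform that changes the record count can be sketched in the same factory style. The lazy_map_batched helper is hypothetical; it only mirrors the documented contract of batches in, iterables out.

```python
from itertools import islice


def lazy_map_batched(source, fn, batch_size=1000):
    """Compose a batch transform onto a zero-argument iterator factory."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    def mapped():
        iterator = source()
        while True:
            batch = tuple(islice(iterator, batch_size))
            if not batch:
                return
            # fn may return more or fewer records than it received.
            yield from fn(batch)

    return mapped
```

For example, a transform that deduplicates within each batch shortens the stream without ever seeing more than batch_size records at once.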

filter

filter(
    predicate: Callable[[ModelT], bool],
) -> Dataset[ModelT]

Filter the dataset by a per-record predicate, lazily.

PARAMETER DESCRIPTION
predicate

A predicate selecting which records to keep.

TYPE: Callable

RETURNS DESCRIPTION
Dataset

A new lazy dataset of the records for which predicate is true.

take

take(count: int) -> Dataset[ModelT]

Materialize the first count records into a new in-memory dataset.

PARAMETER DESCRIPTION
count

The number of records to take from the front.

TYPE: int

RETURNS DESCRIPTION
Dataset

An in-memory dataset of at most count records.
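
The interplay of a lazy filter with a bounding take can be sketched on bare iterator factories. Both helpers below are hypothetical stand-ins; the point is that taking a fixed count is what makes an unbounded stream safe to materialize.

```python
import itertools


def lazy_filter(source, predicate):
    """Return a factory yielding only records that satisfy `predicate`."""
    return lambda: (record for record in source() if predicate(record))


def take_first(source, count):
    """Materialize at most `count` records from the front of the stream."""
    return tuple(itertools.islice(source(), count))


# An unbounded stream of integers, filtered lazily and then bounded.
evens = lazy_filter(lambda: itertools.count(), lambda n: n % 2 == 0)
first_three = take_first(evens, 3)
```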

materialize

materialize() -> Dataset[ModelT]

Drain the dataset into an in-memory dataset with random access.

RETURNS DESCRIPTION
Dataset

An in-memory copy supporting len and indexing.

to_arrow

to_arrow() -> Table

Materialize the dataset to an Arrow table.

The table is the flattened columnar view produced by the store's Arrow machinery: scalar fields become columns and any anchor field is expanded into the typed anchor columns.

This is a full materialization: a streaming dataset is drained and every row is buffered in memory while the table is built, so do not call it on an unbounded stream without first bounding it with take.

RETURNS DESCRIPTION
Table

The materialized columnar view.

to_pandas

to_pandas() -> DataFrame

Materialize the dataset to a pandas DataFrame.

pandas is an optional dependency, resolved through pyarrow.Table.to_pandas, which raises a clear ImportError when pandas is not installed. Like to_arrow, this is a full materialization that drains and buffers a streaming dataset.

RETURNS DESCRIPTION
DataFrame

The materialized table as a DataFrame.

RAISES DESCRIPTION
ImportError

When pandas is not installed.

from_iterable classmethod

from_iterable(
    records: Iterable[ModelT],
    *,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Build an in-memory dataset by draining an iterable of records.

PARAMETER DESCRIPTION
records

The records to collect.

TYPE: collections.abc.Iterable of ModelT

model

The model type the dataset yields.

TYPE: type of ModelT or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

An in-memory dataset over the drained records.

Features

lairs.data.features

Dataset feature description derived from the generated models.

Features is a didactic model describing a dataset's columnar schema, read off the generated record model field specs so it always matches the lexicons. The derivation maps each didactic field annotation to a dtype token, unwrapping optionality, exploding tuples into sequence tokens, descending into nested dx.Embed structs, and marking opaque fields as a binary dtype.

FeatureSpec

Bases: Model

A single named feature and its dtype.

ATTRIBUTE DESCRIPTION
name

The feature (column) name.

TYPE: str

dtype

The feature dtype as a string token (for example "string").

TYPE: str

nullable

Whether the feature admits null values.

TYPE: bool, optional

Features

Bases: Model

A dataset schema description as an ordered tuple of feature specs.

ATTRIBUTE DESCRIPTION
specs

The ordered feature specifications.

TYPE: tuple of FeatureSpec

names

names() -> tuple[str, ...]

Return the feature names in order.

RETURNS DESCRIPTION
tuple of str

The ordered feature column names.

get

get(name: str) -> FeatureSpec | None

Return the spec for a feature name, or None when absent.

PARAMETER DESCRIPTION
name

The feature column name to look up.

TYPE: str

RETURNS DESCRIPTION
FeatureSpec or None

The matching spec, or None when no feature has that name.

dtype_of

dtype_of(annotation: _Annotation) -> str

Map a didactic field annotation to a dtype token.

The mapping unwraps optionality, turns tuples into sequence<...> tokens, descends through dx.Embed to its inner type, renders model-valued fields (including embeds and tagged unions) as struct, and renders literals as string. Unrecognised annotations fall back to string.

PARAMETER DESCRIPTION
annotation

The field annotation from a model's field spec.

TYPE: _Annotation

RETURNS DESCRIPTION
str

The dtype token for the annotation.
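
The shape of this mapping can be sketched over standard typing annotations. This is an analogy, not the real function: dataclasses stand in for didactic models, dx.Embed is not modelled, and the scalar tokens other than "string" ("int64", "float64", "bool") are assumed names chosen for the example.

```python
import dataclasses
import types
from typing import Literal, Union, get_args, get_origin

# Assumed scalar tokens; only "string" is named in the text above.
_SCALARS = {str: "string", int: "int64", float: "float64", bool: "bool"}
# typing.Union plus the X | Y union type on Python 3.10+.
_UNIONS = (Union, getattr(types, "UnionType", Union))


@dataclasses.dataclass
class Span:
    """Example model-valued type used to show the struct case."""

    start: int
    end: int


def dtype_token(annotation) -> str:
    origin = get_origin(annotation)
    if origin in _UNIONS:
        inner = [a for a in get_args(annotation) if a is not type(None)]
        if len(inner) == 1:
            return dtype_token(inner[0])  # unwrap optionality
        # A union of model types is rendered as a struct.
        if all(dataclasses.is_dataclass(a) for a in inner):
            return "struct"
        return "string"
    if origin is tuple:
        args = [a for a in get_args(annotation) if a is not Ellipsis]
        item = dtype_token(args[0]) if args else "string"
        return f"sequence<{item}>"
    if origin is Literal:
        return "string"
    if dataclasses.is_dataclass(annotation):
        return "struct"
    # Unrecognised annotations fall back to string.
    return _SCALARS.get(annotation, "string")
```

Optionality is unwrapped before anything else, so an optional tuple of strings and a plain tuple of strings map to the same sequence token.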

features_of

features_of(model: type[Model]) -> Features

Derive a Features description from a model's field specs.

The feature order matches the model's field-spec order. Each feature's dtype is mapped from the field annotation by dtype_of, except that opaque fields are forced to a binary token. A feature is nullable when its field is not required or its annotation admits None.

PARAMETER DESCRIPTION
model

The generated record model to describe.

TYPE: type of didactic.api.Model

RETURNS DESCRIPTION
Features

The derived feature description, one spec per model field.
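
The nullability rule can be illustrated with dataclasses standing in for generated models. This sketch assumes that "not required" corresponds to a field having a default; the nullable_flags helper and the Token class are hypothetical.

```python
import dataclasses
import types
from typing import Optional, Union, get_args, get_origin

_UNIONS = (Union, getattr(types, "UnionType", Union))


def admits_none(annotation) -> bool:
    """Return True when the annotation is a union that includes None."""
    return get_origin(annotation) in _UNIONS and type(None) in get_args(annotation)


def nullable_flags(model):
    """Return (field name, nullable) pairs in field order."""
    flags = []
    for field in dataclasses.fields(model):
        # Assumption: a field with no default is a required field.
        required = (
            field.default is dataclasses.MISSING
            and field.default_factory is dataclasses.MISSING
        )
        flags.append((field.name, (not required) or admits_none(field.type)))
    return tuple(flags)


@dataclasses.dataclass
class Token:
    text: str                # required, never None
    lemma: Optional[str]     # required, but may be None
    score: float = 0.0       # not required
```

Here only text is non-nullable: lemma admits None through its annotation, and score is nullable because it is not required.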