Dataset API¶

The corpus surface, the lazy dataset, and the feature description derived from the generated models. A Corpus joins Layers records by AT-URI and exposes them as Dataset views. Features reads a dataset's columnar schema off the model field specs. For usage see Guides > The dataset API.

Corpus¶

lairs.data.corpus ¶

The corpus surface: a graph of records joined by AT-URIs.

A Corpus exposes dataset views (expressions, annotation layers) over the graph of Layers records, plus authoring and persistence entry points. The graph is held in a `lairs.store.pool.ModelPool` keyed by AT-URI, so cross-refs (an annotation layer's expression, an expression's mediaRef, a segmentation's expression) resolve to model instances. The join helpers walk those refs to group related records per expression.

Membership records (pub.layers.corpus.membership) tie an expression to a corpus via corpusRef and carry an optional split slug. When the pool holds membership records for this corpus, the expression views and joins are restricted to the expressions those memberships reference, so loading one corpus from an authority that hosts several does not bleed the others' expressions in. When no membership records are present (for example a freshly authored corpus built only through the add_* helpers) every pooled expression is treated as a member, which keeps direct authoring ergonomic.
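The scoping rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `Membership` dataclass, its `corpus_ref` and `expression_ref` field names, and the `member_expression_uris` helper are all hypothetical stand-ins for the real records.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Membership:
    """Hypothetical stand-in for a pub.layers.corpus.membership record."""

    corpus_ref: str
    expression_ref: str  # assumed field name for the expression reference
    split: str | None = None


def member_expression_uris(
    expression_uris: list[str],
    memberships: list[Membership],
    corpus_uri: str | None,
) -> list[str]:
    """Return the expression URIs that count as corpus members, in pool order."""
    scoped = [
        m for m in memberships if corpus_uri is None or m.corpus_ref == corpus_uri
    ]
    if not scoped:
        # No membership records for this corpus: every pooled expression is a member.
        return list(expression_uris)
    wanted = {m.expression_ref for m in scoped}
    return [uri for uri in expression_uris if uri in wanted]
```

With memberships present, only the referenced expressions survive; with none, the pool is returned unchanged, which is the behaviour that keeps direct authoring convenient.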

Loading dispatches on a source. The pds source enumerates the relevant collections of an authority's repository through a PDS client; the appview source uses the appview query API; auto prefers the appview and falls back to the PDS. A client may be injected for testing without network access.

The record model `pub.layers.corpus.Corpus` is imported qualified as corpus_records.Corpus to avoid clashing with the dataset-surface `Corpus` class defined here.

ExpressionWithAnnotations ¶

Bases: Model

An expression joined to its annotation layers.

ATTRIBUTE	DESCRIPTION
`expression`	The expression record. TYPE: `Expression`
`uri`	The AT-URI of the expression. TYPE: `str`
`annotation_layers`	The annotation layers whose `expression` ref points at this one. TYPE: `tuple of pub.layers.annotation.AnnotationLayer`

ExpressionWithMedia ¶

Bases: Model

An expression joined to its media record.

ATTRIBUTE	DESCRIPTION
`expression`	The expression record. TYPE: `Expression`
`uri`	The AT-URI of the expression. TYPE: `str`
`media`	The media record referenced by the expression's `mediaRef`, if loaded. TYPE: `Media or None`

ExpressionWithSegmentation ¶

Bases: Model

An expression joined to its segmentation records.

ATTRIBUTE	DESCRIPTION
`expression`	The expression record. TYPE: `Expression`
`uri`	The AT-URI of the expression. TYPE: `str`
`segmentations`	The segmentations whose `expression` ref points at this one. TYPE: `tuple of pub.layers.segmentation.Segmentation`

Corpus ¶

Corpus(
    pool: ModelPool | None = None, *, uri: str | None = None
)

A graph of Layers records joined by AT-URI cross-references.

PARAMETER	DESCRIPTION
`pool`	A pre-populated pool of records keyed by AT-URI. When omitted an empty pool is created and records may be added through the authoring helpers. TYPE: `ModelPool or None` DEFAULT: `None`
`uri`	The AT-URI of the backing corpus record, when the corpus was loaded from one. TYPE: `str or None` DEFAULT: `None`

ATTRIBUTE	DESCRIPTION
`pool`	The AT-URI-keyed record graph. TYPE: `ModelPool`
`uri`	The corpus record AT-URI, if any. TYPE: `str or None`

expressions `property` ¶

expressions: Dataset[Expression]

Return a dataset of the corpus member expressions.

When the pool holds membership records for this corpus only the expressions those memberships reference are returned; otherwise every pooled expression is returned.

RETURNS	DESCRIPTION
`Dataset`	A dataset of expression models, in pool order.

corpus_record `property` ¶

corpus_record: Corpus | None

Return the backing corpus record, if one is loaded.

The record is looked up in the pool at `uri`; it is None when the corpus has no AT-URI or when no corpus record was loaded for it.

RETURNS	DESCRIPTION
`Corpus or None`	The backing corpus record, or `None` when absent.

new `classmethod` ¶

new(uri: str | None = None) -> Corpus

Create an empty corpus for authoring.

PARAMETER	DESCRIPTION
`uri`	An AT-URI to associate with the corpus record. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Corpus`	A new, empty corpus.

expression_uris ¶

expression_uris() -> list[str]

Return the AT-URIs of the corpus member expressions.

RETURNS	DESCRIPTION
`list of str`	The member expression AT-URIs, in pool order.

annotation_layers ¶

annotation_layers(
    *, kind: str | None = None, subkind: str | None = None
) -> Dataset[AnnotationLayer]

Return a dataset of annotation layers, optionally filtered.

PARAMETER	DESCRIPTION
`kind`	An annotation-layer kind filter (for example `"token-tag"`). TYPE: `str or None` DEFAULT: `None`
`subkind`	An annotation-layer subkind filter (for example `"pos"`). TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	A dataset of annotation-layer models matching the filters.

segmentations ¶

segmentations() -> Dataset[Segmentation]

Return a dataset of the corpus segmentations.

RETURNS	DESCRIPTION
`Dataset`	A dataset of segmentation models, in pool order.

media ¶

media() -> Dataset[Media]

Return a dataset of the corpus media records.

RETURNS	DESCRIPTION
`Dataset`	A dataset of media models, in pool order.

memberships ¶

memberships() -> Dataset[Membership]

Return a dataset of the corpus membership records.

Each membership ties an expression to this corpus via corpusRef and may carry a split slug and an ordinal. When `uri` is set, only the memberships whose corpusRef equals it are returned.

RETURNS	DESCRIPTION
`Dataset`	A dataset of membership models, in pool order.

split ¶

split(name: str) -> Dataset[Expression]

Return the corpus member expressions assigned to a named split.

Expressions are joined to their membership records by AT-URI and kept when a membership's split slug equals name (for example "train", "dev", "test", or "unlabeled"). An expression with several memberships is included when any of them carries the split.

PARAMETER	DESCRIPTION
`name`	The split slug to select. TYPE: `str`

RETURNS	DESCRIPTION
`Dataset`	A dataset of the expression models in that split, in pool order.

splits ¶

splits() -> tuple[str, ...]

Return the distinct split slugs present in the corpus memberships.

RETURNS	DESCRIPTION
`tuple of str`	The split slugs, sorted, excluding memberships with no split.
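The selection rules for split and splits can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the `Membership` dataclass, its `expression_ref` field name, and the `split_uris` and `split_names` helpers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Membership:
    """Hypothetical stand-in for a membership record."""

    expression_ref: str  # assumed field name
    split: str | None = None


def split_uris(
    expression_uris: list[str], memberships: list[Membership], name: str
) -> list[str]:
    """Keep expressions with at least one membership whose split equals name."""
    in_split = {m.expression_ref for m in memberships if m.split == name}
    return [uri for uri in expression_uris if uri in in_split]


def split_names(memberships: list[Membership]) -> tuple[str, ...]:
    """Return the distinct split slugs, sorted, skipping memberships with no split."""
    return tuple(sorted({m.split for m in memberships if m.split is not None}))
```

An expression listed under both "dev" and "train" appears in both splits, because any one matching membership is enough.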

add_membership ¶

add_membership(uri: str, membership: Membership) -> None

Add a membership record to the corpus graph.

PARAMETER	DESCRIPTION
`uri`	The AT-URI of the membership record. TYPE: `str`
`membership`	The membership record binding an expression to a corpus. TYPE: `Membership`

with_annotations ¶

with_annotations() -> Dataset[ExpressionWithAnnotations]

Join each expression to the annotation layers that target it.

Annotation layers carry an expression AT-URI; this groups the layers by that ref and attaches them to the matching expression. Expressions with no layers still appear, with an empty group.

RETURNS	DESCRIPTION
`Dataset`	A dataset of expression-and-annotations join rows.
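The grouping join described above can be sketched with a dictionary of lists. This is a simplified illustration: records are reduced to bare URIs, and `join_annotations` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def join_annotations(
    expression_uris: list[str],
    layers: list[tuple[str, str]],
) -> list[tuple[str, tuple[str, ...]]]:
    """Group (layer_uri, expression_uri) pairs under each expression URI."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for layer_uri, expression_uri in layers:
        grouped[expression_uri].append(layer_uri)
    # .get keeps expressions with no layers in the result with an empty group,
    # and avoids inserting new keys into the defaultdict while reading.
    return [(uri, tuple(grouped.get(uri, ()))) for uri in expression_uris]
```

Layers that point at an expression outside the corpus are simply never looked up, so they do not appear in any row.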

with_media ¶

with_media() -> Dataset[ExpressionWithMedia]

Join each expression to the media record it references.

An expression's mediaRef AT-URI is resolved through the pool; when the media record is not loaded the join row carries None.

RETURNS	DESCRIPTION
`Dataset`	A dataset of expression-and-media join rows.

with_segmentation ¶

with_segmentation() -> Dataset[ExpressionWithSegmentation]

Join each expression to the segmentations that target it.

Segmentations carry an expression AT-URI; this groups them by that ref and attaches them to the matching expression.

RETURNS	DESCRIPTION
`Dataset`	A dataset of expression-and-segmentation join rows.

add_expression ¶

add_expression(uri: str, expression: Expression) -> None

Add an expression record to the corpus graph.

PARAMETER	DESCRIPTION
`uri`	The AT-URI of the expression. TYPE: `str`
`expression`	The expression record to add. TYPE: `Expression`

add_annotation_layer ¶

add_annotation_layer(
    uri: str, layer: AnnotationLayer
) -> None

Add an annotation layer record to the corpus graph.

PARAMETER	DESCRIPTION
`uri`	The AT-URI of the annotation layer. TYPE: `str`
`layer`	The annotation layer record to add. TYPE: `AnnotationLayer`

add_record ¶

add_record(uri: str, record: Model) -> None

Add any Layers record to the corpus graph by AT-URI.

PARAMETER	DESCRIPTION
`uri`	The AT-URI of the record. TYPE: `str`
`record`	The record to add (expression, layer, segmentation, media, etc.). TYPE: `Model`

save_to_repo ¶

save_to_repo(path: Path) -> str

Persist the corpus graph to a didactic Repository and commit.

Delegates to the store's `lairs.store.repository.Repository`, staging every record under its AT-URI and committing a single snapshot.

PARAMETER	DESCRIPTION
`path`	The repository directory to initialise or reuse. TYPE: `Path`

RETURNS	DESCRIPTION
`str`	The new commit revision identifier.

materialize ¶

materialize(out_dir: Path) -> list[Path]

Materialize the corpus to Parquet views.

Builds the normalized expressions and annotations Arrow views from the graph and delegates writing to the store's Arrow function `lairs.store.arrow.materialize`. The expressions view holds the corpus member expressions only (see `expressions`).

PARAMETER	DESCRIPTION
`out_dir`	The output directory for the views. TYPE: `Path`

RETURNS	DESCRIPTION
`list of pathlib.Path`	The written view files, in name order.

load_corpus ¶

load_corpus(
    uri: str,
    *,
    source: str = "auto",
    cache_dir: str | None = None,
    revision: str | None = None,
    pds_client: PdsClient | None = None,
) -> Corpus

Load a corpus by AT-URI from a PDS or the appview.

The loader enumerates the Layers record collections of the AT-URI's authority and builds the joined graph. The corpus's expression views and joins are then scoped to the expressions reachable through membership records whose corpusRef matches uri, so an authority that hosts several corpora yields only this corpus's members. The pds source reads directly from a PDS. The appview and auto sources are not yet implemented without an appview client; for now, load through the pds source with an injected client.

PARAMETER	DESCRIPTION
`uri`	The corpus AT-URI (its authority is enumerated). TYPE: `str`
`source`	The source to load from (`"pds"`, `"appview"`, or `"auto"`). TYPE: `str` DEFAULT: `'auto'`
`cache_dir`	A local cache directory (reserved; not yet used). TYPE: `str or None` DEFAULT: `None`
`revision`	A revision (Repository tag) to resolve (reserved; not yet used). TYPE: `str or None` DEFAULT: `None`
`pds_client`	An injected PDS client. Required for the `pds` source; supplying it avoids network setup in tests. TYPE: `PdsClient or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Corpus`	The loaded corpus.

RAISES	DESCRIPTION
`ValueError`	When `source` is not a recognised source value.
`NotImplementedError`	When the appview source is requested without an appview client, or the PDS source is requested without an injected client.

Dataset¶

lairs.data.dataset ¶

HuggingFace-like dataset over generated record models.

A Dataset is a lazy, optionally streaming sequence of generated model instances, with map and materialization helpers. It is generic over the model type it yields so indexing and iteration are precisely typed.

The dataset is lazy by default: it holds a source that produces model instances on demand, plus an optional chain of per-record transforms applied as records flow through. Two source shapes are supported. An in-memory source wraps a concrete tuple of models and supports random access and len. A streaming source wraps a zero-argument factory that returns a fresh iterator of models (for example a PDS cursor or a repository scan); it has no length and no random access until it is drained.
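The two source shapes can be sketched with a miniature class. This is a hypothetical illustration of the idea, not the library's `Dataset`: the `LazySketch` name is invented, and raising `TypeError` for length and indexing on a stream is an assumption about reasonable behaviour.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator, Sequence
from typing import Generic, TypeVar

T = TypeVar("T")


class LazySketch(Generic[T]):
    """Hypothetical miniature of a dataset with in-memory and streaming sources."""

    def __init__(
        self,
        records: Sequence[T] | None = None,
        *,
        source: Callable[[], Iterator[T]] | None = None,
    ) -> None:
        if records is not None and source is not None:
            raise ValueError("records and source are mutually exclusive")
        if records is None and source is None:
            records = ()  # neither given: an empty in-memory dataset
        self._records = tuple(records) if records is not None else None
        self._source = source

    @property
    def is_streaming(self) -> bool:
        return self._source is not None

    def __iter__(self) -> Iterator[T]:
        if self._source is not None:
            return self._source()  # a fresh iterator on every pass
        return iter(self._records or ())

    def __len__(self) -> int:
        if self._records is None:
            raise TypeError("a streaming dataset has no length until drained")
        return len(self._records)

    def __getitem__(self, index: int) -> T:
        if self._records is None:
            raise TypeError("a streaming dataset has no random access until drained")
        return self._records[index]
```

Because the streaming source is a factory rather than a single iterator, the sketch can be iterated more than once; each pass asks the factory for a new iterator.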

Dataset ¶

Dataset(
    records: Sequence[ModelT] | None = None,
    *,
    model: type[ModelT] | None = None,
    source: Callable[[], Iterator[ModelT]] | None = None,
)

A lazy dataset of generated record models of one type.

The dataset is generic over ModelT, the model type it yields, so indexing and iteration are precisely typed rather than widened. A dataset is constructed either from an in-memory tuple of records (the default, and the only form that supports random access and len) or from a streaming factory.

PARAMETER	DESCRIPTION
`records`	The in-memory records the dataset yields. Mutually exclusive with `source`; when both are omitted the dataset is empty. TYPE: `collections.abc.Sequence of ModelT or None` DEFAULT: `None`
`model`	The model type the dataset yields. Required for an empty or streaming dataset so that `features` can be derived; inferred from the first record otherwise. TYPE: `type of ModelT or None` DEFAULT: `None`
`source`	A zero-argument factory returning a fresh iterator of records for a streaming dataset. Mutually exclusive with `records`. TYPE: `Callable or None` DEFAULT: `None`

is_streaming `property` ¶

is_streaming: bool

Return whether the dataset is backed by a streaming source.

RETURNS	DESCRIPTION
`bool`	`True` when the dataset pulls lazily and has no random access.

features `property` ¶

features: Features

Return the dataset schema derived from the model.

RETURNS	DESCRIPTION
`Features`	The feature description for the dataset's model type.

streaming `classmethod` ¶

streaming(
    source: Callable[[], Iterator[ModelT]],
    *,
    model: type[ModelT],
) -> Dataset[ModelT]

Build a streaming dataset from an iterator factory.

A streaming dataset pulls records lazily from source and does not hold the whole collection in memory until a materializing call (for example `to_arrow`) drains it.

PARAMETER	DESCRIPTION
`source`	A zero-argument factory returning a fresh iterator of records. TYPE: `Callable`
`model`	The model type the stream yields, used to derive features. TYPE: `type of ModelT`

RETURNS	DESCRIPTION
`Dataset`	A streaming dataset over the source.

iter ¶

iter(batch_size: int = 1) -> Iterator[tuple[ModelT, ...]]

Iterate over the dataset in batches.

PARAMETER	DESCRIPTION
`batch_size`	The number of records per batch. The final batch may be smaller. TYPE: `int` DEFAULT: `1`

YIELDS	DESCRIPTION
`tuple of ModelT`	Successive batches of records.

RAISES	DESCRIPTION
`ValueError`	When `batch_size` is not positive.
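Batched iteration of this kind can be sketched with `itertools.islice`. The `iter_batches` function below is a hypothetical standalone helper that mirrors the documented contract (tuples of up to `batch_size` records, a shorter final batch, `ValueError` on a non-positive size); it is not the method's actual implementation.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int = 1) -> Iterator[tuple[T, ...]]:
    """Yield successive tuples of up to batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    # islice pulls at most batch_size items; an empty tuple ends the loop.
    while batch := tuple(islice(iterator, batch_size)):
        yield batch
```

Since this sketch is a generator, the size check runs when iteration starts rather than when the function is called; whether the library validates eagerly is not stated in the reference.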

map ¶

map(
    fn: Callable[[ModelT], ModelT],
    *,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Apply a lazy per-record transform.

The transform is not applied eagerly: it is composed onto the dataset's source so it runs as records flow through a later iteration or materialization. The result preserves the source's laziness and streaming behaviour.

This is strictly per-record. map has no batched mode that hands the callable a batch, because the transform signature is fixed to one record in and one record out; group the records yourself with `iter` when a batch view is needed, or use `map_batched` for a batch-level transform.

PARAMETER	DESCRIPTION
`fn`	The per-record transform mapping a model to a model. TYPE: `Callable`
`model`	The model type the transformed dataset yields. Defaults to this dataset's model type; supply it when the transform changes the feature shape and the new shape must be derivable. TYPE: `type of ModelT or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	A new lazy dataset with the transform applied.
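Lazy composition of a per-record transform can be sketched by wrapping an iterator factory. The `lazy_map` helper and the `double` transform are hypothetical; the point is only that composing the transform runs nothing until the result is iterated.

```python
from collections.abc import Callable, Iterator
from typing import TypeVar

T = TypeVar("T")


def lazy_map(
    source: Callable[[], Iterator[T]],
    fn: Callable[[T], T],
) -> Callable[[], Iterator[T]]:
    """Compose fn onto an iterator factory without running it yet."""

    def mapped() -> Iterator[T]:
        # The generator expression applies fn only as records are pulled.
        return (fn(record) for record in source())

    return mapped


calls: list[int] = []


def double(value: int) -> int:
    calls.append(value)  # records when the transform actually runs
    return value * 2


factory = lazy_map(lambda: iter([1, 2, 3]), double)
calls_before_iteration = list(calls)  # still empty: composing runs nothing
result = list(factory())  # the transform runs here, one record at a time
```

The same wrapping works for an in-memory or a streaming source, which is why the transformed dataset can keep the laziness of the one it came from.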

map_batched ¶

map_batched(
    fn: Callable[[Sequence[ModelT]], Iterable[ModelT]],
    *,
    batch_size: int = 1000,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Apply a lazy transform over batches of records.

Unlike `map`, the callable receives a batch (a sequence of records) and returns an iterable of records, so a transform can add, drop, or reshape records across a batch (the HuggingFace map(batched=True) affordance). The transform is composed lazily onto the source and preserves streaming behaviour; the output record count need not match the input count.

PARAMETER	DESCRIPTION
`fn`	The batch transform mapping a sequence of records to an iterable of records. TYPE: `Callable`
`batch_size`	The number of records handed to `fn` per call. The final batch may be smaller. TYPE: `int` DEFAULT: `1000`
`model`	The model type the transformed dataset yields. Defaults to this dataset's model type. TYPE: `type of ModelT or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	A new lazy dataset with the batch transform applied.

RAISES	DESCRIPTION
`ValueError`	When `batch_size` is not positive.
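A batch-level transform whose output count differs from its input count can be sketched as below. Both `lazy_map_batched` and `keep_even` are hypothetical helpers written for this illustration, and integers stand in for record models.

```python
from collections.abc import Callable, Iterable, Iterator, Sequence
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def lazy_map_batched(
    source: Callable[[], Iterator[T]],
    fn: Callable[[Sequence[T]], Iterable[T]],
    batch_size: int = 1000,
) -> Callable[[], Iterator[T]]:
    """Compose a batch transform onto an iterator factory, lazily."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    def mapped() -> Iterator[T]:
        iterator = source()
        while batch := tuple(islice(iterator, batch_size)):
            # fn may return fewer or more records than it received.
            yield from fn(batch)

    return mapped


def keep_even(batch: Sequence[int]) -> list[int]:
    """Drop odd values from a batch."""
    return [value for value in batch if value % 2 == 0]


evens = list(lazy_map_batched(lambda: iter(range(7)), keep_even, batch_size=3)())
```

Here seven input records become four output records, and no batch is built until the result is iterated.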

filter ¶

filter(
    predicate: Callable[[ModelT], bool],
) -> Dataset[ModelT]

Filter the dataset by a per-record predicate, lazily.

PARAMETER	DESCRIPTION
`predicate`	A predicate selecting which records to keep. TYPE: `Callable`

RETURNS	DESCRIPTION
`Dataset`	A new lazy dataset of the records for which `predicate` is true.

take ¶

take(count: int) -> Dataset[ModelT]

Materialize the first count records into a new in-memory dataset.

PARAMETER	DESCRIPTION
`count`	The number of records to take from the front. TYPE: `int`

RETURNS	DESCRIPTION
`Dataset`	An in-memory dataset of at most `count` records.
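Bounding a stream before materializing it can be sketched with `itertools.islice`. The standalone `take` function here is a hypothetical analogue of the method, operating on any iterable rather than on a dataset.

```python
from collections.abc import Iterable
from itertools import count, islice
from typing import TypeVar

T = TypeVar("T")


def take(records: Iterable[T], n: int) -> tuple[T, ...]:
    """Materialize at most n records from the front of an iterable."""
    return tuple(islice(records, n))


# count() never ends; islice stops after three items, so this terminates.
first_three = take(count(start=10), 3)
```

The result is an ordinary tuple, so it supports len and indexing even though the source was unbounded.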

materialize ¶

materialize() -> Dataset[ModelT]

Drain the dataset into an in-memory dataset with random access.

RETURNS	DESCRIPTION
`Dataset`	An in-memory copy supporting `len` and indexing.

to_arrow ¶

to_arrow() -> Table

Materialize the dataset to an Arrow table.

The table is the flattened columnar view produced by the store's Arrow machinery: scalar fields become columns and any anchor field is expanded into the typed anchor columns.

This is a full materialization: a streaming dataset is drained and every row is buffered in memory while the table is built, so this should not be called on an unbounded stream without first bounding it with `take`.

RETURNS	DESCRIPTION
`Table`	The materialized columnar view.

to_pandas ¶

to_pandas() -> DataFrame

Materialize the dataset to a pandas DataFrame.

pandas is an optional dependency, resolved through pyarrow's `pyarrow.Table.to_pandas`, which raises a clear ImportError when pandas is not installed. Like `to_arrow`, this is a full materialization that drains and buffers a streaming dataset.

RETURNS	DESCRIPTION
`DataFrame`	The materialized table as a DataFrame.

RAISES	DESCRIPTION
`ImportError`	When pandas is not installed.

from_iterable `classmethod` ¶

from_iterable(
    records: Iterable[ModelT],
    *,
    model: type[ModelT] | None = None,
) -> Dataset[ModelT]

Build an in-memory dataset by draining an iterable of records.

PARAMETER	DESCRIPTION
`records`	The records to collect. TYPE: `collections.abc.Iterable of ModelT`
`model`	The model type the dataset yields. TYPE: `type of ModelT or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	An in-memory dataset over the drained records.

Features¶

lairs.data.features ¶

Dataset feature description derived from the generated models.

Features is a didactic model describing a dataset's columnar schema, read off the generated record model field specs so it always matches the lexicons. The derivation maps each didactic field annotation to a dtype token, unwrapping optionality, exploding tuples into sequence tokens, descending into nested dx.Embed structs, and marking opaque fields as a binary dtype.

FeatureSpec ¶

Bases: Model

A single named feature and its dtype.

ATTRIBUTE	DESCRIPTION
`name`	The feature (column) name. TYPE: `str`
`dtype`	The feature dtype as a string token (for example `"string"`). TYPE: `str`
`nullable`	Whether the feature admits null values. TYPE: `bool, optional`

Features ¶

Bases: Model

A dataset schema description as an ordered tuple of feature specs.

ATTRIBUTE	DESCRIPTION
`specs`	The ordered feature specifications. TYPE: `tuple of FeatureSpec`

names ¶

names() -> tuple[str, ...]

Return the feature names in order.

RETURNS	DESCRIPTION
`tuple of str`	The ordered feature column names.

get ¶

get(name: str) -> FeatureSpec | None

Return the spec for a feature name, or None when absent.

PARAMETER	DESCRIPTION
`name`	The feature column name to look up. TYPE: `str`

RETURNS	DESCRIPTION
`FeatureSpec or None`	The matching spec, or `None` when no feature has that name.

dtype_of ¶

dtype_of(annotation: _Annotation) -> str

Map a didactic field annotation to a dtype token.

The mapping unwraps optionality, turns tuples into sequence<...> tokens, descends through dx.Embed to its inner type, renders model-valued fields (including embeds and tagged unions) as struct, and renders literals as string. Unrecognised annotations fall back to string.

PARAMETER	DESCRIPTION
`annotation`	The field annotation from a model's field spec. TYPE: `_Annotation`

RETURNS	DESCRIPTION
`str`	The dtype token for the annotation.

features_of ¶

features_of(model: type[Model]) -> Features

Derive a `Features` description from a model's field specs.

The feature order matches the model's field-spec order. Each feature's dtype is mapped from the field annotation by `dtype_of`, except that opaque fields are forced to a binary token. A feature is nullable when its field is not required or its annotation admits None.

PARAMETER	DESCRIPTION
`model`	The generated record model to describe. TYPE: `type of didactic.api.Model`

RETURNS	DESCRIPTION
`Features`	The derived feature description, one spec per model field.
