Dataset API¶
The corpus surface, the lazy dataset, and the feature description
derived from the generated models. A Corpus joins Layers records by
AT-URI and exposes them as Dataset views. Features reads a
dataset's columnar schema off the model field specs. For usage see
Guides > The dataset API.
Corpus¶
lairs.data.corpus ¶
The corpus surface: a graph of records joined by AT-URIs.
A Corpus exposes dataset views (expressions, annotation layers) over the
graph of Layers records, plus authoring and persistence entry points. The graph
is held in a lairs.store.pool.ModelPool keyed by AT-URI, so cross-references
(an annotation layer's expression, an expression's mediaRef, a
segmentation's expression) resolve to model instances. The join helpers walk
those references to group related records per expression.
Membership records (pub.layers.corpus.membership) tie an expression to a
corpus via corpusRef and carry an optional split slug. When the pool
holds membership records for this corpus, the expression views and joins are
restricted to the expressions those memberships reference, so loading one corpus
from an authority that hosts several does not bleed the others' expressions in.
When no membership records are present (for example a freshly authored corpus
built only through the add_* helpers) every pooled expression is treated as a
member, which keeps direct authoring ergonomic.
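The scoping rule above can be sketched with plain dictionaries. This is only an illustration under stated assumptions, not the lairs implementation: the helper name `member_expression_uris` and the `expressionRef` key are hypothetical (only `corpusRef` is named in this page), and the sketch treats "no membership records for this corpus" as the fallback case.

```python
def member_expression_uris(expressions, memberships, corpus_uri):
    """Return the expression AT-URIs that count as corpus members.

    expressions: dict mapping AT-URI -> expression record, in pool order.
    memberships: list of dicts with a "corpusRef" key and a hypothetical
        "expressionRef" key.
    """
    # Keep only the memberships that point at this corpus.
    scoped = [m for m in memberships if m["corpusRef"] == corpus_uri]
    if not scoped:
        # No memberships for this corpus: every pooled expression is a member.
        return list(expressions)
    referenced = {m["expressionRef"] for m in scoped}
    # Restrict to referenced expressions while preserving pool order.
    return [uri for uri in expressions if uri in referenced]


expressions = {
    "at://a/expr/1": "e1",
    "at://a/expr/2": "e2",
    "at://a/expr/3": "e3",
}
memberships = [
    {"corpusRef": "at://a/corpus/x", "expressionRef": "at://a/expr/2"},
    {"corpusRef": "at://a/corpus/y", "expressionRef": "at://a/expr/3"},
]
scoped = member_expression_uris(expressions, memberships, "at://a/corpus/x")
unscoped = member_expression_uris(expressions, [], "at://a/corpus/x")
```

With memberships present, only the expression referenced for corpus `x` survives; with none, all three pooled expressions are treated as members.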
Loading dispatches on a source. The pds source enumerates the relevant
collections of an authority's repository through a PDS client; the appview
source uses the appview query API; auto prefers the appview and falls back to
the PDS. A client may be injected for testing without network access. See
load_corpus for which sources are currently implemented.
The pub.layers.corpus.Corpus record model is imported qualified as
corpus_records.Corpus to avoid clashing with the dataset-surface
Corpus class defined here.
ExpressionWithAnnotations ¶
Bases: Model
An expression joined to its annotation layers.
| ATTRIBUTE | DESCRIPTION |
|---|---|
expression |
The expression record.
TYPE:
|
uri |
The AT-URI of the expression.
TYPE:
|
annotation_layers |
The annotation layers whose
TYPE:
|
ExpressionWithMedia ¶
Bases: Model
An expression joined to its media record.
| ATTRIBUTE | DESCRIPTION |
|---|---|
expression |
The expression record.
TYPE:
|
uri |
The AT-URI of the expression.
TYPE:
|
media |
The media record referenced by the expression's
TYPE:
|
ExpressionWithSegmentation ¶
Bases: Model
An expression joined to its segmentation records.
| ATTRIBUTE | DESCRIPTION |
|---|---|
expression |
The expression record.
TYPE:
|
uri |
The AT-URI of the expression.
TYPE:
|
segmentations |
The segmentations whose
TYPE:
|
Corpus ¶
Corpus(
pool: ModelPool | None = None, *, uri: str | None = None
)
A graph of Layers records joined by AT-URI cross-references.
| PARAMETER | DESCRIPTION |
|---|---|
| pool | A pre-populated pool of records keyed by AT-URI. When omitted, an empty pool is created and records may be added through the authoring helpers. TYPE: ModelPool or None |
| uri | The AT-URI of the backing corpus record, when the corpus was loaded from one. TYPE: str or None |
| ATTRIBUTE | DESCRIPTION |
|---|---|
pool |
The AT-URI-keyed record graph.
TYPE:
|
uri |
The corpus record AT-URI, if any.
TYPE:
|
expressions (property) ¶
expressions: Dataset[Expression]
Return a dataset of the corpus member expressions.
When the pool holds membership records for this corpus only the expressions those memberships reference are returned; otherwise every pooled expression is returned.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of expression models, in pool order. |
corpus_record (property) ¶
corpus_record: Corpus | None
Return the backing corpus record, if one is loaded.
The record is looked up in the pool under the corpus's uri attribute; it is None when
the corpus has no AT-URI or when no corpus record was loaded for it.
| RETURNS | DESCRIPTION |
|---|---|
Corpus or None
|
The backing corpus record, or |
new (classmethod) ¶
new(uri: str | None = None) -> Corpus
Create an empty corpus for authoring.
| PARAMETER | DESCRIPTION |
|---|---|
| uri | An AT-URI to associate with the corpus record. TYPE: str or None |
| RETURNS | DESCRIPTION |
|---|---|
| Corpus | A new, empty corpus. |
expression_uris ¶
Return the AT-URIs of the corpus member expressions.
| RETURNS | DESCRIPTION |
|---|---|
| list of str | The member expression AT-URIs, in pool order. |
annotation_layers ¶
annotation_layers(
*, kind: str | None = None, subkind: str | None = None
) -> Dataset[AnnotationLayer]
Return a dataset of annotation layers, optionally filtered.
| PARAMETER | DESCRIPTION |
|---|---|
kind
|
An annotation-layer kind filter (for example
TYPE:
|
subkind
|
An annotation-layer subkind filter (for example
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of annotation-layer models matching the filters. |
segmentations ¶
segmentations() -> Dataset[Segmentation]
Return a dataset of the corpus segmentations.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of segmentation models, in pool order. |
media ¶
Return a dataset of the corpus media records.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of media models, in pool order. |
memberships ¶
memberships() -> Dataset[Membership]
Return a dataset of the corpus membership records.
Each membership ties an expression to this corpus via corpusRef and
may carry a split slug and an ordinal. When the uri attribute is set,
only the memberships whose corpusRef equals it are returned.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of membership models, in pool order. |
split ¶
split(name: str) -> Dataset[Expression]
Return the corpus member expressions assigned to a named split.
Expressions are joined to their membership records by AT-URI and kept
when a membership's split slug equals name (for example
"train", "dev", "test", or "unlabeled"). An expression
with several memberships is included when any of them carries the split.
| PARAMETER | DESCRIPTION |
|---|---|
| name | The split slug to select. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of the expression models in that split, in pool order. |
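The split selection described above, together with the distinct-slug listing that splits returns, can be sketched over plain dictionaries. The helper names and the `expressionRef` and `split` keys are hypothetical stand-ins for the membership record fields; this is a sketch of the rule, not the library's code.

```python
def expressions_in_split(expression_uris, memberships, name):
    """Keep the expressions that have at least one membership in the split."""
    in_split = {
        m["expressionRef"] for m in memberships if m.get("split") == name
    }
    # Preserve the order of the incoming (pool-ordered) URIs.
    return [uri for uri in expression_uris if uri in in_split]


def distinct_splits(memberships):
    """Return the sorted split slugs, skipping memberships with no split."""
    return tuple(sorted({m["split"] for m in memberships if m.get("split")}))


uris = ["at://a/expr/1", "at://a/expr/2", "at://a/expr/3"]
memberships = [
    {"expressionRef": "at://a/expr/1", "split": "train"},
    {"expressionRef": "at://a/expr/2", "split": "dev"},
    # A second membership puts expression 2 in "train" as well.
    {"expressionRef": "at://a/expr/2", "split": "train"},
    # No split slug: excluded from distinct_splits.
    {"expressionRef": "at://a/expr/3", "split": None},
]
train = expressions_in_split(uris, memberships, "train")
slugs = distinct_splits(memberships)
```

Expression 2 appears in the train result because one of its two memberships carries that slug.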
splits ¶
Return the distinct split slugs present in the corpus memberships.
| RETURNS | DESCRIPTION |
|---|---|
| tuple of str | The split slugs, sorted, excluding memberships with no split. |
add_membership ¶
add_membership(uri: str, membership: Membership) -> None
Add a membership record to the corpus graph.
| PARAMETER | DESCRIPTION |
|---|---|
| uri | The AT-URI of the membership record. TYPE: str |
| membership | The membership record binding an expression to a corpus. TYPE: Membership |
with_annotations ¶
with_annotations() -> Dataset[ExpressionWithAnnotations]
Join each expression to the annotation layers that target it.
Annotation layers carry an expression AT-URI; this groups the layers
by that ref and attaches them to the matching expression. Expressions
with no layers still appear, with an empty group.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of expression-and-annotations join rows. |
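The grouping join described above can be sketched as a group-by on the layer's expression reference. The function name `join_annotations` and the dictionary record shapes are hypothetical; the row keys mirror the ExpressionWithAnnotations attributes listed earlier on this page.

```python
from collections import defaultdict


def join_annotations(expressions, layers):
    """Attach to each expression the layers whose ref points at it.

    expressions: dict mapping AT-URI -> expression record, in pool order.
    layers: iterable of dicts carrying an "expression" AT-URI.
    """
    grouped = defaultdict(list)
    for layer in layers:
        grouped[layer["expression"]].append(layer)
    rows = []
    for uri, expression in expressions.items():
        rows.append(
            {
                "uri": uri,
                "expression": expression,
                # Expressions with no layers still get a row, with an empty group.
                "annotation_layers": tuple(grouped.get(uri, ())),
            }
        )
    return rows


expressions = {"at://a/expr/1": "first", "at://a/expr/2": "second"}
layers = [
    {"kind": "token", "expression": "at://a/expr/1"},
    {"kind": "span", "expression": "at://a/expr/1"},
]
rows = join_annotations(expressions, layers)
```

The same shape applies to a segmentation join: group the segmentations by their expression reference and attach each group to the matching expression.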
with_media ¶
with_media() -> Dataset[ExpressionWithMedia]
Join each expression to the media record it references.
An expression's mediaRef AT-URI is resolved through the pool; when
the media record is not loaded the join row carries None.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of expression-and-media join rows. |
with_segmentation ¶
with_segmentation() -> Dataset[ExpressionWithSegmentation]
Join each expression to the segmentations that target it.
Segmentations carry an expression AT-URI; this groups them by that
ref and attaches them to the matching expression.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dataset of expression-and-segmentation join rows. |
add_expression ¶
add_expression(uri: str, expression: Expression) -> None
Add an expression record to the corpus graph.
| PARAMETER | DESCRIPTION |
|---|---|
| uri | The AT-URI of the expression. TYPE: str |
| expression | The expression record to add. TYPE: Expression |
add_annotation_layer ¶
add_annotation_layer(
uri: str, layer: AnnotationLayer
) -> None
Add an annotation layer record to the corpus graph.
| PARAMETER | DESCRIPTION |
|---|---|
| uri | The AT-URI of the annotation layer. TYPE: str |
| layer | The annotation layer record to add. TYPE: AnnotationLayer |
add_record ¶
Add any Layers record to the corpus graph by AT-URI.
| PARAMETER | DESCRIPTION |
|---|---|
uri
|
The AT-URI of the record.
TYPE:
|
record
|
The record to add (expression, layer, segmentation, media, etc.).
TYPE:
|
save_to_repo ¶
Persist the corpus graph to a didactic Repository and commit.
Delegates to the store's lairs.store.repository.Repository,
staging every record under its AT-URI and committing a single snapshot.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
The repository directory to initialise or reuse.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| str | The new commit revision identifier. |
materialize ¶
Materialize the corpus to Parquet views.
Builds the normalized expressions and annotations Arrow views
from the graph and delegates writing to the store's Arrow function
lairs.store.arrow.materialize. The expressions view holds the
corpus member expressions only (see the expressions property).
| PARAMETER | DESCRIPTION |
|---|---|
out_dir
|
The output directory for the views.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| list of pathlib.Path | The written view files, in name order. |
load_corpus ¶
load_corpus(
uri: str,
*,
source: str = "auto",
cache_dir: str | None = None,
revision: str | None = None,
pds_client: PdsClient | None = None,
) -> Corpus
Load a corpus by AT-URI from a PDS or the appview.
The loader enumerates the Layers record collections of the AT-URI's
authority and builds the joined graph. The corpus's expression views and
joins are then scoped to the expressions reachable through membership records
whose corpusRef matches uri, so an authority that hosts several
corpora yields only this corpus's members. The pds source reads directly
from a PDS. The appview and auto sources cannot be used until an appview
client is available; for now, pass source="pds" together with an injected
pds_client.
| PARAMETER | DESCRIPTION |
|---|---|
uri
|
The corpus AT-URI (its authority is enumerated).
TYPE:
|
source
|
The source to load from (
TYPE:
|
cache_dir
|
A local cache directory (reserved; not yet used).
TYPE:
|
revision
|
A revision (Repository tag) to resolve (reserved; not yet used).
TYPE:
|
pds_client
|
An injected PDS client. Required for the
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| Corpus | The loaded corpus. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
When |
NotImplementedError
|
When the appview source is requested without an appview client, or the PDS source is requested without an injected client. |
Dataset¶
lairs.data.dataset ¶
HuggingFace-like dataset over generated record models.
A Dataset is a lazy, optionally streaming sequence of generated model
instances, with map and materialization helpers. It is generic over the model
type it yields so indexing and iteration are precisely typed.
The dataset is lazy by default: it holds a source that produces model
instances on demand, plus an optional chain of per-record transforms applied as
records flow through. Two source shapes are supported. An in-memory source wraps
a concrete tuple of models and supports random access and len. A streaming
source wraps a zero-argument factory that returns a fresh iterator of models
(for example a PDS cursor or a repository scan); it has no length and no random
access until it is drained.
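The two source shapes can be illustrated with a small stand-in class. `LazyDataset` below is hypothetical and much simpler than the real Dataset (no model type, no transforms); it only shows why an in-memory source supports len and indexing while a streaming source does not. Raising TypeError for the unsupported operations is an assumption of this sketch.

```python
from typing import Callable, Generic, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


class LazyDataset(Generic[T]):
    """Hypothetical stand-in for the two source shapes described above."""

    def __init__(
        self,
        records: Optional[Sequence[T]] = None,
        *,
        source: Optional[Callable[[], Iterator[T]]] = None,
    ) -> None:
        if records is not None and source is not None:
            raise ValueError("records and source are mutually exclusive")
        # In-memory shape: freeze the records into a tuple.
        self._records = None if source is not None else tuple(records or ())
        # Streaming shape: keep the factory and call it once per iteration.
        self._source = source

    @property
    def is_streaming(self) -> bool:
        return self._source is not None

    def __iter__(self) -> Iterator[T]:
        if self._source is not None:
            return self._source()
        return iter(self._records)

    def __len__(self) -> int:
        if self._source is not None:
            raise TypeError("a streaming dataset has no length until drained")
        return len(self._records)

    def __getitem__(self, index: int) -> T:
        if self._source is not None:
            raise TypeError("a streaming dataset has no random access until drained")
        return self._records[index]


in_memory = LazyDataset(["a", "b", "c"])
stream = LazyDataset(source=lambda: iter(["a", "b", "c"]))
```

Because the streaming source is a factory rather than a single iterator, each iteration starts from a fresh iterator and the dataset can be walked more than once.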
Dataset ¶
Dataset(
records: Sequence[ModelT] | None = None,
*,
model: type[ModelT] | None = None,
source: Callable[[], Iterator[ModelT]] | None = None,
)
A lazy dataset of generated record models of one type.
The dataset is generic over ModelT, the model type it yields, so
indexing and iteration are precisely typed rather than widened. A dataset is
constructed from an in-memory tuple of records (the default and the form
random access and len require) or from a streaming factory.
| PARAMETER | DESCRIPTION |
|---|---|
records
|
The in-memory records the dataset yields. Mutually exclusive with
TYPE:
|
model
|
The model type the dataset yields. Required for an empty or streaming
dataset so that :attr:
TYPE:
|
source
|
A zero-argument factory returning a fresh iterator of records for a
streaming dataset. Mutually exclusive with
TYPE:
|
is_streaming (property) ¶
Return whether the dataset is backed by a streaming source.
| RETURNS | DESCRIPTION |
|---|---|
bool
|
|
features (property) ¶
features: Features
Return the dataset schema derived from the model.
| RETURNS | DESCRIPTION |
|---|---|
| Features | The feature description for the dataset's model type. |
streaming (classmethod) ¶
streaming(
source: Callable[[], Iterator[ModelT]],
*,
model: type[ModelT],
) -> Dataset[ModelT]
Build a streaming dataset from an iterator factory.
A streaming dataset pulls records lazily from source and does not
materialize the whole collection in memory until a materializing call
(for example to_arrow) drains it.
| PARAMETER | DESCRIPTION |
|---|---|
| source | A zero-argument factory returning a fresh iterator of records. TYPE: Callable[[], Iterator[ModelT]] |
| model | The model type the stream yields, used to derive features. TYPE: type[ModelT] |
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A streaming dataset over the source. |
iter ¶
Iterate over the dataset in batches.
| PARAMETER | DESCRIPTION |
|---|---|
batch_size
|
The number of records per batch. The final batch may be smaller.
TYPE:
|
| YIELDS | DESCRIPTION |
|---|---|
| tuple of ModelT | Successive batches of records. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
When |
map ¶
map(
fn: Callable[[ModelT], ModelT],
*,
model: type[ModelT] | None = None,
) -> Dataset[ModelT]
Apply a lazy per-record transform.
The transform is not applied eagerly: it is composed onto the dataset's source so it runs as records flow through a later iteration or materialization. The result preserves the source's laziness and streaming behaviour.
This is strictly per-record. There is no batched mode that hands the
callable a batch, because the transform signature is fixed to one record
in and one record out; group the records yourself with iter when
a batch view is needed.
| PARAMETER | DESCRIPTION |
|---|---|
| fn | The per-record transform mapping a model to a model. TYPE: Callable[[ModelT], ModelT] |
| model | The model type the transformed dataset yields. Defaults to this dataset's model type; supply it when the transform changes the feature shape and the new shape must be derivable. TYPE: type[ModelT] or None |
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A new lazy dataset with the transform applied. |
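The lazy composition described for map can be sketched by wrapping an iterator factory in another factory. The helper `lazy_map` is hypothetical; it only demonstrates that the transform does not run when it is attached, and runs once per record when the result is iterated.

```python
def lazy_map(source, fn):
    """Compose fn onto a zero-argument iterator factory without running it."""

    def mapped():
        # The generator expression defers every fn call until iteration.
        return (fn(record) for record in source())

    return mapped


calls = []


def upper(text):
    calls.append(text)  # Record each call so laziness is observable.
    return text.upper()


source = lambda: iter(["a", "b", "c"])
mapped = lazy_map(source, upper)
calls_before_iteration = len(calls)  # Nothing has run yet.
result = list(mapped())
```

Composing onto the factory, rather than onto a concrete list, is what lets the result stay streaming when the source is streaming.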
map_batched ¶
map_batched(
fn: Callable[[Sequence[ModelT]], Iterable[ModelT]],
*,
batch_size: int = 1000,
model: type[ModelT] | None = None,
) -> Dataset[ModelT]
Apply a lazy transform over batches of records.
Unlike :meth:map, the callable receives a batch (a sequence of
records) and returns an iterable of records, so a transform can add,
drop, or reshape records across a batch (the HuggingFace
map(batched=True) affordance). The transform is composed lazily onto
the source and preserves streaming behaviour; the output record count
need not match the input count.
| PARAMETER | DESCRIPTION |
|---|---|
fn
|
The batch transform mapping a sequence of records to an iterable of records.
TYPE:
|
batch_size
|
The number of records handed to
TYPE:
|
model
|
The model type the transformed dataset yields. Defaults to this dataset's model type.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A new lazy dataset with the batch transform applied. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
When |
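A batch transform of the kind map_batched describes can be sketched with itertools.islice. The helpers `iter_batches` and `lazy_map_batched` are hypothetical, and rejecting a non-positive batch size with ValueError is an assumption of this sketch rather than a statement of the library's exact check.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield successive tuples of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")  # Assumed validation.
    iterator = iter(records)
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def lazy_map_batched(source, fn, batch_size=1000):
    """Compose a batch transform onto a zero-argument iterator factory."""

    def mapped():
        for batch in iter_batches(source(), batch_size):
            # fn may return more or fewer records than it received.
            yield from fn(batch)

    return mapped


def dedupe_sorted(batch):
    # Drops duplicates within one batch, so the output count can shrink.
    return sorted(set(batch))


source = lambda: iter([3, 1, 3, 2, 2, 5])
deduped = list(lazy_map_batched(source, dedupe_sorted, batch_size=3)())
```

Deduplication here is per batch, not global: a value repeated across two batches would appear twice in the output.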
filter ¶
filter(
predicate: Callable[[ModelT], bool],
) -> Dataset[ModelT]
Filter the dataset by a per-record predicate, lazily.
| PARAMETER | DESCRIPTION |
|---|---|
| predicate | A predicate selecting which records to keep. TYPE: Callable[[ModelT], bool] |
| RETURNS | DESCRIPTION |
|---|---|
Dataset
|
A new lazy dataset of the records for which |
take ¶
take(count: int) -> Dataset[ModelT]
Materialize the first count records into a new in-memory dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| count | The number of records to take from the front. TYPE: int |
| RETURNS | DESCRIPTION |
|---|---|
Dataset
|
An in-memory dataset of at most |
materialize ¶
materialize() -> Dataset[ModelT]
Drain the dataset into an in-memory dataset with random access.
| RETURNS | DESCRIPTION |
|---|---|
Dataset
|
An in-memory copy supporting |
to_arrow ¶
Materialize the dataset to an Arrow table.
The table is the flattened columnar view produced by the store's Arrow
machinery: scalar fields become columns and any anchor field is
expanded into the typed anchor columns.
This is a full materialization: a streaming dataset is drained and every
row is buffered in memory while the table is built, so this should not be
called on an unbounded stream without a bounding take first.
| RETURNS | DESCRIPTION |
|---|---|
| Table | The materialized columnar view. |
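The reason for bounding a stream before a full materialization can be shown with an unbounded iterator from the standard library. The helper `take_first` is a hypothetical stand-in for the take method; it reads only as many records as requested, so draining the result is safe where draining the raw stream would never finish.

```python
from itertools import count, islice


def take_first(source, n):
    """Materialize at most n records from a zero-argument iterator factory."""
    return tuple(islice(source(), n))


# An unbounded stream: draining it directly would never terminate.
unbounded = lambda: count(start=0)

# Bounding first makes a later full materialization finite.
first_five = take_first(unbounded, 5)
```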
to_pandas ¶
Materialize the dataset to a pandas DataFrame.
pandas is an optional dependency, resolved through pyarrow's
pyarrow.Table.to_pandas, which raises a clear ImportError
when pandas is not installed. Like to_arrow, this is a full
materialization that drains and buffers a streaming dataset.
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The materialized table as a DataFrame. |
| RAISES | DESCRIPTION |
|---|---|
| ImportError | When pandas is not installed. |
from_iterable (classmethod) ¶
from_iterable(
records: Iterable[ModelT],
*,
model: type[ModelT] | None = None,
) -> Dataset[ModelT]
Build an in-memory dataset by draining an iterable of records.
| PARAMETER | DESCRIPTION |
|---|---|
| records | The records to collect. TYPE: Iterable[ModelT] |
| model | The model type the dataset yields. TYPE: type[ModelT] or None |
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | An in-memory dataset over the drained records. |
Features¶
lairs.data.features ¶
Dataset feature description derived from the generated models.
Features is a didactic model describing a dataset's columnar schema, read
off the generated record model field specs so it always matches the lexicons.
The derivation maps each didactic field annotation to a dtype token, unwrapping
optionality, exploding tuples into sequence tokens, descending into nested
dx.Embed structs, and marking opaque fields as a binary dtype.
FeatureSpec ¶
Bases: Model
A single named feature and its dtype.
| ATTRIBUTE | DESCRIPTION |
|---|---|
name |
The feature (column) name.
TYPE:
|
dtype |
The feature dtype as a string token (for example
TYPE:
|
nullable |
Whether the feature admits null values.
TYPE:
|
Features ¶
Bases: Model
A dataset schema description as an ordered tuple of feature specs.
| ATTRIBUTE | DESCRIPTION |
|---|---|
specs |
The ordered feature specifications.
TYPE:
|
names ¶
Return the feature names in order.
| RETURNS | DESCRIPTION |
|---|---|
| tuple of str | The ordered feature column names. |
get ¶
get(name: str) -> FeatureSpec | None
Return the spec for a feature name, or None when absent.
| PARAMETER | DESCRIPTION |
|---|---|
| name | The feature column name to look up. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
FeatureSpec or None
|
The matching spec, or |
dtype_of ¶
Map a didactic field annotation to a dtype token.
The mapping unwraps optionality, turns tuples into sequence<...> tokens,
descends through dx.Embed to its inner type, renders model-valued fields
(including embeds and tagged unions) as struct, and renders literals as
string. Unrecognised annotations fall back to string.
| PARAMETER | DESCRIPTION |
|---|---|
annotation
|
The field annotation from a model's field spec.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
| str | The dtype token for the annotation. |
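The annotation-to-token mapping can be sketched with the standard typing helpers. This is a simplified illustration, not dtype_of itself: the function name `dtype_token` and the scalar tokens other than string (for example `int64`) are assumptions, and the dx.Embed and model-valued (struct) cases are left out because they depend on the didactic library.

```python
import types
from typing import Literal, Optional, Union, get_args, get_origin

# Hypothetical scalar tokens; only "string" is named on this page.
_SCALARS = {str: "string", int: "int64", float: "float64", bool: "bool"}

# typing.Union covers Optional[X]; types.UnionType covers "X | None" (3.10+).
_UNION_ORIGINS = {Union}
if hasattr(types, "UnionType"):
    _UNION_ORIGINS.add(types.UnionType)


def dtype_token(annotation):
    """Map a field annotation to a dtype token (simplified sketch)."""
    origin = get_origin(annotation)
    if origin in _UNION_ORIGINS:
        # Unwrap optionality: drop None and map the remaining type.
        inner = [arg for arg in get_args(annotation) if arg is not type(None)]
        return dtype_token(inner[0]) if len(inner) == 1 else "string"
    if origin is tuple:
        # tuple[X, ...] becomes a sequence of X's token.
        args = get_args(annotation)
        item = args[0] if args else str
        return f"sequence<{dtype_token(item)}>"
    if origin is Literal:
        return "string"
    # Unrecognised annotations fall back to string.
    return _SCALARS.get(annotation, "string")


optional_int = dtype_token(Optional[int])
string_tuple = dtype_token(tuple[str, ...])
optional_float_tuple = dtype_token(Optional[tuple[float, ...]])
literal = dtype_token(Literal["train", "dev"])
unknown = dtype_token(complex)
```

Nullability is tracked separately from the token in this design: unwrapping Optional here loses the None, which is why a feature spec carries its own nullable flag.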
features_of ¶
features_of(model: type[Model]) -> Features
Derive a Features description from a model's field specs.
The feature order matches the model's field-spec order. Each feature's dtype
is mapped from the field annotation by dtype_of, except that opaque
fields are forced to a binary token. A feature is nullable when its field is
not required or its annotation admits None.
| PARAMETER | DESCRIPTION |
|---|---|
| model | The generated record model to describe. TYPE: type[Model] |
| RETURNS | DESCRIPTION |
|---|---|
| Features | The derived feature description, one spec per model field. |