Exporters

The data-plane exporters bind to the Exporter port and emit framework-native datasets from an Arrow view. Each backend library is an optional dependency imported lazily. For usage, see [Guides: Exporters](../guide/exporters.md).

HuggingFace datasets

lairs.integrations.hf.datasets

HuggingFace datasets exporter.

Builds a datasets.Dataset straight from an Arrow view, binding to the :class:~lairs.integrations.ports.Exporter port. Because the Arrow flattening done by :mod:lairs.store.arrow already resolves the polymorphic Layers anchors into typed columns, the schema logic here stays thin: the exporter wraps the Arrow table (near zero-copy), optionally selects columns for a task template, and derives a HuggingFace Features mapping from the generated model field specs through :mod:lairs.data.features.

datasets is an optional dependency provided by the lairs[hf] extra. It is imported lazily inside the methods that need it, so importing this module never pulls datasets in; the concrete return types are bound only under TYPE_CHECKING.

Shape

Shape = Literal['nested', 'exploded']

The tabular shape an Arrow view is exported in.

nested keeps one row per expression with annotations as sequence-valued columns; exploded keeps one row per annotation. The Arrow builders in :mod:lairs.store.arrow already produce one shape or the other, so the shape is descriptive metadata the exporter records rather than a re-shaping step.
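To make the two shapes concrete, the sketch below relates them using plain Python rows. The exporter itself does no reshaping; the column names (`expression`, `labels`, `starts`) and the `explode` helper are hypothetical and only show how a nested row corresponds to exploded rows.

```python
# One expression with two annotations, as a nested row (illustrative data).
nested_rows = [
    {"expression": "Ada wrote code", "labels": ["PER", "O"], "starts": [0, 4]},
]


def explode(rows, sequence_columns):
    """Turn one row per expression into one row per annotation."""
    out = []
    for row in rows:
        lengths = {len(row[name]) for name in sequence_columns}
        if len(lengths) != 1:
            raise ValueError("sequence columns must have equal length")
        for i in range(lengths.pop()):
            # Scalar columns are repeated; sequence columns contribute item i.
            flat = {k: v for k, v in row.items() if k not in sequence_columns}
            flat.update({name: row[name][i] for name in sequence_columns})
            out.append(flat)
    return out


exploded_rows = explode(nested_rows, ("labels", "starts"))
```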

TaskTemplate

Bases: Model

A canonical HuggingFace task shape for a Layers annotation layer.

A template maps a Layers (kind, subkind, formalism) triple to a named HuggingFace task and the columns that task expects, so a token-classification or span layer exports under the conventional column names HuggingFace tooling recognises.

ATTRIBUTE DESCRIPTION
task

The canonical HuggingFace task name (for example "token-classification").

TYPE: str

kind

The Layers annotation kind this template applies to.

TYPE: str

subkind

The Layers annotation subkind, when the template is subkind-specific.

TYPE: (str or None, optional)

formalism

The Layers formalism, when the template is formalism-specific.

TYPE: (str or None, optional)

columns

The columns the task shape expects, in order.

TYPE: tuple of str, optional

ExportSpec

Bases: Model

An export specification controlling the HuggingFace dataset shape.

The Arrow flattening has already done the heavy lifting, so the spec only records the tabular shape, an optional column projection, and an optional canonical task name. It is a plain didactic model so it is serialisable and carries cleanly into a dataset card's provenance.

ATTRIBUTE DESCRIPTION
shape

The tabular shape of the Arrow view being exported.

TYPE: ({'nested', 'exploded'}, optional)

columns

An optional projection: when set, only these columns are kept, in this order. Columns absent from the view are skipped.

TYPE: tuple of str or None, optional

task

An optional canonical HuggingFace task name, recorded for downstream tooling and dataset-card provenance.

TYPE: (str or None, optional)

for_template classmethod

for_template(
    template: TaskTemplate, *, shape: Shape = "nested"
) -> ExportSpec

Build a spec that selects a task template's columns.

PARAMETER DESCRIPTION
template

The task template whose columns to project and whose task to record.

TYPE: TaskTemplate

shape

The tabular shape of the view being exported.

TYPE: ('nested', 'exploded') DEFAULT: "nested"

RETURNS DESCRIPTION
ExportSpec

A spec projecting the template's columns and recording its task.

HuggingFaceExporter

An exporter that emits a datasets.Dataset from an Arrow view.

The exporter binds to the :class:~lairs.integrations.ports.Exporter port with the Arrow table as the view, :class:ExportSpec as the spec, and a datasets.Dataset as the produced object. Because the Arrow view already carries typed anchor columns, the exporter only applies the spec's column projection and hands the table to datasets near zero-copy.

export

export(
    view: Table, *, spec: ExportSpec | None = None
) -> Dataset

Export an Arrow view to a HuggingFace dataset.

The export wraps the Arrow table directly: datasets builds an Arrow-backed dataset with no row-wise copy. When the spec carries a column projection, the table is narrowed to those columns first.

PARAMETER DESCRIPTION
view

The flattened Arrow view to export.

TYPE: Table

spec

An optional export specification (shape, column projection, task).

TYPE: ExportSpec or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

The exported, Arrow-backed dataset.

RAISES DESCRIPTION
ImportError

When the optional datasets dependency is not installed.
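The column projection described for ExportSpec.columns (keep the named columns, in the given order, skipping any that the view lacks) can be sketched in plain Python. The helper name `projected_columns` and the sample column names are assumptions for illustration; the real exporter operates on an Arrow table.

```python
from collections.abc import Sequence
from typing import Optional


def projected_columns(
    view_columns: Sequence[str], wanted: Optional[Sequence[str]]
) -> list[str]:
    """Return the columns to keep, honouring the spec's order."""
    if wanted is None:
        # No projection: keep every column in the view's own order.
        return list(view_columns)
    present = set(view_columns)
    # Absent names are skipped rather than raising.
    return [name for name in wanted if name in present]


kept = projected_columns(
    ["id", "tokens", "labels", "source"], ("labels", "tokens", "missing")
)
# With a pyarrow table, the narrowing step would then be view.select(kept).
```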

to_hf_iterable

to_hf_iterable(
    source: Callable[[], Iterator[RecordBatch]],
    *,
    spec: ExportSpec | None = None,
) -> IterableDataset

Build a streaming datasets.IterableDataset from a batch source.

The source is a zero-argument factory returning a fresh iterator of Arrow record batches, for example one driven by a PDS cursor or a Repository scan, so training can stream a large corpus without a full download. Each batch is narrowed by the spec's column projection before its rows are yielded.

datasets.IterableDataset.from_generator may invoke the underlying generator more than once (re-iteration, multi-epoch training), so source must return a fresh iterator on each call. A one-shot factory that hands back an already-consumed iterator on a later call produces no rows on re-iteration; callers driving a cursor must reset it per call.

PARAMETER DESCRIPTION
source

A zero-argument factory returning a fresh iterator of Arrow record batches. Called once per iteration of the resulting dataset, so it must return a fresh iterator each time.

TYPE: Callable

spec

An optional export specification (column projection).

TYPE: ExportSpec or None DEFAULT: None

RETURNS DESCRIPTION
IterableDataset

A streaming dataset over the batch source.

RAISES DESCRIPTION
ImportError

When the optional datasets dependency is not installed.
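The fresh-iterator requirement on source can be demonstrated without datasets installed. In this sketch, plain lists stand in for Arrow record batches, and `iterate`, `fresh_source`, and `one_shot_source` are hypothetical names; each call to `iterate` plays the role of one pass over the streaming dataset.

```python
from collections.abc import Callable, Iterator

Batch = list[dict[str, int]]  # plain-Python stand-in for an Arrow record batch
BATCHES: list[Batch] = [[{"id": 1}, {"id": 2}], [{"id": 3}]]


def iterate(source: Callable[[], Iterator[Batch]]) -> list[int]:
    """Consume one full pass, the way a single epoch would."""
    return [row["id"] for batch in source() for row in batch]


def fresh_source() -> Iterator[Batch]:
    # Correct: every call builds a new iterator over the batches.
    return iter(BATCHES)


_shared = iter(BATCHES)


def one_shot_source() -> Iterator[Batch]:
    # Incorrect: every call hands back the same, eventually exhausted, iterator.
    return _shared


fresh_epochs = [iterate(fresh_source), iterate(fresh_source)]
one_shot_epochs = [iterate(one_shot_source), iterate(one_shot_source)]
```

The second pass over `one_shot_source` sees no rows, which is the silent failure the note above warns about.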

task_template_for

task_template_for(
    kind: str,
    *,
    subkind: str | None = None,
    formalism: str | None = None,
) -> TaskTemplate | None

Return the most specific task template matching a Layers triple.

A template matches when its kind equals kind and each of its set subkind and formalism fields equals the corresponding argument. Templates are ranked by specificity, so a template constraining a subkind or formalism is preferred over a kind-only template for the same kind.

PARAMETER DESCRIPTION
kind

The Layers annotation kind to match.

TYPE: str

subkind

The Layers annotation subkind, when known.

TYPE: str or None DEFAULT: None

formalism

The Layers formalism, when known.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
TaskTemplate or None

The best-matching template, or None when no template applies.
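The matching and ranking rule can be sketched as below. This is one plausible reading of the description, not the library's implementation: the `Template` dataclass, the registry contents, and the kind, subkind, and formalism values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Template:
    task: str
    kind: str
    subkind: Optional[str] = None
    formalism: Optional[str] = None


# Hypothetical registry, from least to most specific.
TEMPLATES = (
    Template("token-classification", "token"),
    Template("token-classification", "token", subkind="pos"),
    Template("token-classification", "token", subkind="pos", formalism="ud"),
)


def matches(t: Template, kind: str, subkind: Optional[str], formalism: Optional[str]) -> bool:
    # An unset field on the template is a wildcard; a set field must be equal.
    return (
        t.kind == kind
        and (t.subkind is None or t.subkind == subkind)
        and (t.formalism is None or t.formalism == formalism)
    )


def specificity(t: Template) -> int:
    # Count how many optional fields the template constrains.
    return (t.subkind is not None) + (t.formalism is not None)


def best_template(
    kind: str, *, subkind: Optional[str] = None, formalism: Optional[str] = None
) -> Optional[Template]:
    candidates = [t for t in TEMPLATES if matches(t, kind, subkind, formalism)]
    return max(candidates, key=specificity, default=None)
```

A query with an unknown subkind still falls back to the kind-only template, while a query for an unregistered kind returns None.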

hf_features_from

hf_features_from(features: Features) -> Features

Derive a HuggingFace Features mapping from a lairs feature schema.

Each lairs :class:~lairs.data.features.FeatureSpec becomes a HuggingFace Value (or a Sequence of a Value for sequence tokens). Because the lairs features are read off the generated model field specs, the resulting HuggingFace schema always matches the lexicons.

PARAMETER DESCRIPTION
features

The lairs feature schema, typically from :func:lairs.data.features.features_of.

TYPE: Features

RETURNS DESCRIPTION
Features

The HuggingFace feature mapping.

RAISES DESCRIPTION
ImportError

When the optional datasets dependency is not installed.

HuggingFace Hub

lairs.integrations.hf.hub

HuggingFace Hub push and pull.

Mirrors a corpus to the Hub as Arrow/Parquet shards with an auto-generated dataset card carrying full provenance, and reads a mirror back. The Hub is an export and mirror target; the PDS and the didactic Repository stay canonical, so the card records exactly the references needed to reproduce the mirror from source: the corpus AT-URI, the Repository revision or tag, the vendored lexicon manifest hash, and the license from the corpus record.

datasets and huggingface_hub are optional dependencies provided by the lairs[hf] extra. They are imported lazily inside the functions that need them, so importing this module never pulls them in; the concrete return types are bound only under TYPE_CHECKING.

ProvenanceBundle

Bases: Model

The provenance carried on a mirrored corpus's dataset card.

The bundle records everything needed to trace a Hub mirror back to its canonical sources. The PDS and the Repository remain the source of truth; the bundle pins the exact corpus AT-URI, the Repository revision or tag, the vendored lexicon manifest hash, and the corpus license.

ATTRIBUTE DESCRIPTION
corpus_uri

The AT-URI of the source corpus record.

TYPE: (str or None, optional)

revision

The didactic Repository revision id the mirror was built from.

TYPE: (str or None, optional)

tag

The Repository tag naming the revision, when one exists.

TYPE: (str or None, optional)

lexicon_manifest_hash

The content hash of the vendored lexicon tree, from the manifest.

TYPE: (str or None, optional)

layers_version

The Layers lexicon version recorded in the manifest.

TYPE: (str or None, optional)

license

The license identifier from the corpus record.

TYPE: (str or None, optional)

name

The corpus name from the corpus record.

TYPE: (str or None, optional)

provenance_bundle

provenance_bundle(
    *,
    corpus_uri: str | None = None,
    revision: str | None = None,
    tag: str | None = None,
    license: str | None = None,
    name: str | None = None,
) -> ProvenanceBundle

Assemble a provenance bundle, filling the lexicon manifest fields.

The lexicon manifest hash and Layers version are read from the vendored manifest packaged with lairs, so a mirror always records the schema version it was generated against; the remaining fields are supplied by the caller from the corpus record and the Repository revision being mirrored.

PARAMETER DESCRIPTION
corpus_uri

The AT-URI of the source corpus record.

TYPE: str or None DEFAULT: None

revision

The Repository revision id the mirror was built from.

TYPE: str or None DEFAULT: None

tag

The Repository tag naming the revision.

TYPE: str or None DEFAULT: None

license

The license identifier from the corpus record.

TYPE: str or None DEFAULT: None

name

The corpus name from the corpus record.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
ProvenanceBundle

The assembled bundle with the lexicon manifest fields filled in.

dataset_card

dataset_card(bundle: ProvenanceBundle) -> str

Render a markdown dataset card from a provenance bundle.

The card documents the canonical sources of the mirror so a reader can trace it back to the PDS and the Repository, which remain authoritative. A leading YAML front-matter block carries the machine-readable license and source AT-URI for the Hub dataset viewer and metadata indexing; the prose section below repeats the full provenance. Only the fields that are set appear, so a sparse bundle yields a compact card.

PARAMETER DESCRIPTION
bundle

The provenance to render.

TYPE: ProvenanceBundle

RETURNS DESCRIPTION
str

The rendered markdown dataset card, prefixed with a YAML front-matter block when any front-matter field is set.
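A rough sketch of the rendering behaviour (front matter only when a front-matter field is set, and unset fields omitted throughout) is shown below. The front-matter key `source`, the heading text, the `render_card` helper, and the sample values are assumptions; the card layout produced by dataset_card may differ.

```python
import json
from typing import Optional

# Hypothetical mapping from bundle field to front-matter key.
FRONT_MATTER_KEYS = {"license": "license", "corpus_uri": "source"}


def render_card(bundle: dict[str, Optional[str]]) -> str:
    """Render front matter plus a prose list, skipping unset fields."""
    front = [
        # JSON string quoting is also valid YAML double-quoted syntax.
        f"{key}: {json.dumps(bundle[field])}"
        for field, key in FRONT_MATTER_KEYS.items()
        if bundle.get(field)
    ]
    body = [f"- {field}: {value}" for field, value in bundle.items() if value]
    parts = []
    if front:
        parts.append("\n".join(["---", *front, "---"]))
    parts.append("\n".join(["# Provenance", "", *body]))
    return "\n\n".join(parts) + "\n"


# Illustrative values only.
card = render_card({"license": "cc-by-4.0", "corpus_uri": None, "revision": "r1"})
```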

push_to_hub

push_to_hub(
    view: Table,
    repo_id: str,
    *,
    private: bool = False,
    provenance: ProvenanceBundle | None = None,
    token: str | None = None,
    split: str = "train",
    config_name: str = "default",
) -> str

Push an Arrow view to the Hub with a provenance dataset card.

The Arrow view is written as the dataset's data (Arrow/Parquet shards) and the provenance bundle is rendered into the dataset card so the mirror records its canonical sources. The Hub is a mirror target only; the PDS and the Repository stay canonical.

The push is two Hub commits: datasets writes the data shards, then the provenance card is uploaded as a second commit. These are not atomic; if the card upload fails after the data is pushed, the mirror exists without its provenance and the card upload must be retried. The data and card never diverge in content, so a retry is always safe.

PARAMETER DESCRIPTION
view

The Arrow view to push.

TYPE: Table

repo_id

The target Hub dataset repository identifier ("org/name").

TYPE: str

private

Whether to create a private repository.

TYPE: bool DEFAULT: False

provenance

The provenance to render into the dataset card. When omitted, a bundle carrying only the vendored lexicon manifest fields is used.

TYPE: ProvenanceBundle or None DEFAULT: None

token

A HuggingFace access token with write scope. When omitted, the ambient login state (a prior huggingface-cli login or the HF_TOKEN environment variable) is used; an unauthenticated caller surfaces a huggingface_hub authentication error from the underlying push.

TYPE: str or None DEFAULT: None

split

The dataset split name to write the shards under.

TYPE: str DEFAULT: 'train'

config_name

The dataset configuration (subset) name to write the shards under.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
str

The URL of the pushed dataset on the Hub.

RAISES DESCRIPTION
ImportError

When the optional datasets or huggingface_hub dependency is not installed.
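The non-atomic, two-commit sequence can be pictured with the sketch below, in which the data commit runs once and only the card commit is retried. The callables `push_data` and `upload_card`, the `attempts` parameter, and the choice to retry on OSError are all hypothetical; the sketch does not describe how push_to_hub itself handles failures.

```python
from collections.abc import Callable


def push_then_card(
    push_data: Callable[[], str],
    upload_card: Callable[[], None],
    *,
    attempts: int = 3,
) -> str:
    """Run the data commit once, then retry only the card commit."""
    url = push_data()  # first commit: the data shards
    last_error = None
    for _ in range(attempts):
        try:
            upload_card()  # second commit: the dataset card
            return url
        except OSError as exc:
            # Treated as transient here; the data is already on the Hub.
            last_error = exc
    raise RuntimeError(
        "data was pushed but the card upload failed; retry the card upload"
    ) from last_error
```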

load_from_hub

load_from_hub(
    repo_id: str,
    *,
    split: str | None = None,
    revision: str | None = None,
    token: str | None = None,
) -> Dataset | DatasetDict

Load a mirrored dataset back from the Hub.

When split is given, a single :class:datasets.Dataset is returned; when it is omitted, datasets.load_dataset returns a :class:datasets.DatasetDict keyed by split name for a multi-split repository (and a :class:datasets.Dataset for a single-split repository). Callers that need a concrete Dataset should pass split or index the returned dict by split name.

PARAMETER DESCRIPTION
repo_id

The Hub dataset repository identifier.

TYPE: str

split

A single split to load (for example "train"). When set, a datasets.Dataset is returned; when omitted, the whole DatasetDict is returned for a multi-split repository.

TYPE: str or None DEFAULT: None

revision

A Hub revision (branch, tag, or commit) to read.

TYPE: str or None DEFAULT: None

token

A HuggingFace access token for a private or gated repository. When omitted, the ambient login state is used.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
Dataset or DatasetDict

The loaded dataset when split is given (or the repository is single-split), otherwise the split-keyed mapping.

RAISES DESCRIPTION
ImportError

When the optional datasets dependency is not installed.
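Callers that need a concrete Dataset regardless of how the repository is laid out could normalise the return value along these lines. The helper `single_split` is hypothetical; it relies only on the fact that datasets.DatasetDict is a dict subclass, so a plain dict stands in for it here.

```python
from typing import Any


def single_split(loaded: Any, split: str = "train") -> Any:
    """Return one split whether the loader gave a mapping or a dataset."""
    if isinstance(loaded, dict):
        # A split-keyed mapping: pick the requested split.
        return loaded[split]
    # Already a single dataset: pass it through unchanged.
    return loaded
```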

PyTorch

lairs.integrations.torch

PyTorch data-plane exporter.

Emits PyTorch datasets from an Arrow view, binding to the :class:~lairs.integrations.ports.Exporter port. The exporter produces a map-style :class:torch.utils.data.Dataset, an :class:torch.utils.data.IterableDataset variant, and a collate helper, bundled in a :class:TorchExportResult.

PyTorch is an optional dependency (the lairs[torch] extra). It is imported lazily inside the methods that need it, with a clear error when it is missing, so importing this module never pulls torch in. The column-selection and batching logic is pure Python over the Arrow table, so it is exercisable without torch installed.

The Arrow view already carries typed anchor columns, so per-row union dispatch is unnecessary: numeric and anchor columns become tensors directly, and the remaining columns are passed through as Python values. The flat view carries no blob payloads, so media bytes are not materialised here; the exporter records the requested media-resolution intent on its result for a downstream loader transform (which owns the :mod:lairs.media anchor-aware resolver and the blob transport) to act on.

TorchExportSpec

Bases: Model

The export specification for the PyTorch exporter.

Selects which Arrow columns become tensor features, which are passed through as plain Python values, and whether media references are resolved as records flow. When columns is unset every column is kept; when tensor_columns is unset the numeric (and anchor) columns are inferred from the Arrow schema.

ATTRIBUTE DESCRIPTION
columns

The ordered subset of Arrow columns to keep. None keeps every column in the view's schema order.

TYPE: tuple of str or None, optional

tensor_columns

The columns to convert to tensors. None selects the numeric and anchor columns automatically from the Arrow schema.

TYPE: tuple of str or None, optional

resolve_media

Whether to resolve a per-row media reference through the :mod:lairs.media anchor resolver as rows are produced.

TYPE: (bool, optional)

TorchExportResult

Bases: Model

The bundle a PyTorch export produces.

Carries the map-style dataset, the iterable-dataset variant, and the tensor columns the collate helper stacks. The two datasets are behavioural objects held in opaque fields; the tensor columns are typed metadata so a caller can build a DataLoader with the matching collate function.

ATTRIBUTE DESCRIPTION
dataset

The map-style dataset over the Arrow rows.

TYPE: Dataset

iterable

The iterable-dataset variant over the Arrow rows.

TYPE: IterableDataset

tensor_columns

The columns the collate helper stacks into tensors.

TYPE: tuple of str, optional

resolve_media

The recorded media-resolution intent from the export spec, for a downstream loader transform to act on.

TYPE: (bool, optional)

collate

collate(
    batch: list[dict[str, JsonValue]],
) -> dict[str, JsonValue | Tensor]

Collate a batch of rows, stacking the tensor columns.

PARAMETER DESCRIPTION
batch

The per-row mappings to collate.

TYPE: list of dict

RETURNS DESCRIPTION
dict

The column-major batch with tensor columns stacked.

TorchExporter

An exporter that emits PyTorch datasets from an Arrow view.

Binds to :class:~lairs.integrations.ports.Exporter with the Arrow table as its view, :class:TorchExportSpec as its specification, and :class:TorchExportResult as its return type. PyTorch is imported lazily, so constructing the exporter and inspecting an Arrow view never require the optional extra.

export

export(
    view: Table, *, spec: TorchExportSpec | None = None
) -> TorchExportResult

Export an Arrow view to a bundle of PyTorch datasets.

Builds a map-style dataset and an iterable-dataset variant over the kept columns of the view, recording the tensor columns the bundled collate helper stacks. The datasets read rows lazily from the Arrow table, so no per-row tensor is built until a row is fetched.

PARAMETER DESCRIPTION
view

The flattened Arrow view to export.

TYPE: Table

spec

An optional export specification (column selection, tensor columns, media resolution). None keeps every column and infers the tensor columns from the schema.

TYPE: TorchExportSpec or None DEFAULT: None

RETURNS DESCRIPTION
TorchExportResult

The map-style dataset, the iterable variant, and the tensor columns.

RAISES DESCRIPTION
ImportError

When the optional torch dependency is not installed.

KeyError

When the spec names a column absent from the view.

collate_records

collate_records(
    batch: list[dict[str, JsonValue]],
    tensor_columns: tuple[str, ...],
) -> dict[str, JsonValue | Tensor]

Collate a batch of row mappings into a column-major batch mapping.

Tensor columns are stacked into a single torch tensor; the remaining columns are collected into a list, one entry per row. The function is pure apart from the lazy torch import that the tensor columns require, so a batch with no tensor columns collates without torch installed.

PARAMETER DESCRIPTION
batch

The per-row mappings to collate.

TYPE: list of dict

tensor_columns

The columns to stack into a tensor.

TYPE: tuple of str

RETURNS DESCRIPTION
dict

A column-major mapping: each tensor column maps to a stacked tensor, and every other column maps to a list of its per-row values.
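The row-major to column-major step can be sketched without torch by making the stacking function a parameter. The `collate_sketch` name and the `stack` parameter are assumptions for illustration; with torch installed, a stacking function such as `lambda vs: torch.stack([torch.as_tensor(v) for v in vs])` would take its place.

```python
from collections.abc import Callable, Sequence
from typing import Any


def collate_sketch(
    batch: list[dict[str, Any]],
    tensor_columns: Sequence[str],
    stack: Callable[[list[Any]], Any] = tuple,
) -> dict[str, Any]:
    """Collect per-row mappings into one mapping per column."""
    if not batch:
        return {}
    out: dict[str, Any] = {}
    for name in batch[0]:
        values = [row[name] for row in batch]
        # Tensor columns are stacked; everything else stays a per-row list.
        out[name] = stack(values) if name in tensor_columns else values
    return out


collated = collate_sketch(
    [{"start": 0, "text": "a"}, {"start": 4, "text": "b"}],
    tensor_columns=("start",),
)
```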

TensorFlow tf.data

lairs.integrations.tfdata

TensorFlow tf.data data-plane exporter.

Emits a tf.data.Dataset from an Arrow view, binding to the :class:~lairs.integrations.ports.Exporter port. Requires the lairs[tf] extra at runtime; tensorflow is imported lazily so that importing this module, deriving a feature spec from an Arrow schema, and running the unit tests never require tensorflow to be installed.

The Arrow-schema to feature-spec derivation is pure and tensorflow-free: each Arrow column maps to a :class:TfFeatureSpec carrying a stable dtype token (and a flag for list-valued columns). Converting those tokens to concrete tf.dtypes.DType values, and building the tf.data.Dataset itself, are the only steps that touch tensorflow, and they do so behind a lazy import.

TfFeatureSpec

Bases: Model

A single Arrow column described as a tensorflow feature.

ATTRIBUTE DESCRIPTION
name

The column name.

TYPE: str

dtype

The tensorflow dtype token (for example "int64" or "string").

TYPE: str

is_sequence

Whether the column is list-valued (a ragged or variable-length feature), in which case dtype describes the list's element type.

TYPE: (bool, optional)

TfDataSpec

Bases: Model

Options that shape the emitted tf.data.Dataset.

ATTRIBUTE DESCRIPTION
columns

The columns to keep, in order. An empty tuple keeps every column.

TYPE: tuple of str, optional

batch_size

The batch size. When None the dataset is not batched.

TYPE: (int or None, optional)

shuffle_buffer

The shuffle buffer size. When None the dataset is not shuffled.

TYPE: (int or None, optional)

seed

The shuffle seed, used only when shuffle_buffer is set.

TYPE: (int or None, optional)

drop_remainder

Whether a trailing partial batch is dropped when batching.

TYPE: (bool, optional)
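The effect of batch_size and drop_remainder can be illustrated in plain Python. The `batch_rows` helper is hypothetical; in tensorflow the corresponding call is `dataset.batch(batch_size, drop_remainder=...)`.

```python
from collections.abc import Iterable
from typing import Any


def batch_rows(
    rows: Iterable[Any], batch_size: int, *, drop_remainder: bool = False
) -> list[list[Any]]:
    """Group rows into consecutive batches of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    out: list[list[Any]] = []
    current: list[Any] = []
    for row in rows:
        current.append(row)
        if len(current) == batch_size:
            out.append(current)
            current = []
    # A trailing partial batch survives only when drop_remainder is False.
    if current and not drop_remainder:
        out.append(current)
    return out
```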

TfDataExporter

An exporter that emits a tf.data.Dataset from an Arrow view.

The exporter binds to the generic :class:~lairs.integrations.ports.Exporter port with the Arrow Table as its view and :class:TfDataSpec as its specification. tensorflow is imported lazily inside :meth:export, so importing the module and deriving feature specs never require the lairs[tf] extra.

export

export(
    view: Table, *, spec: TfDataSpec | None = None
) -> Dataset

Export an Arrow view to a tf.data.Dataset.

Each retained column becomes one tensor in a dictionary-structured dataset, keyed by column name. The optional spec selects and orders columns and toggles shuffling and batching.

PARAMETER DESCRIPTION
view

The flattened Arrow view to export.

TYPE: Table

spec

An optional export specification. When None every column is kept and the dataset is neither shuffled nor batched.

TYPE: TfDataSpec or None DEFAULT: None

RETURNS DESCRIPTION
Dataset

A dictionary-structured dataset, one entry per retained column.

RAISES DESCRIPTION
ImportError

When tensorflow is not installed.

token_of

token_of(arrow_type: DataType) -> tuple[str, bool]

Map an Arrow data type to a tensorflow dtype token and a sequence flag.

The mapping is pure and tensorflow-free. List and large-list types are reported as sequences over their element token; every other type collapses to a scalar token. Unrecognised types fall back to the string token.

PARAMETER DESCRIPTION
arrow_type

The Arrow column type to map.

TYPE: DataType

RETURNS DESCRIPTION
tuple of (str, bool)

A (token, is_sequence) pair. token is a tensorflow dtype name and is_sequence reports whether the column is list-valued.
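The mapping rule (lists report their element token with the sequence flag set, scalars map directly, unknown types fall back to the string token) can be sketched without pyarrow. Here Arrow types are represented by stand-ins: a type name string for scalars and a `("list", element)` tuple for lists. The token table and the `sketch_token_of` name are assumptions, not the library's actual table.

```python
from typing import Union

# Stand-in for an Arrow type: a scalar type name, or ("list", element_type).
ArrowLike = Union[str, tuple]

# Hypothetical scalar table keyed by Arrow type names.
SCALAR_TOKENS = {
    "int32": "int32",
    "int64": "int64",
    "float": "float32",
    "double": "float64",
    "bool": "bool",
    "string": "string",
}


def sketch_token_of(arrow_type: ArrowLike) -> tuple[str, bool]:
    """Return (dtype token, is_sequence) for a stand-in Arrow type."""
    if isinstance(arrow_type, tuple) and arrow_type[0] in ("list", "large_list"):
        # A list is a sequence over its element's token.
        element_token, _ = sketch_token_of(arrow_type[1])
        return element_token, True
    # Unrecognised scalar types fall back to the string token.
    return SCALAR_TOKENS.get(arrow_type, "string"), False
```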

feature_specs_of

feature_specs_of(
    schema: Schema, *, columns: tuple[str, ...] = ()
) -> tuple[TfFeatureSpec, ...]

Derive the tensorflow feature specs for an Arrow schema.

The derivation is pure and tensorflow-free. Each retained column becomes one :class:TfFeatureSpec carrying its dtype token and whether it is list-valued.

PARAMETER DESCRIPTION
schema

The Arrow schema to describe.

TYPE: Schema

columns

The columns to keep, in order. An empty tuple keeps every column in schema order. Names absent from the schema are skipped.

TYPE: tuple of str DEFAULT: ()

RETURNS DESCRIPTION
tuple of TfFeatureSpec

One feature spec per retained column.

WebDataset

lairs.integrations.webdataset

WebDataset data-plane exporter.

Emits tar shards for heavy media from an Arrow view, binding to the :class:~lairs.integrations.ports.Exporter port. Each sample is a keyed group of files: a __key__, a .json member holding the row's scalar fields, and, when a media column is present, a media member carrying the resolved bytes.

The tar-writing path uses the standard-library :mod:tarfile, so basic sharding is exercised without the optional webdataset library. The webdataset library (the lairs[webdataset] extra) is imported lazily inside the read-back loader only, with a clear error when it is missing, so importing this module never pulls the dependency in.

WebDatasetSpec

Bases: Model

An export specification for the WebDataset exporter.

ATTRIBUTE DESCRIPTION
output_dir

The directory the tar shards are written into. Created if absent.

TYPE: (str, optional)

shard_size

The maximum number of samples per shard. The final shard may be smaller.

TYPE: (int, optional)

shard_prefix

The filename stem each shard is named after (<prefix>-000000.tar).

TYPE: (str, optional)

key_column

The Arrow column whose value names each sample (its __key__). When None the row index is used, zero-padded to a stable width.

TYPE: (str or None, optional)

media_column

The Arrow column carrying media bytes or a resolvable media record. When present, each sample gains a media member alongside its json metadata.

TYPE: (str or None, optional)

WebDatasetExporter

An exporter that emits WebDataset tar shards from an Arrow view.

The exporter binds to the generic :class:~lairs.integrations.ports.Exporter port with a :class:pyarrow.Table view, a :class:WebDatasetSpec specification, and a list of written shard paths as its return type.

export

export(
    view: Table, *, spec: WebDatasetSpec | None = None
) -> list[Path]

Export an Arrow view to WebDataset tar shards.

Each row becomes one sample. A sample carries a .json member with the row's scalar (non-media) fields and, when spec.media_column is set, a media member with the resolved bytes. Samples are grouped into shards of at most spec.shard_size rows, each written as a tar archive.

PARAMETER DESCRIPTION
view

The flattened Arrow view to export.

TYPE: Table

spec

The export specification. A default spec is used when omitted.

TYPE: WebDatasetSpec or None DEFAULT: None

RETURNS DESCRIPTION
list of pathlib.Path

The written tar shard files, in shard order.

RAISES DESCRIPTION
ValueError

When spec.shard_size is not positive, or a named column is absent from the view.
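The sharding scheme can be sketched with the standard-library tarfile module alone. This is a simplified illustration under assumptions: the `write_shards` helper and its defaults are hypothetical, keys are row indices zero-padded to six digits, and the media member is omitted.

```python
import io
import json
import tarfile
from pathlib import Path


def write_shards(rows, output_dir, *, shard_size=2, prefix="shard"):
    """Write one .json member per row, grouped into numbered tar shards."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard_index, start in enumerate(range(0, len(rows), shard_size)):
        path = out / f"{prefix}-{shard_index:06d}.tar"
        with tarfile.open(path, "w") as tar:
            for offset, row in enumerate(rows[start : start + shard_size]):
                key = f"{start + offset:06d}"  # row index as the sample key
                payload = json.dumps(row).encode("utf-8")
                info = tarfile.TarInfo(name=f"{key}.json")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        paths.append(path)
    return paths
```

Members that share a key prefix form one sample, so a media member would be added next to the `.json` member under the same key.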

load

load(shards: list[Path]) -> Iterator[dict[str, JsonValue]]

Read shards back through the webdataset loader.

This is the read-back path used by training loops. It requires the optional webdataset library, which is imported lazily so that importing this module never pulls the dependency in.

PARAMETER DESCRIPTION
shards

The shard files to read, in order.

TYPE: list of pathlib.Path

RETURNS DESCRIPTION
collections.abc.Iterator of dict

The decoded samples, one mapping per sample.

RAISES DESCRIPTION
ImportError

When the optional webdataset library is not installed.