Exporters
The data-plane exporters bind to the Exporter port and emit framework-native
datasets from an Arrow view. Each backend library is an optional dependency,
imported lazily. For usage, see
[Guides Exporters](../guide/exporters.md).
HuggingFace datasets
lairs.integrations.hf.datasets
HuggingFace datasets exporter.
Builds a datasets.Dataset straight from an Arrow view, binding to the
lairs.integrations.ports.Exporter port. Because the Arrow flattening
done by lairs.store.arrow already resolves the polymorphic Layers anchors
into typed columns, the schema logic here stays thin: the exporter wraps the
Arrow table (near zero-copy), optionally selects columns for a task template, and
derives a HuggingFace Features mapping from the generated model field specs
through lairs.data.features.
datasets is an optional dependency provided by the lairs[hf] extra. It is
imported lazily inside the methods that need it, so importing this module never
pulls datasets in; the concrete return types are bound only under
TYPE_CHECKING.
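The lazy-import behaviour described above can be sketched as follows. This is an illustration only, not the lairs implementation: the helper name require_optional and the error wording are assumptions, while datasets.Dataset.from_list is an existing call in the datasets library.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional backend on first use, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message; the text lairs actually raises is not shown here.
        raise ImportError(
            f"{module_name!r} is not installed; it is provided by the "
            f"lairs[{extra}] extra"
        ) from exc


def export_rows(rows: list[dict[str, object]]) -> object:
    """Build a datasets.Dataset; the backend is imported only when this runs."""
    datasets = require_optional("datasets", "hf")
    return datasets.Dataset.from_list(rows)
```

Importing a module that contains these two functions never touches datasets; only a call to export_rows does.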
Shape
The tabular shape an Arrow view is exported in.
nested keeps one row per expression with annotations as sequence-valued
columns; exploded keeps one row per annotation. The Arrow builders in
lairs.store.arrow already produce one shape or the other, so the shape is
descriptive metadata the exporter records rather than a re-shaping step.
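To show how the two shapes relate, here is a small pure-Python sketch that turns nested rows into exploded rows. It is illustrative only: the exporter itself does not reshape, and the column names (expression, label, start) and the explode helper are hypothetical.

```python
from typing import Any


def explode(
    rows: list[dict[str, Any]], sequence_columns: tuple[str, ...]
) -> list[dict[str, Any]]:
    """Turn one row per expression into one row per annotation."""
    exploded = []
    for row in rows:
        lengths = {len(row[name]) for name in sequence_columns}
        if len(lengths) != 1:
            raise ValueError("sequence columns must be given and equally long")
        for index in range(lengths.pop()):
            # Scalar columns are repeated; sequence columns contribute one item.
            flat = {k: v for k, v in row.items() if k not in sequence_columns}
            for name in sequence_columns:
                flat[name] = row[name][index]
            exploded.append(flat)
    return exploded


nested = [
    {"expression": "expr-1", "label": ["PER", "LOC"], "start": [0, 10]},
    {"expression": "expr-2", "label": ["ORG"], "start": [4]},
]
exploded = explode(nested, ("label", "start"))
print(len(nested), len(exploded))  # 2 3
```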
TaskTemplate
Bases: Model
A canonical HuggingFace task shape for a Layers annotation layer.
A template maps a Layers (kind, subkind, formalism) triple to a named
HuggingFace task and the columns that task expects, so a token-classification
or span layer exports under the conventional column names HuggingFace tooling
recognises.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| task | The canonical HuggingFace task name (for example |
| kind | The Layers annotation kind this template applies to. |
| subkind | The Layers annotation subkind, when the template is subkind-specific. |
| formalism | The Layers formalism, when the template is formalism-specific. |
| columns | The columns the task shape expects, in order. |
ExportSpec
Bases: Model
An export specification controlling the HuggingFace dataset shape.
The Arrow flattening has already done the heavy lifting, so the spec only records the tabular shape, an optional column projection, and an optional canonical task name. It is a plain didactic model so it is serialisable and carries cleanly into a dataset card's provenance.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| shape | The tabular shape of the Arrow view being exported. |
| columns | An optional projection: when set, only these columns are kept, in this order. Columns absent from the view are skipped. |
| task | An optional canonical HuggingFace task name, recorded for downstream tooling and dataset-card provenance. |

for_template (classmethod)
for_template(
template: TaskTemplate, *, shape: Shape = "exploded"
) -> ExportSpec
Build a spec that selects a task template's columns.
| PARAMETER | DESCRIPTION |
|---|---|
| template | The task template whose columns to project and whose task to record. |
| shape | The tabular shape of the view being exported. |

| RETURNS | DESCRIPTION |
|---|---|
| ExportSpec | A spec projecting the template's columns and recording its task. |
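The projection rule (keep the named columns in the given order, skip names absent from the view) and the for_template constructor can be sketched with plain dataclasses. SketchTemplate, SketchSpec and project are hypothetical stand-ins, a dict of lists stands in for the Arrow table, and the column names are made up.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchTemplate:
    task: str
    kind: str
    columns: tuple[str, ...]


@dataclass(frozen=True)
class SketchSpec:
    shape: str = "nested"
    columns: Optional[tuple[str, ...]] = None
    task: Optional[str] = None

    @classmethod
    def for_template(
        cls, template: SketchTemplate, *, shape: str = "exploded"
    ) -> "SketchSpec":
        # Project the template's columns and record its task name.
        return cls(shape=shape, columns=template.columns, task=template.task)


def project(
    view: dict[str, list[Any]], spec: Optional[SketchSpec]
) -> dict[str, list[Any]]:
    """Keep spec.columns in order; names absent from the view are skipped."""
    if spec is None or spec.columns is None:
        return dict(view)
    return {name: view[name] for name in spec.columns if name in view}


view = {"uri": ["a"], "labels": [["O"]], "tokens": [["hi"]]}
template = SketchTemplate(
    task="token-classification", kind="tag", columns=("tokens", "labels", "missing")
)
spec = SketchSpec.for_template(template)
print(list(project(view, spec)))  # ['tokens', 'labels']
```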
HuggingFaceExporter
An exporter that emits a datasets.Dataset from an Arrow view.
The exporter binds to the lairs.integrations.ports.Exporter port
with the Arrow table as the view, ExportSpec as the spec, and a
datasets.Dataset as the produced object. Because the Arrow view already
carries typed anchor columns, the exporter only applies the spec's column
projection and hands the table to datasets near zero-copy.
export
export(
view: Table, *, spec: ExportSpec | None = None
) -> Dataset
Export an Arrow view to a HuggingFace dataset.
The export wraps the Arrow table directly: datasets builds an
Arrow-backed dataset with no row-wise copy. When the spec carries a
column projection, the table is narrowed to those columns first.
| PARAMETER | DESCRIPTION |
|---|---|
| view | The flattened Arrow view to export. |
| spec | An optional export specification (shape, column projection, task). |

| RETURNS | DESCRIPTION |
|---|---|
| Dataset | The exported, Arrow-backed dataset. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |

to_hf_iterable
to_hf_iterable(
source: Callable[[], Iterator[RecordBatch]],
*,
spec: ExportSpec | None = None,
) -> IterableDataset
Build a streaming datasets.IterableDataset from a batch source.
The source is a zero-argument factory returning a fresh iterator of Arrow record batches, for example one driven by a PDS cursor or a Repository scan, so a large corpus trains without a full download. Each batch is narrowed by the spec's column projection before its rows are yielded.
datasets.IterableDataset.from_generator may invoke the underlying
generator more than once (re-iteration, multi-epoch training), so
source must yield a fresh iterator on each call. A one-shot factory
that returns an already-consumed iterator on a later call yields no rows
on re-iteration; callers driving a cursor must reset it per call.
| PARAMETER | DESCRIPTION |
|---|---|
| source | A zero-argument factory returning a fresh iterator of Arrow record batches. Called once per iteration of the resulting dataset, so it must return a fresh iterator each time. |
| spec | An optional export specification (column projection). |

| RETURNS | DESCRIPTION |
|---|---|
| IterableDataset | A streaming dataset over the batch source. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |
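The fresh-iterator requirement is easy to get wrong, so here is a minimal sketch of the difference between a correct factory and a one-shot factory. Plain lists of dicts stand in for Arrow record batches, and iterate_rows is a hypothetical stand-in for whatever consumes the source on each pass.

```python
from collections.abc import Callable, Iterator

Batch = list[dict[str, int]]  # stand-in for an Arrow RecordBatch


def iterate_rows(source: Callable[[], Iterator[Batch]]) -> Iterator[dict[str, int]]:
    """Call the factory once per pass and yield its rows batch by batch."""
    for batch in source():
        yield from batch


BATCHES: list[Batch] = [[{"id": 1}, {"id": 2}], [{"id": 3}]]


def fresh_source() -> Iterator[Batch]:
    # A new iterator on every call, so every pass sees all rows.
    return iter(BATCHES)


_shared = iter(BATCHES)


def one_shot_source() -> Iterator[Batch]:
    # Always the same iterator object: exhausted after the first pass.
    return _shared


epochs_fresh = [len(list(iterate_rows(fresh_source))) for _ in range(2)]
epochs_one_shot = [len(list(iterate_rows(one_shot_source))) for _ in range(2)]
print(epochs_fresh)     # [3, 3]
print(epochs_one_shot)  # [3, 0]
```

A source driven by a cursor would need to reopen or reset the cursor inside the factory to behave like fresh_source.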
task_template_for
task_template_for(
kind: str,
*,
subkind: str | None = None,
formalism: str | None = None,
) -> TaskTemplate | None
Return the most specific task template matching a Layers triple.
A template matches when its kind equals kind and each of its set
subkind and formalism fields equals the corresponding argument.
Templates are ranked by specificity, so a template constraining a subkind or
formalism is preferred over a kind-only template for the same kind.
| PARAMETER | DESCRIPTION |
|---|---|
| kind | The Layers annotation kind to match. |
| subkind | The Layers annotation subkind, when known. |
| formalism | The Layers formalism, when known. |

| RETURNS | DESCRIPTION |
|---|---|
| TaskTemplate or None | The best-matching template, or |
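The matching and ranking rule can be sketched as below. This is one plausible reading of the description, not the lairs source: SketchTemplate, best_template, the registry contents and the kind and subkind values are hypothetical, and ties between equally specific templates are broken by registry order here, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchTemplate:
    task: str
    kind: str
    subkind: Optional[str] = None
    formalism: Optional[str] = None

    def specificity(self) -> int:
        # One point per constrained field beyond the kind.
        return int(self.subkind is not None) + int(self.formalism is not None)


def best_template(
    templates: list[SketchTemplate],
    kind: str,
    *,
    subkind: Optional[str] = None,
    formalism: Optional[str] = None,
) -> Optional[SketchTemplate]:
    """Return the most specific template whose set fields all match."""

    def matches(t: SketchTemplate) -> bool:
        return (
            t.kind == kind
            and (t.subkind is None or t.subkind == subkind)
            and (t.formalism is None or t.formalism == formalism)
        )

    candidates = [t for t in templates if matches(t)]
    return max(candidates, key=SketchTemplate.specificity, default=None)


GENERIC = SketchTemplate(task="token-classification", kind="tag")
SPECIFIC = SketchTemplate(task="ner", kind="tag", subkind="ner")
REGISTRY = [GENERIC, SPECIFIC]
```

With this registry, a query for kind "tag" and subkind "ner" selects SPECIFIC, a query for kind "tag" alone selects GENERIC, and an unknown kind yields None.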
hf_features_from
hf_features_from(features: Features) -> Features
Derive a HuggingFace Features mapping from a lairs feature schema.
Each lairs.data.features.FeatureSpec becomes a HuggingFace
Value (or a Sequence of a Value for sequence tokens). Because the
lairs features are read off the generated model field specs, the resulting
HuggingFace schema always matches the lexicons.
| PARAMETER | DESCRIPTION |
|---|---|
| features | The lairs feature schema, typically from |

| RETURNS | DESCRIPTION |
|---|---|
| Features | The HuggingFace feature mapping. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |

HuggingFace Hub
lairs.integrations.hf.hub
HuggingFace Hub push and pull.
Mirrors a corpus to the Hub as Arrow/Parquet shards with an auto-generated
dataset card carrying full provenance, and reads a mirror back. The Hub is an
export and mirror target; the PDS and the didactic Repository stay canonical, so
the card records exactly the references needed to reproduce the mirror from
source: the corpus AT-URI, the Repository revision or tag, the vendored lexicon
manifest hash, and the license from the corpus record.
datasets and huggingface_hub are optional dependencies provided by the
lairs[hf] extra. They are imported lazily inside the functions that need
them, so importing this module never pulls them in; the concrete return types are
bound only under TYPE_CHECKING.
ProvenanceBundle
Bases: Model
The provenance carried on a mirrored corpus's dataset card.
The bundle records everything needed to trace a Hub mirror back to its canonical sources. The PDS and the Repository remain the source of truth; the bundle pins the exact corpus AT-URI, the Repository revision or tag, the vendored lexicon manifest hash, and the corpus license.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| corpus_uri | The AT-URI of the source |
| revision | The didactic Repository revision id the mirror was built from. |
| tag | The Repository tag naming the revision, when one exists. |
| lexicon_manifest_hash | The content hash of the vendored lexicon tree, from the manifest. |
| layers_version | The Layers lexicon version recorded in the manifest. |
| license | The license identifier from the |
| name | The corpus name from the |

provenance_bundle
provenance_bundle(
*,
corpus_uri: str | None = None,
revision: str | None = None,
tag: str | None = None,
license: str | None = None,
name: str | None = None,
) -> ProvenanceBundle
Assemble a provenance bundle, filling the lexicon manifest fields.
The lexicon manifest hash and Layers version are read from the vendored manifest packaged with lairs, so a mirror always records the schema version it was generated against; the remaining fields are supplied by the caller from the corpus record and the Repository revision being mirrored.
| PARAMETER | DESCRIPTION |
|---|---|
| corpus_uri | The AT-URI of the source |
| revision | The Repository revision id the mirror was built from. |
| tag | The Repository tag naming the revision. |
| license | The license identifier from the |
| name | The corpus name from the |

| RETURNS | DESCRIPTION |
|---|---|
| ProvenanceBundle | The assembled bundle with the lexicon manifest fields filled in. |

dataset_card
dataset_card(bundle: ProvenanceBundle) -> str
Render a markdown dataset card from a provenance bundle.
The card documents the canonical sources of the mirror so a reader can trace it back to the PDS and the Repository, which remain authoritative. A leading YAML front-matter block carries the machine-readable license and source AT-URI for the Hub dataset viewer and metadata indexing; the prose section below repeats the full provenance. Only the fields that are set appear, so a sparse bundle yields a compact card.
| PARAMETER | DESCRIPTION |
|---|---|
| bundle | The provenance to render. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The rendered markdown dataset card, prefixed with a YAML front-matter block when any front-matter field is set. |
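A rough sketch of this rendering rule follows: front matter only when a front-matter field is set, and only the set fields in the prose list. SketchBundle and render_card are hypothetical; the source_uri key, the fallback title and the bullet layout are assumptions, since the exact card layout lairs writes is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchBundle:
    corpus_uri: Optional[str] = None
    revision: Optional[str] = None
    tag: Optional[str] = None
    lexicon_manifest_hash: Optional[str] = None
    layers_version: Optional[str] = None
    license: Optional[str] = None
    name: Optional[str] = None


def render_card(bundle: SketchBundle) -> str:
    """Render a card that lists only the provenance fields that are set."""
    front = []
    if bundle.license:
        front.append(f"license: {bundle.license}")
    if bundle.corpus_uri:
        # Assumed key name; quoted because AT-URIs contain colons.
        front.append(f'source_uri: "{bundle.corpus_uri}"')
    lines = ["---", *front, "---", ""] if front else []
    lines.append(f"# {bundle.name or 'Mirrored corpus'}")
    lines.append("")
    labelled = [
        ("Corpus AT-URI", bundle.corpus_uri),
        ("Repository revision", bundle.revision),
        ("Repository tag", bundle.tag),
        ("Lexicon manifest hash", bundle.lexicon_manifest_hash),
        ("Layers version", bundle.layers_version),
        ("License", bundle.license),
    ]
    lines.extend(f"- {label}: {value}" for label, value in labelled if value)
    return "\n".join(lines) + "\n"


sparse_card = render_card(SketchBundle(revision="rev-1"))
full_card = render_card(
    SketchBundle(
        corpus_uri="at://did:example:alice/corpus/1",  # made-up AT-URI
        license="cc-by-4.0",
        name="Demo corpus",
    )
)
```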
push_to_hub
push_to_hub(
view: Table,
repo_id: str,
*,
private: bool = False,
provenance: ProvenanceBundle | None = None,
token: str | None = None,
split: str = "train",
config_name: str = "default",
) -> str
Push an Arrow view to the Hub with a provenance dataset card.
The Arrow view is written as the dataset's data (Arrow/Parquet shards) and the provenance bundle is rendered into the dataset card so the mirror records its canonical sources. The Hub is a mirror target only; the PDS and the Repository stay canonical.
The push is two Hub commits: datasets writes the data shards, then the
provenance card is uploaded as a second commit. These are not atomic; if the
card upload fails after the data is pushed, the mirror exists without its
provenance and the card upload must be retried. The data and card never
diverge in content, so a retry is always safe.
| PARAMETER | DESCRIPTION |
|---|---|
| view | The Arrow view to push. |
| repo_id | The target Hub dataset repository identifier ( |
| private | Whether to create a private repository. |
| provenance | The provenance to render into the dataset card. When omitted, a bundle carrying only the vendored lexicon manifest fields is used. |
| token | A HuggingFace access token with write scope. When omitted, the ambient login state (a prior |
| split | The dataset split name to write the shards under. |
| config_name | The dataset configuration (subset) name to write the shards under. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The URL of the pushed dataset on the Hub. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |
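Because the data push and the card upload are separate commits, a caller may want to retry only the second step. The sketch below shows that shape with two injected callables; push_then_card, its parameters and the choice to catch OSError are all assumptions for illustration, not part of lairs.

```python
from collections.abc import Callable


def push_then_card(
    push_data: Callable[[], str],
    upload_card: Callable[[], None],
    *,
    attempts: int = 3,
) -> str:
    """Push the data once, then retry only the card upload on failure."""
    url = push_data()  # commit 1: the data shards
    last_error = None
    for _ in range(max(1, attempts)):
        try:
            upload_card()  # commit 2: the provenance card
            return url
        except OSError as exc:  # assumed failure type for transport errors
            last_error = exc
    raise RuntimeError(
        f"data is at {url} but the card upload failed; retry the card upload"
    ) from last_error
```

The data commit is never repeated, which matches the note above that retrying the card upload alone is safe.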
load_from_hub
load_from_hub(
repo_id: str,
*,
split: str | None = None,
revision: str | None = None,
token: str | None = None,
) -> Dataset | DatasetDict
Load a mirrored dataset back from the Hub.
When split is given, a single datasets.Dataset is returned; when
it is omitted, datasets.load_dataset returns a
datasets.DatasetDict keyed by split name for a multi-split
repository (and a datasets.Dataset for a single-split repository).
Callers that need a concrete Dataset should pass split or index the
returned dict by split name.
| PARAMETER | DESCRIPTION |
|---|---|
| repo_id | The Hub dataset repository identifier. |
| split | A single split to load (for example |
| revision | A Hub revision (branch, tag, or commit) to read. |
| token | A HuggingFace access token for a private or gated repository. When omitted, the ambient login state is used. |

| RETURNS | DESCRIPTION |
|---|---|
| Dataset or DatasetDict | The loaded dataset when |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |

PyTorch
lairs.integrations.torch
PyTorch data-plane exporter.
Emits PyTorch datasets from an Arrow view, binding to the
lairs.integrations.ports.Exporter port. The exporter produces a
map-style torch.utils.data.Dataset, a
torch.utils.data.IterableDataset variant, and a collate helper,
bundled in a TorchExportResult.
PyTorch is an optional dependency (the lairs[torch] extra). It is imported
lazily inside the methods that need it, with a clear error when it is missing,
so importing this module never pulls torch in. The column-selection and
batching logic is pure Python over the Arrow table, so it is exercisable without
torch installed.
The Arrow view already carries typed anchor columns, so per-row union dispatch
is unnecessary: numeric and anchor columns become tensors directly, and the
remaining columns are passed through as Python values. The flat view carries no
blob payloads, so media bytes are not materialised here; the exporter records
the requested media-resolution intent on its result for a downstream loader
transform (which owns the lairs.media anchor-aware resolver and the blob
transport) to act on.
TorchExportSpec
Bases: Model
The export specification for the PyTorch exporter.
Selects which Arrow columns become tensor features, which are passed through
as plain Python values, and whether media references are resolved as records
flow. When columns is unset every column is kept; when tensor_columns
is unset the numeric (and anchor) columns are inferred from the Arrow schema.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| columns | The ordered subset of Arrow columns to keep. |
| tensor_columns | The columns to convert to tensors. |
| resolve_media | Whether to resolve a per-row media reference through the |

TorchExportResult
Bases: Model
The bundle a PyTorch export produces.
Carries the map-style dataset, the iterable-dataset variant, and the tensor
columns the collate helper stacks. The two datasets are behavioural
objects held in opaque fields; the tensor columns are typed metadata so a
caller can build a DataLoader with the matching collate function.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| dataset | The map-style dataset over the Arrow rows. |
| iterable | The iterable-dataset variant over the Arrow rows. |
| tensor_columns | The columns the collate helper stacks into tensors. |
| resolve_media | The recorded media-resolution intent from the export spec, for a downstream loader transform to act on. |

TorchExporter
An exporter that emits PyTorch datasets from an Arrow view.
Binds to lairs.integrations.ports.Exporter with the Arrow table as
its view, TorchExportSpec as its specification, and
TorchExportResult as its return type. PyTorch is imported lazily, so
constructing the exporter and inspecting an Arrow view never require the
optional extra.
export
export(
view: Table, *, spec: TorchExportSpec | None = None
) -> TorchExportResult
Export an Arrow view to a bundle of PyTorch datasets.
Builds a map-style dataset and an iterable-dataset variant over the kept
columns of the view, recording the tensor columns the bundled
collate helper stacks. The datasets read rows lazily from the Arrow
table, so no per-row tensor is built until a row is fetched.
| PARAMETER | DESCRIPTION |
|---|---|
| view | The flattened Arrow view to export. |
| spec | An optional export specification (column selection, tensor columns, media resolution). |

| RETURNS | DESCRIPTION |
|---|---|
| TorchExportResult | The map-style dataset, the iterable variant, and the tensor columns. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |
| KeyError | When the spec names a column absent from the view. |

collate_records
collate_records(
batch: list[dict[str, JsonValue]],
tensor_columns: tuple[str, ...],
) -> dict[str, JsonValue | Tensor]
Collate a batch of row mappings into a column-major batch mapping.
Tensor columns are stacked into a single torch tensor; the remaining
columns are collected into a list, one entry per row. The function is pure
apart from the lazy torch import that the tensor columns require, so a
batch with no tensor columns collates without torch installed.
| PARAMETER | DESCRIPTION |
|---|---|
| batch | The per-row mappings to collate. |
| tensor_columns | The columns to stack into a tensor. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | A column-major mapping: each tensor column maps to a stacked tensor, and every other column maps to a list of its per-row values. |
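The row-major to column-major step can be sketched as below. collate_sketch is a hypothetical stand-in, not the lairs function: returning an empty mapping for an empty batch is an assumption, and torch.as_tensor with torch.stack is one plausible way to stack values. torch is imported only when a tensor column is requested, so the example call runs without it.

```python
from typing import Any


def collate_sketch(
    batch: list[dict[str, Any]], tensor_columns: tuple[str, ...]
) -> dict[str, Any]:
    """Collate per-row mappings into one mapping of per-column values."""
    if not batch:
        return {}  # assumed behaviour for an empty batch
    # Every column starts as a plain list, one entry per row.
    columns: dict[str, Any] = {
        name: [row[name] for row in batch] for name in batch[0]
    }
    if tensor_columns:
        import torch  # lazy: only reached when a tensor column is requested

        for name in tensor_columns:
            columns[name] = torch.stack(
                [torch.as_tensor(value) for value in columns[name]]
            )
    return columns


rows = [{"uri": "a", "start": 0}, {"uri": "b", "start": 7}]
print(collate_sketch(rows, ()))  # {'uri': ['a', 'b'], 'start': [0, 7]}
```

A DataLoader would typically receive this through functools.partial with tensor_columns bound to the value recorded on the export result.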
TensorFlow tf.data
lairs.integrations.tfdata
TensorFlow tf.data data-plane exporter.
Emits a tf.data.Dataset from an Arrow view, binding to the
lairs.integrations.ports.Exporter port. Requires the lairs[tf]
extra at runtime; tensorflow is imported lazily so that importing this module,
deriving a feature spec from an Arrow schema, and running the unit tests never
require tensorflow to be installed.
The Arrow-schema to feature-spec derivation is pure and tensorflow-free: each
Arrow column maps to a TfFeatureSpec carrying a stable dtype token (and
a flag for list-valued columns). Converting those tokens to concrete
tf.dtypes.DType values, and building the tf.data.Dataset itself, are the
only steps that touch tensorflow, and they do so behind a lazy import.
TfFeatureSpec
Bases: Model
A single Arrow column described as a tensorflow feature.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| name | The column name. |
| dtype | The tensorflow dtype token (for example |
| is_sequence | Whether the column is list-valued (a ragged or variable-length feature), in which case |

TfDataSpec
Bases: Model
Options that shape the emitted tf.data.Dataset.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| columns | The columns to keep, in order. An empty tuple keeps every column. |
| batch_size | The batch size. When |
| shuffle_buffer | The shuffle buffer size. When |
| seed | The shuffle seed, used only when |
| drop_remainder | Whether a trailing partial batch is dropped when batching. |

TfDataExporter
An exporter that emits a tf.data.Dataset from an Arrow view.
The exporter binds to the generic
lairs.integrations.ports.Exporter port with the Arrow Table as
its view and TfDataSpec as its specification. tensorflow is imported
lazily inside export, so importing the module and deriving feature
specs never require the lairs[tf] extra.
export
export(
view: Table, *, spec: TfDataSpec | None = None
) -> Dataset
Export an Arrow view to a tf.data.Dataset.
Each retained column becomes one tensor in a dictionary-structured dataset, keyed by column name. The optional spec selects and orders columns and toggles shuffling and batching.
| PARAMETER | DESCRIPTION |
|---|---|
| view | The flattened Arrow view to export. |
| spec | An optional export specification. When |

| RETURNS | DESCRIPTION |
|---|---|
| Dataset | A dictionary-structured dataset, one entry per retained column. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When tensorflow is not installed. |

token_of
Map an Arrow data type to a tensorflow dtype token and a sequence flag.
The mapping is pure and tensorflow-free. List and large-list types are reported as sequences over their element token; every other type collapses to a scalar token. Unrecognised types fall back to the string token.
| PARAMETER | DESCRIPTION |
|---|---|
| arrow_type | The Arrow column type to map. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of (str, bool) | A |

feature_specs_of
feature_specs_of(
schema: Schema, *, columns: tuple[str, ...] = ()
) -> tuple[TfFeatureSpec, ...]
Derive the tensorflow feature specs for an Arrow schema.
The derivation is pure and tensorflow-free. Each retained column becomes one
TfFeatureSpec carrying its dtype token and whether it is list-valued.
| PARAMETER | DESCRIPTION |
|---|---|
| schema | The Arrow schema to describe. |
| columns | The columns to keep, in order. An empty tuple keeps every column in schema order. Names absent from the schema are skipped. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of TfFeatureSpec | One feature spec per retained column. |

WebDataset
lairs.integrations.webdataset
WebDataset data-plane exporter.
Emits tar shards for heavy media from an Arrow view, binding to the
lairs.integrations.ports.Exporter port. Each sample is a keyed group
of files: a __key__, a .json member holding the row's scalar fields, and,
when a media column is present, a media member carrying the resolved bytes.
The tar-writing path uses the standard-library tarfile module, so basic sharding
is exercised without the optional webdataset library. The webdataset
library (the lairs[webdataset] extra) is imported lazily inside the read-back
loader only, with a clear error when it is missing, so importing this module never
pulls the dependency in.
WebDatasetSpec
Bases: Model
An export specification for the WebDataset exporter.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| output_dir | The directory the tar shards are written into. Created if absent. |
| shard_size | The maximum number of samples per shard. The final shard may be smaller. |
| shard_prefix | The filename stem each shard is named after ( |
| key_column | The Arrow column whose value names each sample (its |
| media_column | The Arrow column carrying media bytes or a resolvable media record. When present, each sample gains a media member alongside its json metadata. |

WebDatasetExporter
An exporter that emits WebDataset tar shards from an Arrow view.
The exporter binds to the generic lairs.integrations.ports.Exporter
port with a pyarrow.Table view, a WebDatasetSpec
specification, and a list of written shard paths as its return type.
export
export(
view: Table, *, spec: WebDatasetSpec | None = None
) -> list[Path]
Export an Arrow view to WebDataset tar shards.
Each row becomes one sample. A sample carries a .json member with the
row's scalar (non-media) fields and, when spec.media_column is set, a
media member with the resolved bytes. Samples are grouped into shards of
at most spec.shard_size rows, each written as a tar archive.
| PARAMETER | DESCRIPTION |
|---|---|
| view | The flattened Arrow view to export. |
| spec | The export specification. A default spec is used when omitted. |

| RETURNS | DESCRIPTION |
|---|---|
| list of pathlib.Path | The written tar shard files, in shard order. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | When |
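The sharding scheme can be sketched with the standard library alone. This is not the lairs writer: write_shards, its defaults, the shard file naming pattern and the .bin media extension are assumptions, a list of dicts stands in for the Arrow view, and the sketch does no input validation.

```python
import io
import json
import tarfile
from pathlib import Path
from typing import Any, Optional


def _add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory payload to the archive under the given member name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shards(
    rows: list[dict[str, Any]],
    output_dir: str,
    *,
    shard_size: int = 1000,
    shard_prefix: str = "shard",
    key_column: str = "id",
    media_column: Optional[str] = None,
) -> list[Path]:
    """Write rows as tar shards of at most shard_size samples each."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(rows), shard_size):
        # Assumed naming pattern: <prefix>-<six-digit shard index>.tar
        path = out / f"{shard_prefix}-{start // shard_size:06d}.tar"
        with tarfile.open(path, "w") as tar:
            for row in rows[start : start + shard_size]:
                key = str(row[key_column])
                # Members sharing a basename form one sample.
                meta = {k: v for k, v in row.items() if k != media_column}
                _add_member(tar, f"{key}.json", json.dumps(meta).encode("utf-8"))
                if media_column is not None:
                    _add_member(tar, f"{key}.bin", row[media_column])
        paths.append(path)
    return paths
```

Each sample's members share one basename, which is the grouping convention WebDataset-style readers use to reassemble a sample.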
load
load(shards: list[Path]) -> Iterator[dict[str, JsonValue]]
Read shards back through the webdataset loader.
This is the read-back path used by training loops; it requires the
optional webdataset library and is imported lazily so importing this
module never pulls the dependency in.
| PARAMETER | DESCRIPTION |
|---|---|
| shards | The shard files to read, in order. |

| RETURNS | DESCRIPTION |
|---|---|
| collections.abc.Iterator of dict | The decoded samples, one mapping per sample. |

| RAISES | DESCRIPTION |
|---|---|
| ImportError | When the optional |