Exporters¶

The data-plane exporters bind to the Exporter port and emit framework-native datasets from an Arrow view. Each backend library is an optional dependency that is imported lazily. For usage, see [Guides: Exporters](../guide/exporters.md).

HuggingFace datasets¶

lairs.integrations.hf.datasets ¶

HuggingFace datasets exporter.

Builds a datasets.Dataset straight from an Arrow view, binding to the :class:`~lairs.integrations.ports.Exporter` port. Because the Arrow flattening done by :mod:`lairs.store.arrow` already resolves the polymorphic Layers anchors into typed columns, the schema logic here stays thin: the exporter wraps the Arrow table (near zero-copy), optionally selects columns for a task template, and derives a HuggingFace Features mapping from the generated model field specs through :mod:`lairs.data.features`.

datasets is an optional dependency provided by the lairs[hf] extra. It is imported lazily inside the methods that need it, so importing this module never pulls datasets in; the concrete return types are bound only under TYPE_CHECKING.
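The lazy-import pattern described above can be sketched as follows. This is a minimal illustration, not the lairs implementation: `require` and `rows_to_dataset` are hypothetical names, and the error wording is an assumption. The only third-party call used is `datasets.Dataset.from_list`, and it is reached only when the function is actually called.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, or fail with a hint.

    Hypothetical helper for illustration; it is not part of lairs.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is not installed; "
            f"install the lairs[{extra}] extra to use this exporter"
        ) from exc


def rows_to_dataset(rows: list[dict[str, object]]) -> object:
    """Build a dataset from rows, importing `datasets` only when called."""
    datasets = require("datasets", "hf")  # deferred until this call
    return datasets.Dataset.from_list(rows)
```

Because the import sits inside the function body, importing the module that defines `rows_to_dataset` succeeds even when `datasets` is absent; the `ImportError` surfaces only at call time.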

Shape ¶

Shape = Literal['nested', 'exploded']

The tabular shape an Arrow view is exported in.

nested keeps one row per expression with annotations as sequence-valued columns; exploded keeps one row per annotation. The Arrow builders in :mod:`lairs.store.arrow` already produce one shape or the other, so the shape is descriptive metadata the exporter records rather than a re-shaping step.
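To make the relation between the two shapes concrete, here is a plain-Python sketch. It is illustrative only: the exporter does not reshape, and the column names (`expression`, `label`, `start`) and the `explode` helper are hypothetical.

```python
from typing import Any

# Hypothetical nested rows: one row per expression, annotations as lists.
nested_rows: list[dict[str, Any]] = [
    {"expression": "e1", "label": ["B-PER", "O"], "start": [0, 4]},
    {"expression": "e2", "label": ["O"], "start": [0]},
]


def explode(
    rows: list[dict[str, Any]], sequence_columns: tuple[str, ...]
) -> list[dict[str, Any]]:
    """Turn one row per expression into one row per annotation."""
    exploded: list[dict[str, Any]] = []
    for row in rows:
        lengths = {len(row[name]) for name in sequence_columns}
        if len(lengths) > 1:
            raise ValueError("sequence columns must have equal lengths")
        count = lengths.pop() if lengths else 0
        for i in range(count):
            flat = dict(row)  # scalar columns are repeated on every row
            for name in sequence_columns:
                flat[name] = row[name][i]
            exploded.append(flat)
    return exploded
```

Calling `explode(nested_rows, ("label", "start"))` yields three rows, one per annotation, with `expression` repeated on each.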

TaskTemplate ¶

Bases: Model

A canonical HuggingFace task shape for a Layers annotation layer.

A template maps a Layers (kind, subkind, formalism) triple to a named HuggingFace task and the columns that task expects, so a token-classification or span layer exports under the conventional column names HuggingFace tooling recognises.

ATTRIBUTE	DESCRIPTION
`task`	The canonical HuggingFace task name (for example `"token-classification"`). TYPE: `str`
`kind`	The Layers annotation kind this template applies to. TYPE: `str`
`subkind`	The Layers annotation subkind, when the template is subkind-specific. TYPE: `(str or None, optional)`
`formalism`	The Layers formalism, when the template is formalism-specific. TYPE: `(str or None, optional)`
`columns`	The columns the task shape expects, in order. TYPE: `tuple of str, optional`

ExportSpec ¶

Bases: Model

An export specification controlling the HuggingFace dataset shape.

The Arrow flattening has already done the heavy lifting, so the spec only records the tabular shape, an optional column projection, and an optional canonical task name. It is a plain didactic model so it is serialisable and carries cleanly into a dataset card's provenance.

ATTRIBUTE	DESCRIPTION
`shape`	The tabular shape of the Arrow view being exported. TYPE: `({'nested', 'exploded'}, optional)`
`columns`	An optional projection: when set, only these columns are kept, in this order. Columns absent from the view are skipped. TYPE: `tuple of str or None, optional`
`task`	An optional canonical HuggingFace task name, recorded for downstream tooling and dataset-card provenance. TYPE: `(str or None, optional)`
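The projection semantics of `columns` (keep the requested order, skip names absent from the view) can be sketched in plain Python. The `project` helper is hypothetical and works on column names only, not on an Arrow table.

```python
from typing import Optional, Sequence


def project(
    view_columns: Sequence[str], requested: Optional[Sequence[str]]
) -> list[str]:
    """Return the columns an export would keep, in output order."""
    if requested is None:
        # No projection: every column, in the view's own order.
        return list(view_columns)
    present = set(view_columns)
    # Keep the requested order; silently skip names the view lacks.
    return [name for name in requested if name in present]
```

For a view with columns `id`, `tokens`, `labels`, requesting `("labels", "missing", "id")` keeps `labels` and `id` in that order.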

for_template `classmethod` ¶

for_template(
    template: TaskTemplate, *, shape: Shape = "nested"
) -> ExportSpec

Build a spec that selects a task template's columns.

PARAMETER	DESCRIPTION
`template`	The task template whose columns to project and whose task to record. TYPE: `TaskTemplate`
`shape`	The tabular shape of the view being exported. TYPE: `('nested', 'exploded')` DEFAULT: `"nested"`

RETURNS	DESCRIPTION
`ExportSpec`	A spec projecting the template's columns and recording its task.

HuggingFaceExporter ¶

An exporter that emits a datasets.Dataset from an Arrow view.

The exporter binds to the :class:`~lairs.integrations.ports.Exporter` port with the Arrow table as the view, :class:`ExportSpec` as the spec, and a datasets.Dataset as the produced object. Because the Arrow view already carries typed anchor columns, the exporter only applies the spec's column projection and hands the table to datasets near zero-copy.

export ¶

export(
    view: Table, *, spec: ExportSpec | None = None
) -> Dataset

Export an Arrow view to a HuggingFace dataset.

The export wraps the Arrow table directly: datasets builds an Arrow-backed dataset with no row-wise copy. When the spec carries a column projection, the table is narrowed to those columns first.

PARAMETER	DESCRIPTION
`view`	The flattened Arrow view to export. TYPE: `Table`
`spec`	An optional export specification (shape, column projection, task). TYPE: `ExportSpec or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	The exported, Arrow-backed dataset.

RAISES	DESCRIPTION
`ImportError`	When the optional `datasets` dependency is not installed.

to_hf_iterable ¶

to_hf_iterable(
    source: Callable[[], Iterator[RecordBatch]],
    *,
    spec: ExportSpec | None = None,
) -> IterableDataset

Build a streaming datasets.IterableDataset from a batch source.

The source is a zero-argument factory returning a fresh iterator of Arrow record batches, for example one driven by a PDS cursor or a Repository scan, so a large corpus trains without a full download. Each batch is narrowed by the spec's column projection before its rows are yielded.

datasets.IterableDataset.from_generator may invoke the underlying generator more than once (re-iteration, multi-epoch training), so source must return a fresh iterator on each call. A one-shot factory that hands back an already-consumed iterator on a later call yields no rows on re-iteration; callers driving a cursor must reset it per call.

PARAMETER	DESCRIPTION
`source`	A zero-argument factory returning a fresh iterator of Arrow record batches. Called once per iteration of the resulting dataset, so it must return a fresh iterator each time. TYPE: `Callable`
`spec`	An optional export specification (column projection). TYPE: `ExportSpec or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`IterableDataset`	A streaming dataset over the batch source.

RAISES	DESCRIPTION
`ImportError`	When the optional `datasets` dependency is not installed.
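The fresh-iterator requirement on `source` is easy to get wrong, so here is a small sketch of a correct factory next to a one-shot one. Plain lists stand in for Arrow record batches, and every name is hypothetical.

```python
from typing import Callable, Iterator

Row = dict[str, int]
Batch = list[Row]  # stand-in for an Arrow RecordBatch


def make_batches() -> Iterator[Batch]:
    """A generator function: each call starts a new pass over the data."""
    yield [{"id": 0}, {"id": 1}]
    yield [{"id": 2}]


# Correct: the generator function itself is the zero-argument factory.
fresh_source: Callable[[], Iterator[Batch]] = make_batches

# Incorrect: the iterator is created once and shared across calls.
_shared = make_batches()


def one_shot_source() -> Iterator[Batch]:
    return _shared


def count_rows(source: Callable[[], Iterator[Batch]]) -> int:
    """Consume one full pass, the way a single epoch would."""
    return sum(len(batch) for batch in source())
```

Two passes over `fresh_source` each see all three rows. The second pass over `one_shot_source` sees none, which is the silent failure the paragraph above warns about.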

task_template_for ¶

task_template_for(
    kind: str,
    *,
    subkind: str | None = None,
    formalism: str | None = None,
) -> TaskTemplate | None

Return the most specific task template matching a Layers triple.

A template matches when its kind equals kind and each of its set subkind and formalism fields equals the corresponding argument. Templates are ranked by specificity, so a template constraining a subkind or formalism is preferred over a kind-only template for the same kind.

PARAMETER	DESCRIPTION
`kind`	The Layers annotation kind to match. TYPE: `str`
`subkind`	The Layers annotation subkind, when known. TYPE: `str or None` DEFAULT: `None`
`formalism`	The Layers formalism, when known. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`TaskTemplate or None`	The best-matching template, or `None` when no template applies.
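The matching and ranking rule can be sketched as below. This is an assumption-laden illustration: the `Template` dataclass, the registry contents, and every task name other than `"token-classification"` are hypothetical, and specificity is taken to be the number of constrained fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Template:
    task: str
    kind: str
    subkind: Optional[str] = None
    formalism: Optional[str] = None


# Hypothetical registry, from least to most specific.
TEMPLATES = (
    Template("token-classification", kind="token"),
    Template("token-classification-pos", kind="token", subkind="pos"),
    Template("token-classification-pos-ud", kind="token", subkind="pos", formalism="ud"),
)


def matches(t: Template, kind: str, subkind: Optional[str], formalism: Optional[str]) -> bool:
    """A template matches when its kind agrees and each set field agrees."""
    return (
        t.kind == kind
        and (t.subkind is None or t.subkind == subkind)
        and (t.formalism is None or t.formalism == formalism)
    )


def specificity(t: Template) -> int:
    """Count the optional fields the template constrains."""
    return (t.subkind is not None) + (t.formalism is not None)


def best_template(
    kind: str, *, subkind: Optional[str] = None, formalism: Optional[str] = None
) -> Optional[Template]:
    candidates = [t for t in TEMPLATES if matches(t, kind, subkind, formalism)]
    return max(candidates, key=specificity, default=None)
```

With this registry, a lookup for kind `token` and subkind `pos` prefers the subkind-specific template over the kind-only one, and an unknown kind returns `None`.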

hf_features_from ¶

hf_features_from(features: Features) -> Features

Derive a HuggingFace Features mapping from a lairs feature schema.

Each lairs :class:`~lairs.data.features.FeatureSpec` becomes a HuggingFace Value (or a Sequence of a Value for sequence tokens). Because the lairs features are read off the generated model field specs, the resulting HuggingFace schema always matches the lexicons.

PARAMETER	DESCRIPTION
`features`	The lairs feature schema, typically from :func:`lairs.data.features.features_of`. TYPE: `Features`

RETURNS	DESCRIPTION
`Features`	The HuggingFace feature mapping.

RAISES	DESCRIPTION
`ImportError`	When the optional `datasets` dependency is not installed.

HuggingFace Hub¶

lairs.integrations.hf.hub ¶

HuggingFace Hub push and pull.

Mirrors a corpus to the Hub as Arrow/Parquet shards with an auto-generated dataset card carrying full provenance, and reads a mirror back. The Hub is an export and mirror target; the PDS and the didactic Repository stay canonical, so the card records exactly the references needed to reproduce the mirror from source: the corpus AT-URI, the Repository revision or tag, the vendored lexicon manifest hash, and the license from the corpus record.

datasets and huggingface_hub are optional dependencies provided by the lairs[hf] extra. They are imported lazily inside the functions that need them, so importing this module never pulls them in; the concrete return types are bound only under TYPE_CHECKING.

ProvenanceBundle ¶

Bases: Model

The provenance carried on a mirrored corpus's dataset card.

The bundle records everything needed to trace a Hub mirror back to its canonical sources. The PDS and the Repository remain the source of truth; the bundle pins the exact corpus AT-URI, the Repository revision or tag, the vendored lexicon manifest hash, and the corpus license.

ATTRIBUTE	DESCRIPTION
`corpus_uri`	The AT-URI of the source `corpus` record. TYPE: `(str or None, optional)`
`revision`	The didactic Repository revision id the mirror was built from. TYPE: `(str or None, optional)`
`tag`	The Repository tag naming the revision, when one exists. TYPE: `(str or None, optional)`
`lexicon_manifest_hash`	The content hash of the vendored lexicon tree, from the manifest. TYPE: `(str or None, optional)`
`layers_version`	The Layers lexicon version recorded in the manifest. TYPE: `(str or None, optional)`
`license`	The license identifier from the `corpus` record. TYPE: `(str or None, optional)`
`name`	The corpus name from the `corpus` record. TYPE: `(str or None, optional)`

provenance_bundle ¶

provenance_bundle(
    *,
    corpus_uri: str | None = None,
    revision: str | None = None,
    tag: str | None = None,
    license: str | None = None,
    name: str | None = None,
) -> ProvenanceBundle

Assemble a provenance bundle, filling the lexicon manifest fields.

The lexicon manifest hash and Layers version are read from the vendored manifest packaged with lairs, so a mirror always records the schema version it was generated against; the remaining fields are supplied by the caller from the corpus record and the Repository revision being mirrored.

PARAMETER	DESCRIPTION
`corpus_uri`	The AT-URI of the source `corpus` record. TYPE: `str or None` DEFAULT: `None`
`revision`	The Repository revision id the mirror was built from. TYPE: `str or None` DEFAULT: `None`
`tag`	The Repository tag naming the revision. TYPE: `str or None` DEFAULT: `None`
`license`	The license identifier from the `corpus` record. TYPE: `str or None` DEFAULT: `None`
`name`	The corpus name from the `corpus` record. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`ProvenanceBundle`	The assembled bundle with the lexicon manifest fields filled in.

dataset_card ¶

dataset_card(bundle: ProvenanceBundle) -> str

Render a markdown dataset card from a provenance bundle.

The card documents the canonical sources of the mirror so a reader can trace it back to the PDS and the Repository, which remain authoritative. A leading YAML front-matter block carries the machine-readable license and source AT-URI for the Hub dataset viewer and metadata indexing; the prose section below repeats the full provenance. Only the fields that are set appear, so a sparse bundle yields a compact card.

PARAMETER	DESCRIPTION
`bundle`	The provenance to render. TYPE: `ProvenanceBundle`

RETURNS	DESCRIPTION
`str`	The rendered markdown dataset card, prefixed with a YAML front-matter block when any front-matter field is set.
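A rough sketch of the card layout described above, under stated assumptions: the `Bundle` dataclass stands in for `ProvenanceBundle`, the `source` front-matter key and the prose layout are hypothetical, and values are JSON-quoted so they remain valid YAML scalars. The real card format may differ.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    """Stand-in for ProvenanceBundle, with the same field names."""
    corpus_uri: Optional[str] = None
    revision: Optional[str] = None
    tag: Optional[str] = None
    lexicon_manifest_hash: Optional[str] = None
    layers_version: Optional[str] = None
    license: Optional[str] = None
    name: Optional[str] = None


def render_card(bundle: Bundle) -> str:
    """Render front matter (when any field is set) plus a prose section."""
    front: list[str] = []
    if bundle.license is not None:
        front.append(f"license: {json.dumps(bundle.license)}")
    if bundle.corpus_uri is not None:
        # "source" is a hypothetical key name for the AT-URI.
        front.append(f"source: {json.dumps(bundle.corpus_uri)}")

    lines: list[str] = []
    if front:
        lines += ["---", *front, "---", ""]
    lines += [f"# {bundle.name or 'Dataset'}", "", "## Provenance", ""]
    for field in fields(bundle):
        value = getattr(bundle, field.name)
        if value is not None:  # only set fields appear
            lines.append(f"- {field.name}: {value}")
    return "\n".join(lines) + "\n"
```

An empty bundle renders no front-matter block at all, while a bundle with only a license and a name renders a two-line front matter and two provenance bullets.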

push_to_hub ¶

push_to_hub(
    view: Table,
    repo_id: str,
    *,
    private: bool = False,
    provenance: ProvenanceBundle | None = None,
    token: str | None = None,
    split: str = "train",
    config_name: str = "default",
) -> str

Push an Arrow view to the Hub with a provenance dataset card.

The Arrow view is written as the dataset's data (Arrow/Parquet shards) and the provenance bundle is rendered into the dataset card so the mirror records its canonical sources. The Hub is a mirror target only; the PDS and the Repository stay canonical.

The push is two Hub commits: datasets writes the data shards, then the provenance card is uploaded as a second commit. These are not atomic; if the card upload fails after the data is pushed, the mirror exists without its provenance and the card upload must be retried. The data and card never diverge in content, so a retry is always safe.

PARAMETER	DESCRIPTION
`view`	The Arrow view to push. TYPE: `Table`
`repo_id`	The target Hub dataset repository identifier (`"org/name"`). TYPE: `str`
`private`	Whether to create a private repository. TYPE: `bool` DEFAULT: `False`
`provenance`	The provenance to render into the dataset card. When omitted, a bundle carrying only the vendored lexicon manifest fields is used. TYPE: `ProvenanceBundle or None` DEFAULT: `None`
`token`	A HuggingFace access token with write scope. When omitted, the ambient login state (a prior `huggingface-cli login` or the `HF_TOKEN` environment variable) is used; an unauthenticated caller surfaces a `huggingface_hub` authentication error from the underlying push. TYPE: `str or None` DEFAULT: `None`
`split`	The dataset split name to write the shards under. TYPE: `str` DEFAULT: `'train'`
`config_name`	The dataset configuration (subset) name to write the shards under. TYPE: `str` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`str`	The URL of the pushed dataset on the Hub.

RAISES	DESCRIPTION
`ImportError`	When the optional `datasets` or `huggingface_hub` dependency is not installed.
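The two-commit behaviour suggests a retry policy in which only the card step is repeated. The sketch below shows that shape with both steps passed in as hypothetical callables; it does not call lairs or `huggingface_hub`, and the function name, attempt count, and delay are assumptions.

```python
import time
from typing import Callable, Optional


def push_with_card_retry(
    push_data: Callable[[], str],
    upload_card: Callable[[], None],
    *,
    attempts: int = 3,
    delay: float = 0.0,
) -> str:
    """Push the data once, then retry only the card upload on failure."""
    url = push_data()  # first commit: the data shards
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            upload_card()  # second commit: the provenance card
            return url
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay)
    raise RuntimeError(
        "data was pushed but the card upload failed; retry the card upload"
    ) from last_error
```

The data step is never repeated, so a transient failure in the card upload does not rewrite the shards.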

load_from_hub ¶

load_from_hub(
    repo_id: str,
    *,
    split: str | None = None,
    revision: str | None = None,
    token: str | None = None,
) -> Dataset | DatasetDict

Load a mirrored dataset back from the Hub.

When split is given, a single :class:`datasets.Dataset` is returned; when it is omitted, datasets.load_dataset returns a :class:`datasets.DatasetDict` keyed by split name for a multi-split repository (and a :class:`datasets.Dataset` for a single-split repository). Callers that need a concrete Dataset should pass split or index the returned dict by split name.

PARAMETER	DESCRIPTION
`repo_id`	The Hub dataset repository identifier. TYPE: `str`
`split`	A single split to load (for example `"train"`). When set, a `datasets.Dataset` is returned; when omitted, the whole `DatasetDict` is returned for a multi-split repository. TYPE: `str or None` DEFAULT: `None`
`revision`	A Hub revision (branch, tag, or commit) to read. TYPE: `str or None` DEFAULT: `None`
`token`	A HuggingFace access token for a private or gated repository. When omitted, the ambient login state is used. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset or DatasetDict`	The loaded dataset when `split` is given (or the repository is single-split), otherwise the split-keyed mapping.

RAISES	DESCRIPTION
`ImportError`	When the optional `datasets` dependency is not installed.
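Callers that did not pass `split` can normalise the union return type with a small helper. This sketch relies only on `datasets.DatasetDict` being a `dict` subclass; the helper name and the default split are hypothetical.

```python
from typing import Any


def single_split(loaded: Any, split: str = "train") -> Any:
    """Return one split from a load result that may be a split-keyed dict."""
    if isinstance(loaded, dict):  # DatasetDict subclasses dict
        try:
            return loaded[split]
        except KeyError:
            raise KeyError(
                f"split {split!r} not found; available: {sorted(loaded)}"
            ) from None
    return loaded  # already a single dataset
```

Passing `split` to `load_from_hub` directly avoids the need for this helper altogether.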

PyTorch¶

lairs.integrations.torch ¶

PyTorch data-plane exporter.

Emits PyTorch datasets from an Arrow view, binding to the :class:`~lairs.integrations.ports.Exporter` port. The exporter produces a map-style :class:`torch.utils.data.Dataset`, an :class:`torch.utils.data.IterableDataset` variant, and a collate helper, bundled in a :class:`TorchExportResult`.

PyTorch is an optional dependency (the lairs[torch] extra). It is imported lazily inside the methods that need it, with a clear error when it is missing, so importing this module never pulls torch in. The column-selection and batching logic is pure Python over the Arrow table, so it is exercisable without torch installed.

The Arrow view already carries typed anchor columns, so per-row union dispatch is unnecessary: numeric and anchor columns become tensors directly, and the remaining columns are passed through as Python values. The flat view carries no blob payloads, so media bytes are not materialised here; the exporter records the requested media-resolution intent on its result for a downstream loader transform (which owns the :mod:`lairs.media` anchor-aware resolver and the blob transport) to act on.

TorchExportSpec ¶

Bases: Model

The export specification for the PyTorch exporter.

Selects which Arrow columns become tensor features, which are passed through as plain Python values, and whether media references are resolved as records flow. When columns is unset every column is kept; when tensor_columns is unset the numeric (and anchor) columns are inferred from the Arrow schema.

ATTRIBUTE	DESCRIPTION
`columns`	The ordered subset of Arrow columns to keep. `None` keeps every column in the view's schema order. TYPE: `tuple of str or None, optional`
`tensor_columns`	The columns to convert to tensors. `None` selects the numeric and anchor columns automatically from the Arrow schema. TYPE: `tuple of str or None, optional`
`resolve_media`	Whether to resolve a per-row media reference through the :mod:`lairs.media` anchor resolver as rows are produced. TYPE: `(bool, optional)`

TorchExportResult ¶

Bases: Model

The bundle a PyTorch export produces.

Carries the map-style dataset, the iterable-dataset variant, and the tensor columns the collate helper stacks. The two datasets are behavioural objects held in opaque fields; the tensor columns are typed metadata so a caller can build a DataLoader with the matching collate function.

ATTRIBUTE	DESCRIPTION
`dataset`	The map-style dataset over the Arrow rows. TYPE: `Dataset`
`iterable`	The iterable-dataset variant over the Arrow rows. TYPE: `IterableDataset`
`tensor_columns`	The columns the collate helper stacks into tensors. TYPE: `tuple of str, optional`
`resolve_media`	The recorded media-resolution intent from the export spec, for a downstream loader transform to act on. TYPE: `(bool, optional)`

collate ¶

collate(
    batch: list[dict[str, JsonValue]],
) -> dict[str, JsonValue | Tensor]

Collate a batch of rows, stacking the tensor columns.

PARAMETER	DESCRIPTION
`batch`	The per-row mappings to collate. TYPE: `list of dict`

RETURNS	DESCRIPTION
`dict`	The column-major batch with tensor columns stacked.

TorchExporter ¶

An exporter that emits PyTorch datasets from an Arrow view.

Binds to :class:`~lairs.integrations.ports.Exporter` with the Arrow table as its view, :class:`TorchExportSpec` as its specification, and :class:`TorchExportResult` as its return type. PyTorch is imported lazily, so constructing the exporter and inspecting an Arrow view never require the optional extra.

export ¶

export(
    view: Table, *, spec: TorchExportSpec | None = None
) -> TorchExportResult

Export an Arrow view to a bundle of PyTorch datasets.

Builds a map-style dataset and an iterable-dataset variant over the kept columns of the view, recording the tensor columns the bundled collate helper stacks. The datasets read rows lazily from the Arrow table, so no per-row tensor is built until a row is fetched.

PARAMETER	DESCRIPTION
`view`	The flattened Arrow view to export. TYPE: `Table`
`spec`	An optional export specification (column selection, tensor columns, media resolution). `None` keeps every column and infers the tensor columns from the schema. TYPE: `TorchExportSpec or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`TorchExportResult`	The map-style dataset, the iterable variant, and the tensor columns.

RAISES	DESCRIPTION
`ImportError`	When the optional `torch` dependency is not installed.
`KeyError`	When the spec names a column absent from the view.
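The lazy, map-style access pattern can be sketched without torch, since a map-style dataset only needs `__len__` and `__getitem__`. The `LazyRows` class is hypothetical, and a dict of column lists stands in for the Arrow table.

```python
from typing import Any, Optional, Sequence


class LazyRows:
    """Map-style view over column data; a row is built only when fetched."""

    def __init__(
        self,
        columns: dict[str, list[Any]],
        keep: Optional[Sequence[str]] = None,
    ) -> None:
        names = list(columns) if keep is None else list(keep)
        missing = [name for name in names if name not in columns]
        if missing:
            # Mirrors the documented KeyError for columns absent from the view.
            raise KeyError(f"columns absent from the view: {missing}")
        self._columns = {name: columns[name] for name in names}
        first = next(iter(self._columns.values()), [])
        self._length = len(first)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int) -> dict[str, Any]:
        if not 0 <= index < self._length:
            raise IndexError(index)
        # The per-row mapping is assembled here, not at construction time.
        return {name: column[index] for name, column in self._columns.items()}
```

An object of this shape can be passed to a `torch.utils.data.DataLoader` together with a collate function, which is where tensor stacking would happen.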

collate_records ¶

collate_records(
    batch: list[dict[str, JsonValue]],
    tensor_columns: tuple[str, ...],
) -> dict[str, JsonValue | Tensor]

Collate a batch of row mappings into a column-major batch mapping.

Tensor columns are stacked into a single torch tensor; the remaining columns are collected into a list, one entry per row. The function is pure apart from the lazy torch import that the tensor columns require, so a batch with no tensor columns collates without torch installed.

PARAMETER	DESCRIPTION
`batch`	The per-row mappings to collate. TYPE: `list of dict`
`tensor_columns`	The columns to stack into a tensor. TYPE: `tuple of str`

RETURNS	DESCRIPTION
`dict`	A column-major mapping: each tensor column maps to a stacked tensor, and every other column maps to a list of its per-row values.
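A sketch of the column-major collation described above. It assumes every row has the same keys; `collate_rows` and its `stack` parameter are hypothetical, with `stack` letting the tensor step be swapped out so the non-tensor path runs without torch.

```python
from typing import Any, Callable, Optional, Sequence


def collate_rows(
    batch: list[dict[str, Any]],
    tensor_columns: Sequence[str],
    stack: Optional[Callable[[list[Any]], Any]] = None,
) -> dict[str, Any]:
    """Transpose row mappings into one entry per column."""
    if not batch:
        return {}
    out: dict[str, Any] = {}
    for name in batch[0]:
        values = [row[name] for row in batch]
        if name in tensor_columns:
            if stack is None:
                import torch  # lazy: only needed when a tensor column exists

                out[name] = torch.tensor(values)
            else:
                out[name] = stack(values)
        else:
            out[name] = values  # passed through as a plain list
    return out
```

With an empty `tensor_columns`, the function never imports torch and simply returns lists per column.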

TensorFlow tf.data¶

lairs.integrations.tfdata ¶

TensorFlow tf.data data-plane exporter.

Emits a tf.data.Dataset from an Arrow view, binding to the :class:`~lairs.integrations.ports.Exporter` port. Requires the lairs[tf] extra at runtime; tensorflow is imported lazily so that importing this module, deriving a feature spec from an Arrow schema, and running the unit tests never require tensorflow to be installed.

The Arrow-schema to feature-spec derivation is pure and tensorflow-free: each Arrow column maps to a :class:`TfFeatureSpec` carrying a stable dtype token (and a flag for list-valued columns). Converting those tokens to concrete tf.dtypes.DType values, and building the tf.data.Dataset itself, are the only steps that touch tensorflow, and they do so behind a lazy import.

TfFeatureSpec ¶

Bases: Model

A single Arrow column described as a tensorflow feature.

ATTRIBUTE	DESCRIPTION
`name`	The column name. TYPE: `str`
`dtype`	The tensorflow dtype token (for example `"int64"` or `"string"`). TYPE: `str`
`is_sequence`	Whether the column is list-valued (a ragged or variable-length feature), in which case `dtype` describes the list's element type. TYPE: `(bool, optional)`

TfDataSpec ¶

Bases: Model

Options that shape the emitted tf.data.Dataset.

ATTRIBUTE	DESCRIPTION
`columns`	The columns to keep, in order. An empty tuple keeps every column. TYPE: `tuple of str, optional`
`batch_size`	The batch size. When `None` the dataset is not batched. TYPE: `(int or None, optional)`
`shuffle_buffer`	The shuffle buffer size. When `None` the dataset is not shuffled. TYPE: `(int or None, optional)`
`seed`	The shuffle seed, used only when `shuffle_buffer` is set. TYPE: `(int or None, optional)`
`drop_remainder`	Whether a trailing partial batch is dropped when batching. TYPE: `(bool, optional)`

TfDataExporter ¶

An exporter that emits a tf.data.Dataset from an Arrow view.

The exporter binds to the generic :class:`~lairs.integrations.ports.Exporter` port with the Arrow Table as its view and :class:`TfDataSpec` as its specification. tensorflow is imported lazily inside :meth:`export`, so importing the module and deriving feature specs never require the lairs[tf] extra.

export ¶

export(
    view: Table, *, spec: TfDataSpec | None = None
) -> Dataset

Export an Arrow view to a tf.data.Dataset.

Each retained column becomes one tensor in a dictionary-structured dataset, keyed by column name. The optional spec selects and orders columns and toggles shuffling and batching.

PARAMETER	DESCRIPTION
`view`	The flattened Arrow view to export. TYPE: `Table`
`spec`	An optional export specification. When `None` every column is kept and the dataset is neither shuffled nor batched. TYPE: `TfDataSpec or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Dataset`	A dictionary-structured dataset, one entry per retained column.

RAISES	DESCRIPTION
`ImportError`	When tensorflow is not installed.

token_of ¶

token_of(arrow_type: DataType) -> tuple[str, bool]

Map an Arrow data type to a tensorflow dtype token and a sequence flag.

The mapping is pure and tensorflow-free. List and large-list types are reported as sequences over their element token; every other type collapses to a scalar token. Unrecognised types fall back to the string token.

PARAMETER	DESCRIPTION
`arrow_type`	The Arrow column type to map. TYPE: `DataType`

RETURNS	DESCRIPTION
`tuple of (str, bool)`	A `(token, is_sequence)` pair. `token` is a tensorflow dtype name and `is_sequence` reports whether the column is list-valued.
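The shape of this mapping can be sketched without pyarrow or tensorflow. Everything here is a stand-in: a type is modelled as either a name string or a `("list", element)` tuple rather than an Arrow `DataType`, and the token table is an assumption, not the table lairs uses.

```python
from typing import Union

# Stand-in for an Arrow type: a scalar name, or ("list" | "large_list", element).
ArrowLike = Union[str, tuple]

# Hypothetical scalar mapping from Arrow type names to dtype tokens.
SCALAR_TOKENS = {
    "int32": "int32",
    "int64": "int64",
    "float": "float32",
    "double": "float64",
    "bool": "bool",
    "string": "string",
}


def token_for(arrow_type: ArrowLike) -> tuple[str, bool]:
    """Return (token, is_sequence) for a stand-in Arrow type."""
    if isinstance(arrow_type, tuple) and arrow_type[0] in ("list", "large_list"):
        # A list is reported as a sequence over its element's token.
        element_token, _ = token_for(arrow_type[1])
        return element_token, True
    # Scalars map through the table; unrecognised types fall back to string.
    return SCALAR_TOKENS.get(arrow_type, "string"), False
```

The fallback keeps the derivation total: an unknown type still produces a usable token instead of raising.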

feature_specs_of ¶

feature_specs_of(
    schema: Schema, *, columns: tuple[str, ...] = ()
) -> tuple[TfFeatureSpec, ...]

Derive the tensorflow feature specs for an Arrow schema.

The derivation is pure and tensorflow-free. Each retained column becomes one :class:`TfFeatureSpec` carrying its dtype token and whether it is list-valued.

PARAMETER	DESCRIPTION
`schema`	The Arrow schema to describe. TYPE: `Schema`
`columns`	The columns to keep, in order. An empty tuple keeps every column in schema order. Names absent from the schema are skipped. TYPE: `tuple of str` DEFAULT: `()`

RETURNS	DESCRIPTION
`tuple of TfFeatureSpec`	One feature spec per retained column.

WebDataset¶

lairs.integrations.webdataset ¶

WebDataset data-plane exporter.

Emits tar shards for heavy media from an Arrow view, binding to the :class:`~lairs.integrations.ports.Exporter` port. Each sample is a keyed group of files: a `__key__`, a .json member holding the row's scalar fields, and, when a media column is present, a media member carrying the resolved bytes.

The tar-writing path uses the standard-library :mod:`tarfile`, so basic sharding is exercised without the optional webdataset library. The webdataset library (the lairs[webdataset] extra) is imported lazily inside the read-back loader only, with a clear error when it is missing, so importing this module never pulls the dependency in.

WebDatasetSpec ¶

Bases: Model

An export specification for the WebDataset exporter.

ATTRIBUTE	DESCRIPTION
`output_dir`	The directory the tar shards are written into. Created if absent. TYPE: `(str, optional)`
`shard_size`	The maximum number of samples per shard. The final shard may be smaller. TYPE: `(int, optional)`
`shard_prefix`	The filename stem each shard is named after (`<prefix>-000000.tar`). TYPE: `(str, optional)`
`key_column`	The Arrow column whose value names each sample (its `__key__`). When `None` the row index is used, zero-padded to a stable width. TYPE: `(str or None, optional)`
`media_column`	The Arrow column carrying media bytes or a resolvable media record. When present, each sample gains a media member alongside its json metadata. TYPE: `(str or None, optional)`

WebDatasetExporter ¶

An exporter that emits WebDataset tar shards from an Arrow view.

The exporter binds to the generic :class:`~lairs.integrations.ports.Exporter` port with a :class:`pyarrow.Table` view, a :class:`WebDatasetSpec` specification, and a list of written shard paths as its return type.

export ¶

export(
    view: Table, *, spec: WebDatasetSpec | None = None
) -> list[Path]

Export an Arrow view to WebDataset tar shards.

Each row becomes one sample. A sample carries a .json member with the row's scalar (non-media) fields and, when spec.media_column is set, a media member with the resolved bytes. Samples are grouped into shards of at most spec.shard_size rows, each written as a tar archive.

PARAMETER	DESCRIPTION
`view`	The flattened Arrow view to export. TYPE: `Table`
`spec`	The export specification. A default spec is used when omitted. TYPE: `WebDatasetSpec or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list of pathlib.Path`	The written tar shard files, in shard order.

RAISES	DESCRIPTION
`ValueError`	When `spec.shard_size` is not positive, or a named column is absent from the view.
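The shard layout can be sketched with the standard-library `tarfile` alone. This is an illustration under assumptions, not the lairs writer: rows are plain dicts rather than an Arrow table, the media member uses a hypothetical `.bin` extension, the row-index key is padded to a hypothetical width of eight, and metadata values are assumed to be JSON-serialisable.

```python
import io
import json
import tarfile
from pathlib import Path
from typing import Any, Optional


def _add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shards(
    rows: list[dict[str, Any]],
    output_dir: str,
    *,
    shard_size: int = 1000,
    shard_prefix: str = "shard",
    key_column: Optional[str] = None,
    media_column: Optional[str] = None,
) -> list[Path]:
    """Write rows as keyed tar samples, at most shard_size per shard."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    for shard_index, start in enumerate(range(0, len(rows), shard_size)):
        path = out / f"{shard_prefix}-{shard_index:06d}.tar"
        with tarfile.open(path, "w") as tar:
            for offset, row in enumerate(rows[start : start + shard_size]):
                index = start + offset
                key = str(row[key_column]) if key_column is not None else f"{index:08d}"
                # Files sharing a basename form one WebDataset sample.
                meta = {k: v for k, v in row.items() if k != media_column}
                _add_member(tar, f"{key}.json", json.dumps(meta).encode("utf-8"))
                if media_column is not None:
                    _add_member(tar, f"{key}.bin", row[media_column])
        paths.append(path)
    return paths
```

Three rows with a `shard_size` of two produce two archives, `shard-000000.tar` and `shard-000001.tar`, the second holding the single trailing sample.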

load ¶

load(shards: list[Path]) -> Iterator[dict[str, JsonValue]]

Read shards back through the webdataset loader.

This is the read-back path used by training loops. It requires the optional webdataset library, which is imported lazily inside this method so that importing the module never pulls the dependency in.

PARAMETER	DESCRIPTION
`shards`	The shard files to read, in order. TYPE: `list of pathlib.Path`

RETURNS	DESCRIPTION
`collections.abc.Iterator of dict`	The decoded samples, one mapping per sample.

RAISES	DESCRIPTION
`ImportError`	When the optional `webdataset` library is not installed.
