Exporters: Arrow views to framework datasets

This guide covers the Exporter port and the bundled exporters that turn a flattened Arrow view into a framework-native dataset: HuggingFace datasets, the Hub push/pull, PyTorch, tf.data, and WebDataset. It notes which extra each needs.

An exporter consumes an Arrow view (a pyarrow.Table, produced by Dataset.to_arrow) and emits a target-framework object. (Corpus.materialize instead writes Parquet view files to disk and returns their paths, so it is not a source for export.) The Exporter protocol in lairs.integrations.ports is generic over the view, the export specification, and the returned object. The one method is export(view, *, spec=None). Because the Arrow flattening already resolves the polymorphic Layers anchors into typed columns, the exporters stay thin: they project columns and hand the table over.

Resolve an exporter by name:

import lairs

HuggingFaceExporter = lairs.exporter("hf")
TorchExporter = lairs.exporter("torch")
TfDataExporter = lairs.exporter("tfdata")
WebDatasetExporter = lairs.exporter("webdataset")

An unknown name raises UnknownAdapterError, listing the available exporters. Every framework dependency is imported lazily, so importing an exporter module never pulls its extra in. The optional import surfaces only when an export runs.
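
The pattern described above (a generic one-method port, lookup by name, and imports deferred until an export runs) can be sketched with the standard library alone. This is a minimal illustration, not the lairs implementation: the registry, the register helper, and JsonLinesExporter are hypothetical, and only the names Exporter, export, and UnknownAdapterError are taken from the text.

```python
import importlib
from typing import Any, Callable, Dict, List, Optional, Protocol, Tuple, TypeVar

V = TypeVar("V", contravariant=True)  # view type, e.g. an Arrow table
S = TypeVar("S", contravariant=True)  # export specification type
R = TypeVar("R", covariant=True)      # returned framework object


class Exporter(Protocol[V, S, R]):
    """Hypothetical stand-in for the port: one method, keyword-only spec."""

    def export(self, view: V, *, spec: Optional[S] = None) -> R: ...


class UnknownAdapterError(KeyError):
    """Raised for a name with no registered exporter."""


# Hypothetical registry: each name maps to a loader that returns the class.
_REGISTRY: Dict[str, Callable[[], type]] = {}


def register(name: str, loader: Callable[[], type]) -> None:
    _REGISTRY[name] = loader


def exporter(name: str) -> type:
    """Resolve an exporter class by name, listing the alternatives on failure."""
    try:
        loader = _REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(_REGISTRY))
        raise UnknownAdapterError(
            f"unknown exporter {name!r}; available: {available}"
        ) from None
    return loader()


class JsonLinesExporter:
    """Toy exporter whose dependency is imported only when export() runs."""

    def export(
        self, view: List[Dict[str, Any]], *, spec: Optional[Tuple[str, ...]] = None
    ) -> str:
        # Deferred import; json stands in for an optional framework extra.
        json = importlib.import_module("json")
        columns = spec if spec is not None else tuple(view[0]) if view else ()
        return "\n".join(json.dumps({c: row[c] for c in columns}) for row in view)


register("jsonl", lambda: JsonLinesExporter)

# Resolve the class, instantiate it, then export a projected view.
text = exporter("jsonl")().export([{"a": 1, "b": 2}], spec=("a",))
```

Keeping the import inside export is what lets a module be imported, and its pure helpers used, on a machine where the extra is not installed.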

HuggingFace datasets

HuggingFaceExporter wraps the Arrow table in a datasets.Dataset with almost no copying (near zero-copy).

from lairs.integrations.hf.datasets import ExportSpec, task_template_for

exporter = lairs.exporter("hf")()
ds = exporter.export(table)                              # all columns

spec = ExportSpec(shape="exploded", columns=("tokens", "labels", "token_index"))
ds = exporter.export(table, spec=spec)                  # projected

The shape is descriptive metadata, not a re-shaping step: nested keeps one row per expression with annotations as sequence-valued columns, exploded keeps one row per annotation. The Arrow builders produce one shape or the other.
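
To make the two shapes concrete, the sketch below builds a small nested view as plain Python rows and derives the exploded equivalent. It is illustrative only: the exporter itself performs no such conversion (the Arrow builders produce the shape), and the sample rows and the explode helper are hypothetical.

```python
# Nested: one row per expression, annotations held in sequence-valued columns.
nested = [
    {"expression": "e1", "tokens": ["The", "cat"], "labels": ["DET", "NOUN"]},
    {"expression": "e2", "tokens": ["Sleeps"], "labels": ["VERB"]},
]


def explode(rows):
    """Return one row per annotation, recording its position as token_index."""
    out = []
    for row in rows:
        for index, (token, label) in enumerate(zip(row["tokens"], row["labels"])):
            out.append(
                {
                    "expression": row["expression"],
                    "token_index": index,
                    "tokens": token,
                    "labels": label,
                }
            )
    return out


# Exploded: one row per annotation, three rows for the two expressions above.
exploded = explode(nested)
```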

Task templates. task_template_for(kind, subkind=, formalism=) returns the most specific canonical HuggingFace task shape for a Layers triple (for example token-classification for a token-tag layer, dependency-parsing for a tree/universal-dependencies layer). ExportSpec.for_template(template) projects that template's columns and records its task name for downstream tooling and dataset-card provenance.

template = task_template_for("token-tag", subkind="pos")
ds = exporter.export(table, spec=ExportSpec.for_template(template))
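
One way to read "most specific" is a fallback lookup that tries the full triple first and then progressively drops fields. The sketch below shows that idea under stated assumptions: the table, its placeholder task names, and the fallback order are all hypothetical, and the real task_template_for may resolve templates differently.

```python
# Hypothetical table; keys are (kind, subkind, formalism), with None as "any".
TEMPLATES = {
    ("token-tag", None, None): "generic-token-task",
    ("token-tag", "pos", None): "pos-task",
    ("token-tag", "pos", "ud"): "ud-pos-task",
}


def most_specific(kind, subkind=None, formalism=None):
    """Return the entry matching the most fields of the requested triple."""
    candidates = (
        (kind, subkind, formalism),  # exact triple
        (kind, subkind, None),       # ignore the formalism
        (kind, None, None),          # kind only
    )
    for key in candidates:
        if key in TEMPLATES:
            return TEMPLATES[key]
    raise KeyError(f"no template for kind {kind!r}")


pos_task = most_specific("token-tag", subkind="pos")
fallback_task = most_specific("token-tag", subkind="ner")
```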

Streaming. to_hf_iterable(source, spec=) builds a datasets.IterableDataset from a zero-argument factory returning fresh Arrow RecordBatch iterators, so a large corpus trains without a full download.
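
The reason for a zero-argument factory, rather than a single iterator, is that an iterator can be consumed only once while a factory can hand out a fresh one for every pass. The sketch below demonstrates this with plain lists standing in for Arrow RecordBatches; the Reiterable class and batches function are hypothetical names.

```python
from typing import Callable, Dict, Iterator, List

Batch = List[Dict[str, int]]  # stand-in for an Arrow RecordBatch


class Reiterable:
    """Wrap a factory so every iteration starts from a fresh batch iterator."""

    def __init__(self, source: Callable[[], Iterator[Batch]]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[Dict[str, int]]:
        for batch in self._source():  # call the factory on each pass
            yield from batch


def batches() -> Iterator[Batch]:
    yield [{"id": 0}, {"id": 1}]
    yield [{"id": 2}]


stream = Reiterable(batches)
first_epoch = [row["id"] for row in stream]
second_epoch = [row["id"] for row in stream]  # works: a new iterator is built

# A bare iterator, by contrast, yields nothing on a second pass.
exhausted = batches()
list(exhausted)
leftover = list(exhausted)
```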

Feature schema. hf_features_from(features) derives a datasets.Features mapping from a lairs Features schema, mapping each spec to a Value (or a Sequence of a Value for sequence tokens). Struct tokens degrade to a JSON string column.
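
The mapping rule can be sketched without importing datasets. In the sketch below the three kinds ("scalar", "sequence", "struct") and the dictionary layout are hypothetical stand-ins for the lairs Features schema, and the output is a plain description rather than real datasets.Value or datasets.Sequence objects.

```python
import json

# Hypothetical schema: column name -> (kind, element dtype name).
schema = {
    "text": ("scalar", "string"),
    "token_index": ("sequence", "int64"),
    "span": ("struct", None),
}


def describe_feature(kind, dtype):
    """Return a plain description of the target feature for one column."""
    if kind == "scalar":
        return {"feature": "Value", "dtype": dtype}
    if kind == "sequence":
        return {"feature": "Sequence", "of": {"feature": "Value", "dtype": dtype}}
    if kind == "struct":
        # Structs degrade to a string column that holds JSON text.
        return {"feature": "Value", "dtype": "string"}
    raise ValueError(f"unknown kind: {kind!r}")


def encode_cell(kind, value):
    """Serialize struct cells to JSON text; pass other cells through."""
    return json.dumps(value, sort_keys=True) if kind == "struct" else value


features = {name: describe_feature(kind, dtype) for name, (kind, dtype) in schema.items()}
cell = encode_cell("struct", {"start": 0, "end": 3})
```

With datasets installed, the descriptions would correspond to datasets.Value("string") and datasets.Sequence(datasets.Value("int64")).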

datasets comes from the lairs[hf] extra. Every method here raises a clear ImportError when it is absent.

HuggingFace Hub

lairs.integrations.hf.hub mirrors a corpus to the Hub as Arrow/Parquet shards with an auto-generated dataset card carrying provenance, and reads a mirror back. The Hub is a mirror target only. The PDS and the Repository stay canonical.

from lairs.integrations.hf.hub import provenance_bundle, push_to_hub, load_from_hub

bundle = provenance_bundle(
    corpus_uri="at://did:plc:author/pub.layers.corpus.corpus/abc",
    revision="v1",
    license="CC-BY-4.0",
    name="my-corpus",
)
url = push_to_hub(table, "org/my-corpus", private=False, provenance=bundle)
ds = load_from_hub("org/my-corpus", revision="main")

The ProvenanceBundle records the corpus AT-URI, the Repository revision or tag, the lexicon manifest hash, the Layers version, the license, and the corpus name. The lexicon manifest hash and Layers version are read from the vendored manifest packaged with lairs; the remaining fields, including license, are supplied by the caller from the corpus record. The license field is a plain license-identifier string that the caller passes through (a slug such as CC-BY-4.0, or an expression such as MIT OR Apache-2.0). dataset_card(bundle) renders the markdown card, in which only set fields appear. push_to_hub needs both datasets and huggingface_hub from lairs[hf], and load_from_hub needs datasets; each raises a clear ImportError when its dependency is absent. Hub authentication is the caller's responsibility (the usual huggingface_hub login).
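
The "only set fields appear" behaviour can be sketched with a dataclass whose fields default to None. The Bundle class, its field subset, and the card layout below are hypothetical; the real ProvenanceBundle and dataset_card may differ in both fields and formatting.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    """Hypothetical subset of the provenance fields; None means unset."""

    corpus_uri: Optional[str] = None
    revision: Optional[str] = None
    license: Optional[str] = None
    name: Optional[str] = None


def render_card(bundle: Bundle) -> str:
    """Render a markdown card that lists only the fields that are set."""
    title = bundle.name if bundle.name is not None else "Dataset"
    lines = [f"# {title}", "", "## Provenance", ""]
    for field in fields(bundle):
        value = getattr(bundle, field.name)
        if field.name != "name" and value is not None:
            lines.append(f"- {field.name}: {value}")
    return "\n".join(lines)


card = render_card(
    Bundle(
        corpus_uri="at://did:plc:author/pub.layers.corpus.corpus/abc",
        license="CC-BY-4.0",
        name="my-corpus",
    )
)
```

Because revision is unset in this example, the rendered card carries no revision line at all rather than an empty placeholder.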

PyTorch

TorchExporter.export returns a TorchExportResult bundling a map-style Dataset, an IterableDataset variant, and the tensor columns its collate helper stacks.

from torch.utils.data import DataLoader

from lairs.integrations.torch import TorchExportSpec

exporter = lairs.exporter("torch")()
result = exporter.export(table, spec=TorchExportSpec(
    columns=("token_index", "label", "byte_start", "byte_end"),
    tensor_columns=None,                 # infer numeric/anchor columns from the schema
))
loader = DataLoader(result.dataset, batch_size=32, collate_fn=result.collate)

Numeric and anchor columns become tensors, and the rest pass through as Python values. When tensor_columns is unset, the numeric columns are inferred from the Arrow schema. The flat view carries no blob payloads, so media bytes are not materialized here. spec.resolve_media is recorded on the result for a downstream loader transform (which owns the media resolver) to act on. The column-selection helpers are pure Python, but export always imports torch, because both the map-style and iterable datasets it builds subclass torch.utils.data.Dataset / IterableDataset. torch comes from the lairs[torch] extra, and a missing column named in the spec raises KeyError.
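
The split between stacked and pass-through columns can be sketched in pure Python. Everything here is an assumption made for illustration: the schema is a plain mapping of column names to type-name strings, infer_tensor_columns and collate are hypothetical helpers, and the stack parameter defaults to tuple so the sketch runs without torch.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence, Tuple

NUMERIC_PREFIXES = ("int", "uint", "float", "double")


def infer_tensor_columns(schema: Mapping[str, str]) -> Tuple[str, ...]:
    """Pick the columns whose type name looks numeric."""
    return tuple(
        name for name, type_name in schema.items()
        if type_name.startswith(NUMERIC_PREFIXES)
    )


def collate(
    batch: List[Dict[str, Any]],
    tensor_columns: Sequence[str],
    stack: Callable[[List[Any]], Any] = tuple,
) -> Dict[str, Any]:
    """Stack the tensor columns; leave the rest as lists of Python values."""
    if not batch:
        return {}
    for name in tensor_columns:
        if name not in batch[0]:
            raise KeyError(name)  # a named column that the rows do not carry
    out: Dict[str, Any] = {}
    for name in batch[0]:
        values = [row[name] for row in batch]
        out[name] = stack(values) if name in tensor_columns else values
    return out


schema = {"token_index": "int64", "label": "string", "byte_start": "int64"}
tensor_columns = infer_tensor_columns(schema)
batch = collate(
    [
        {"token_index": 0, "label": "DET", "byte_start": 0},
        {"token_index": 1, "label": "NOUN", "byte_start": 4},
    ],
    tensor_columns,
)
```

With torch installed, passing stack=torch.tensor would turn each numeric column into a tensor while the string column stays a list.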

tf.data

TfDataExporter.export emits a dictionary-structured tf.data.Dataset, one tensor per retained column.

from lairs.integrations.tfdata import TfDataSpec, feature_specs_of

exporter = lairs.exporter("tfdata")()
ds = exporter.export(table, spec=TfDataSpec(
    columns=("tokens", "labels"),
    batch_size=64,
    shuffle_buffer=1000,
    seed=0,
    drop_remainder=True,
))

The Arrow-schema to feature-spec derivation is pure and tensorflow-free: feature_specs_of(schema, columns=) maps each column to a TfFeatureSpec with a dtype token and a list-valued flag. Only resolving those tokens to concrete dtypes and building the dataset touch tensorflow, behind a lazy import. tensorflow comes from the lairs[tf] extra (Python < 3.14), and export raises a clear ImportError when it is absent.
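
A derivation of this kind needs nothing beyond the schema's column names and type names, which is why it can stay tensorflow-free. The sketch below assumes the schema is given as a mapping from column name to a type-name string such as "list<string>"; FeatureSpec and specs_of are hypothetical stand-ins for TfFeatureSpec and feature_specs_of.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional, Sequence


@dataclass(frozen=True)
class FeatureSpec:
    dtype: str     # dtype token, resolved to a framework dtype only later
    is_list: bool  # True when the column is list-valued


def specs_of(
    schema: Mapping[str, str], columns: Optional[Sequence[str]] = None
) -> Dict[str, FeatureSpec]:
    """Map each retained column to a dtype token and a list-valued flag."""
    names = list(schema) if columns is None else list(columns)
    specs: Dict[str, FeatureSpec] = {}
    for name in names:
        if name not in schema:
            raise KeyError(name)
        type_name = schema[name]
        is_list = type_name.startswith("list<") and type_name.endswith(">")
        element = type_name[len("list<"):-1] if is_list else type_name
        specs[name] = FeatureSpec(dtype=element, is_list=is_list)
    return specs


specs = specs_of(
    {"tokens": "list<string>", "labels": "list<int64>", "uri": "string"},
    columns=("tokens", "labels"),
)
```

Only a later step that turns tokens such as "int64" into concrete framework dtypes would need the lazy tensorflow import.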

WebDataset

WebDatasetExporter.export writes tar shards for heavy media. Each row becomes one sample: a __key__, a .json member with the row's scalar fields, and, when a media column is set, a media member with the resolved bytes.

from lairs.integrations.webdataset import WebDatasetSpec

exporter = lairs.exporter("webdataset")()
shards = exporter.export(table, spec=WebDatasetSpec(
    output_dir="out/shards",
    shard_size=1000,
    shard_prefix="train",
    key_column="uri",
    media_column="media",
))

A media cell that is raw bytes is embedded directly. A JSON-shaped media record is resolved through the media resolver, and the resolved mime type determines the member extension. A non-positive shard_size or a named column absent from the view raises ValueError. The tar-writing path uses the standard library, so sharding runs without any extra. The read-back loader, load(shards), requires the optional webdataset library (the lairs[webdataset] extra) and is imported lazily. Calling it without the library raises a clear ImportError.
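
The write path can be approximated with the standard library's tarfile module. The sketch below is not the lairs writer: write_shards, its parameters, the shard file naming, and the fixed "bin" extension are assumptions, rows are plain dictionaries rather than an Arrow view, and media cells are assumed to be raw bytes (no resolver step).

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def _add(tar, name, payload):
    """Add an in-memory bytes payload to the archive under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shards(rows, output_dir, *, shard_size, key_column,
                 shard_prefix="shard", media_column=None, media_ext="bin"):
    """Write rows as tar shards: a .json member per row, plus optional media."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(rows), shard_size):
        path = out / f"{shard_prefix}-{start // shard_size:06d}.tar"
        with tarfile.open(path, "w") as tar:
            for row in rows[start:start + shard_size]:
                for column in (key_column, media_column):
                    if column is not None and column not in row:
                        raise ValueError(f"column {column!r} is absent")
                key = str(row[key_column])
                scalars = {k: v for k, v in row.items() if k != media_column}
                _add(tar, f"{key}.json", json.dumps(scalars).encode("utf-8"))
                if media_column is not None:
                    _add(tar, f"{key}.{media_ext}", row[media_column])
        paths.append(path)
    return paths


rows = [{"id": f"s{i}", "label": i, "media": bytes([i])} for i in range(3)]
shards = write_shards(rows, tempfile.mkdtemp(), shard_size=2,
                      shard_prefix="train", key_column="id", media_column="media")
with tarfile.open(shards[0]) as tar:
    members = tar.getnames()
```

Members that share a key sit next to each other in the archive, which is how a reader groups the .json and media members back into one sample.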

Extras at a glance

| Exporter | Extra | Works without the extra |
| --- | --- | --- |
| hf | lairs[hf] (datasets, huggingface_hub) | task-template selection |
| torch | lairs[torch] (export always needs it) | the standalone column-selection helpers, and collate without tensor columns |
| tfdata | lairs[tf] (Python < 3.14) | feature-spec derivation from the schema |
| webdataset | lairs[webdataset] for read-back | tar sharding (write path) |

See also