Integrations
lairs has to plug into the tools people already use (HuggingFace, PyTorch, the linguistic-annotation formats, the knowledge bases), but the governing rule is that integrations are never in core. Each is an optional extra, discovered at runtime, and written against a small set of stable surfaces rather than against lairs internals. This page explains the ports-and-adapters design that makes that possible, the surfaces an adapter binds to, the adapter families and how they are discovered, and why the separation is worth the indirection.
Four data surfaces an adapter binds to
An adapter never reaches into lairs internals. It binds to one (or more) of four canonical data surfaces, each of which already exists and is stable:
- Records: the generated dx.Model instances, the typed object layer.
- Arrow views: the flattened table form with typed anchor columns. This is the data-plane lingua franca, and most ML targets go through Arrow rather than bespoke per-framework code.
- The anchor resolver: resolve_anchor(anchor, target), the single entry point for the text, token, audio, video, or signal slice an annotation points at (see anchors and modality).
- Repository revisions: the version-control commits and tags that carry provenance and pin a reproducible dataset version (see reproducibility).
Binding only to these four is the abstraction that keeps per-integration code thin. Layers already normalizes everything to (expression text or media) plus (anchor) plus (annotation kind), so an adapter rarely needs schema logic of its own. A HuggingFace exporter, for instance, has almost no schema code because the Arrow flattening has already resolved the polymorphic anchor into typed columns, so the exporter consumes columns, not a union.
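To make the columns-not-a-union point concrete, here is a minimal sketch in plain Python. The anchor classes, the column names, and the flatten function are hypothetical stand-ins chosen for illustration; they are not the lairs schema or its actual Arrow flattening.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical anchor variants; the real anchor model is richer than this.
@dataclass(frozen=True)
class TextSpan:
    start: int
    end: int


@dataclass(frozen=True)
class TimeSpan:
    start_ms: int
    end_ms: int


Anchor = Union[TextSpan, TimeSpan]


def flatten(anchors: list[Anchor]) -> dict[str, list[Optional[object]]]:
    """Resolve the anchor union into typed, nullable columns (assumed layout)."""
    columns: dict[str, list[Optional[object]]] = {
        "anchor_kind": [],
        "char_start": [],
        "char_end": [],
        "time_start_ms": [],
        "time_end_ms": [],
    }
    for anchor in anchors:
        is_text = isinstance(anchor, TextSpan)
        columns["anchor_kind"].append("text" if is_text else "time")
        columns["char_start"].append(anchor.start if is_text else None)
        columns["char_end"].append(anchor.end if is_text else None)
        columns["time_start_ms"].append(None if is_text else anchor.start_ms)
        columns["time_end_ms"].append(None if is_text else anchor.end_ms)
    return columns


def export_text_rows(columns: dict[str, list]) -> list[tuple[int, int]]:
    """Exporter-style consumer: it reads columns and never inspects the union."""
    rows = zip(columns["anchor_kind"], columns["char_start"], columns["char_end"])
    return [(start, end) for kind, start, end in rows if kind == "text"]
```

The consumer at the bottom contains no isinstance check and no knowledge of the anchor classes, which is the property the paragraph above attributes to a thin exporter.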
The three adapter families
Adapters come in three shapes, each a small Protocol declared in
lairs.integrations.ports, each generic over its payload and return
types so that no method ever returns a widened type:
- Codecs translate bidirectionally between an external annotation
  format and lairs records. A codec decodes an external source into a
  corpus fragment and encodes records back out. The pivot is the anchor
  model: a codec only has to translate its spans and labels into lairs
  anchors and one of the seven annotation kinds. lairs owns the rest.
  The registered codecs are CoNLL-U and brat standoff.
- Exporters consume the Arrow views (and the anchor resolver) and
  emit a framework-native dataset. An exporter exports a view, with an
  optional spec, into a target object such as a datasets.Dataset or a
  torch.utils.data.Dataset. Examples: HuggingFace datasets, PyTorch,
  tf.data, WebDataset.
- Knowledge bases resolve, entity-link, and expand against external
  knowledge graphs and lexical resources. A connector resolves an
  identifier to an entity, searches surface text for candidate entities,
  and optionally returns an entity's neighbors. Examples: Wikidata, a
  generic reconciliation endpoint, the glazing lexical-semantic resources.
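The three shapes can be sketched as generic protocols. This is an illustration under assumptions, not the contents of lairs.integrations.ports: the method signatures, type-variable names, and the toy TokenLabelCodec are hypothetical, and the optional neighbors method of a knowledge base is omitted.

```python
from __future__ import annotations

from typing import Mapping, Optional, Protocol, Sequence, TypeVar, runtime_checkable

SourceT = TypeVar("SourceT")  # external payload, for example annotation text
RecordT = TypeVar("RecordT")  # record type on the lairs side
ViewT = TypeVar("ViewT", contravariant=True)
TargetT = TypeVar("TargetT", covariant=True)
EntityT = TypeVar("EntityT", covariant=True)


@runtime_checkable
class Codec(Protocol[SourceT, RecordT]):
    def decode(self, source: SourceT) -> Sequence[RecordT]: ...
    def encode(self, records: Sequence[RecordT]) -> SourceT: ...


@runtime_checkable
class Exporter(Protocol[ViewT, TargetT]):
    def export(
        self, view: ViewT, spec: Optional[Mapping[str, object]] = None
    ) -> TargetT: ...


@runtime_checkable
class KnowledgeBase(Protocol[EntityT]):
    def resolve(self, identifier: str) -> EntityT: ...
    def search(self, text: str) -> Sequence[EntityT]: ...


class TokenLabelCodec:
    """Toy codec for 'token<TAB>label' lines; fits Codec[str, tuple[str, str]]."""

    def decode(self, source: str) -> list[tuple[str, str]]:
        rows = []
        for line in source.splitlines():
            token, label = line.split("\t")
            rows.append((token, label))
        return rows

    def encode(self, records: Sequence[tuple[str, str]]) -> str:
        return "\n".join(f"{token}\t{label}" for token, label in records)
```

Because the protocols are structural, TokenLabelCodec never inherits from Codec. A type checker that sees it used as Codec[str, tuple[str, str]] can infer that decode returns token and label pairs rather than a widened object type.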
A fourth port, StorageBackend, abstracts byte storage (read, write,
exists) so the blob cache and the Parquet views can sit on local or
remote storage. It is a supporting surface rather than a fourth adapter
family.
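A storage port with those three operations might look like the sketch below. Only the names StorageBackend, read, write, and exists come from the text above; the key and byte signatures, the in-memory backend, and the cache_blob helper are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Protocol


class StorageBackend(Protocol):
    # Signatures are assumed; the source names only the three operations.
    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...
    def exists(self, key: str) -> bool: ...


class InMemoryStorage:
    """Hypothetical backend that keeps blobs in a dict, useful in tests."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._blobs[key]

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def exists(self, key: str) -> bool:
        return key in self._blobs


def cache_blob(storage: StorageBackend, key: str, data: bytes) -> bytes:
    """Write-once helper that depends only on the port, not on a backend."""
    if not storage.exists(key):
        storage.write(key, data)
    return storage.read(key)
```

A caller written against the port, like cache_blob here, works unchanged whether the backend is local, remote, or the in-memory stand-in.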
Experiment tracking is a further integration capability that sits outside
the three families. lairs.integrations.tracking.log_revision binds the
Repository-revisions surface to Weights & Biases or MLflow (the
lairs[tracking] extra): it records a ProvenanceBundle pinning the exact
commit or tag and the vendored lexicon manifest hash, not a copy of the
data, so the dataset behind a logged run can always be rebuilt from its
revision. Like the adapters, the backend libraries are imported lazily, so
importing the module never pulls in wandb or mlflow.
The three families correspond to the three places external tools meet Layers data: at the format boundary (codecs), at the data plane (exporters), and at the grounding boundary (knowledge bases). The data surfaces of the previous section and the adapter families of this one are two different axes: a surface is what an adapter touches, a family is what kind of adapter it is. A codec touches the records surface, an exporter touches the Arrow and anchor surfaces, and a knowledge base touches the records surface.
Entry-point discovery
Adapters are not imported by lairs. They are discovered at runtime through
Python entry points, in the groups lairs.codecs, lairs.exporters, and
lairs.knowledge_bases. A registry resolves a name to an adapter class:
it consults in-process registrations first, then (once, lazily) the
entry points, so an in-process registration takes precedence over an
entry point of the same name. An unknown name raises an error that lists
the installed adapters, so the failure is legible.
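The lookup order described above can be sketched with importlib.metadata from the standard library. The Registry class, its method names, and the choice of LookupError are assumptions for illustration and are not the lairs registry itself.

```python
from __future__ import annotations

from importlib.metadata import EntryPoint, entry_points
from typing import Generic, Optional, TypeVar

AdapterT = TypeVar("AdapterT")


def _group_entry_points(group: str) -> dict[str, EntryPoint]:
    eps = entry_points()
    # Newer Pythons expose .select(); 3.8 and 3.9 return a plain dict of groups.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    return {ep.name: ep for ep in selected}


class Registry(Generic[AdapterT]):
    """Hypothetical registry: in-process names first, then entry points."""

    def __init__(self, group: str) -> None:
        self._group = group
        self._registered: dict[str, type[AdapterT]] = {}
        self._discovered: Optional[dict[str, EntryPoint]] = None

    def register(self, name: str, adapter: type[AdapterT]) -> None:
        self._registered[name] = adapter

    def resolve(self, name: str) -> type[AdapterT]:
        # 1. In-process registrations win over entry points.
        if name in self._registered:
            return self._registered[name]
        # 2. Entry points are scanned once, on first miss.
        if self._discovered is None:
            self._discovered = _group_entry_points(self._group)
        if name in self._discovered:
            # Only the requested adapter module is imported here.
            return self._discovered[name].load()
        # 3. Unknown names fail with the list of what is installed.
        installed = sorted(set(self._registered) | set(self._discovered))
        raise LookupError(
            f"unknown adapter {name!r} in group {self._group!r}; "
            f"installed: {installed}"
        )
```

A third-party distribution would then only need to declare an entry point in the matching group in its own packaging metadata. The scan stores EntryPoint objects and calls load() on the one requested, so discovering adapters does not import them.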
This is what lets a third party ship an adapter as its own distribution. A broadly useful codec can graduate to its own PyPI package and register under the same entry-point group, and lairs needs no change to find it. The registry is generic over the adapter type it holds, so a lookup returns a precisely typed adapter class rather than a widened one.
Why integrations stay out of core
The separation has a concrete payoff: importing lairs never imports an
integration's heavy dependency. A reader who wants records off a PDS does
not pay for torch or datasets or a SPARQL client. Each integration is
an optional extra, and its dependency is loaded only when its adapter is
actually used.
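One common way to get this behavior is to defer the import to the call that needs it. The sketch below is a generic illustration of that pattern, not lairs code: lazy_dependency, JsonExporter, and the extra name lairs[example] are hypothetical, and the standard-library json module stands in for a heavy dependency.

```python
import importlib
from types import ModuleType


def lazy_dependency(module_name: str, extra: str) -> ModuleType:
    """Import a dependency on first use, with a hint naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this adapter; "
            f"install the {extra!r} extra"
        ) from exc


class JsonExporter:
    """Stand-in adapter: defining or importing it loads no dependency."""

    def export(self, rows: list) -> str:
        # The import happens here, only when the adapter is actually used.
        json = lazy_dependency("json", extra="lairs[example]")
        return json.dumps(rows)
```

With this shape, a missing extra surfaces as an ImportError at the point of use, with the name of the extra in the message, rather than as a failure when the core package is imported.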
The deeper reason is resilience to churn. An adapter that binds to the four stable surfaces does not break when lairs refactors its internals, because it never touched them. The ports are the contract, and everything behind them is free to change. This is the same ports-and-adapters discipline used elsewhere in the stack for emitter and lens frameworks, applied here to integrations, and it is what allows the integration catalog to be broad without making the core large or fragile.
Because codecs and exporters are uniform, registered, and bound only to
the four surfaces, pipelines compose: decode an external corpus with a
codec, transform it, export it with an exporter, mirror it to a hub with
its provenance intact. The mirror step is the HuggingFace Hub push/pull
surface: push_to_hub, load_from_hub, and the dataset_card and
provenance_bundle helpers, re-exported from lairs.integrations.hf. It
writes a corpus to the Hub as Arrow/Parquet shards behind a dataset card
carrying the corpus AT-URI, the Repository revision, and the vendored
lexicon manifest hash, and it reads a mirror back. The PDS and the
Repository stay canonical; the Hub is an export and mirror target. Codecs carry
round-trip law fixtures (decode(encode(x)) recovers x on the supported
subset) and exporters carry schema-parity fixtures, so the composition is
checked rather than assumed.
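A round-trip law fixture can be as small as the following sketch. The span-line format, the encode and decode functions, and the fixture list are hypothetical; they illustrate the law decode(encode(x)) == x on a supported subset and do not reproduce any lairs codec or its fixtures.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), an assumed record shape


def encode(spans: List[Span]) -> str:
    """Serialize spans as 'start end label' lines."""
    return "\n".join(f"{start} {end} {label}" for start, end, label in spans)


def decode(text: str) -> List[Span]:
    """Parse 'start end label' lines; labels may contain spaces."""
    spans = []
    for line in text.splitlines():
        start, end, label = line.split(" ", 2)
        spans.append((int(start), int(end), label))
    return spans


# Supported subset: labels without newlines. A label containing a newline
# would be split across lines by encode and is deliberately out of scope.
FIXTURES: List[List[Span]] = [
    [],
    [(0, 5, "PER")],
    [(0, 5, "PER"), (6, 14, "New York")],
]


def round_trip_failures(fixtures: List[List[Span]]) -> List[List[Span]]:
    """Return every fixture x for which decode(encode(x)) != x."""
    return [x for x in fixtures if decode(encode(x)) != x]
```

Running round_trip_failures over the fixture list in a test suite turns the law into a check: an empty result means every supported fixture survives the round trip.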
For the stability contract on the ports and the extras, see stability. For the adapter that proves the data plane end to end, see the HuggingFace path in the guides.