Codecs: external annotation formats¶
This guide covers decoding external annotation formats into Layers records and
encoding records back, through the Codec port. It documents the bundled
CoNLL-U and brat codecs, the CorpusFragment they pivot through, the round-trip
Iso fixtures, and which extras each codec needs.
A codec is a bidirectional converter between an external format and lairs
records, bound to the Codec protocol in lairs.integrations.ports. It
translates an external format's spans and labels into lairs anchors and one of
the annotation kinds, and lairs owns the rest. Resolve a codec by name through the
registry:
import lairs
ConlluCodec = lairs.codec("conllu")
BratCodec = lairs.codec("brat")
codec = ConlluCodec()
An unknown name raises UnknownAdapterError, listing the available codecs.
The CorpusFragment pivot¶
A codec decodes into and encodes from a CorpusFragment: a didactic model
holding a batch of FragmentRecord entries. Each FragmentRecord carries its
local_id, its collection nsid, and the record value as a JSON string, so a
fragment is independent of any one generated namespace module and round-trips
losslessly.
from lairs.integrations.codecs import CorpusFragment, FragmentRecord
fragment = codec.decode(conllu_text)
fragment.source # "conllu"
for record in fragment.records:
print(record.local_id, record.nsid) # "expression" "pub.layers.expression", ...
decode accepts str or bytes and an optional into fragment to extend with
the decoded records. encode takes an iterable of FragmentRecord and returns
the external representation.
CoNLL-U¶
ConlluCodec maps Universal Dependencies CoNLL-U onto Layers:
- the
FORMcolumn becomes aToken, and a sentence's tokens become aTokenizationinside oneSegmentationrecord. UPOS,XPOS, andLEMMAbecome token-tagAnnotationLayerrecords anchored by token index, withFEATScarried as per-annotation features.HEAD/DEPRELbecome a relation (dependency) layer, each arc carrying aheadIndex(-1at the root) and atargetIndex.
codec = lairs.codec("conllu")()
fragment = codec.decode(conllu_text) # expression, segmentation, tag and dep layers
restored = codec.encode(fragment.records)
ConlluCodec.decode and .encode require the optional conllu library, from
the lairs[conllu] extra. It is imported lazily inside the methods, so importing
the module never pulls the dependency in. Calling decode/encode without it
raises ModuleNotFoundError with an install hint.
The parse used by the structural path is standard-library only. The dependency-
free path is ConlluIso, an Iso between a parsed sentence and a fragment that
operates over parsed structures and so needs no extra:
from lairs.integrations.codecs.conllu import ConlluIso
iso = ConlluIso()
fragment = iso.forward(parsed_sentence)
sentence = iso.backward(fragment) # backward(forward(x)) == x on the subset
brat¶
BratCodec parses brat standoff directly, with no third-party dependency even
when the lairs[brat] extra is declared. The .ann lines it understands:
Ttext-bound entities (START/ENDare UTF-8 byte offsets), which become a span layer anchored by aSpan.Rbinary relations, which become a relation layer.Aattributes on an entity, carried as annotation features.
The .txt and .ann halves are combined into one source string separated by a
sentinel line, so one decode/encode pair round-trips both halves.
codec = lairs.codec("brat")()
source = txt + "\n===ANN===\n" + ann
fragment = codec.decode(source) # expression, span layer, relation layer
combined = codec.encode(fragment.records)
Because brat is plain text, the whole codec works without any extra installed.
Round-trip laws¶
Each codec ships an Iso whose round-trip law fixtures verify
backward(forward(x)) == x on the supported subset. The CoNLL-U subset is one
tokenization with UPOS/XPOS/lemma tags, morphological features, and a
projective dependency tree. The brat subset is text-bound entities, binary
relations, and attributes. canonical_standoff returns the canonical,
round-trippable form (entity tags T1..Tn, relation tags R1..Rn, attribute
tags A1..An, attributes grouped under their target) that the fixtures sample
from.
Extras at a glance¶
| Codec | Extra | Works without the extra |
|---|---|---|
conllu |
lairs[conllu] (the conllu library) |
ConlluIso only. ConlluCodec.decode/.encode raise ModuleNotFoundError |
brat |
none required | the full codec |
See also¶
- Dataset API for working with the records a codec produces.
- Authoring for building records and layers directly.
- Knowledge bases for grounding decoded labels.