Format codecs

This page documents the shared corpus-fragment models and the CoNLL-U and brat codecs, all of which bind to the Codec port. A codec decodes an external format into a CorpusFragment of records and encodes records back into that format. For usage see Guides > Format codecs.

Fragment models

lairs.integrations.codecs.CorpusFragment

Bases: Model

A batch of decoded records produced by a codec.

A fragment is the pivot a codec decodes into and encodes from.

ATTRIBUTE DESCRIPTION
records

The decoded records.

TYPE: tuple of FragmentRecord, optional

source

The originating format name, when known.

TYPE: str or None, optional

lairs.integrations.codecs.FragmentRecord

Bases: Model

A single decoded record inside a corpus fragment.

The record value is carried as a JSON string so a fragment is independent of any one generated namespace module and round-trips losslessly.

ATTRIBUTE DESCRIPTION
local_id

The local identifier or AT-URI of the record within the fragment.

TYPE: str

nsid

The record collection NSID (for example pub.layers.expression).

TYPE: str

value_json

The record value serialised as a JSON string.

TYPE: str
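A minimal sketch of why the value travels as a JSON string. It uses a plain dataclass as a stand-in for FragmentRecord (the real class derives from Model); the stand-in name RecordSketch, the local id, and the payload shape are assumptions made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordSketch:
    """Hypothetical stand-in mirroring FragmentRecord's three attributes."""

    local_id: str
    nsid: str
    value_json: str


# The payload shape is invented for illustration only.
payload = {"text": "Dogs bark .", "language": "en"}
record = RecordSketch(
    local_id="expr-0",
    nsid="pub.layers.expression",
    value_json=json.dumps(payload, sort_keys=True),
)

# A consumer parses the string only when it needs the value, so the
# fragment itself never depends on a generated record class.
restored = json.loads(record.value_json)
```

Because the value is an opaque string, a codec can carry records of any NSID without importing the module that models them.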

CoNLL-U

lairs.integrations.codecs.conllu

CoNLL-U format codec.

Converts between CoNLL-U (Universal Dependencies) and lairs records, binding to the :class:~lairs.integrations.ports.Codec port. The optional conllu library (the lairs[conllu] extra) is imported lazily inside the codec methods, with a clear error when it is missing, so importing this module never pulls the dependency in.
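The lazy-import pattern described above can be sketched roughly as follows. The helper require_optional and the class LazyCodecSketch are hypothetical names, not part of lairs; only conllu.parse is taken from this page.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, or explain how to get it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        message = (
            f"{module_name!r} is required for this codec; "
            f"install it with the {extra!r} extra"
        )
        raise ModuleNotFoundError(message, name=module_name) from exc


class LazyCodecSketch:
    """Hypothetical codec: nothing is imported until decode() runs."""

    def decode(self, src: str):
        # The import happens here, not at module import time.
        conllu = require_optional("conllu", "lairs[conllu]")
        return conllu.parse(src)
```

Defining the class never touches the optional library; only a call to decode does, which is what keeps a plain import of the module cheap and safe.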

:class:ConlluCodec parses with the conllu library (conllu.parse) and serialises with its TokenList.serialize, so it inherits that library's multi-sentence, comment, multiword-token, and sentence-metadata handling. :class:ConlluIso, by contrast, works over already-parsed structures and relies only on a self-contained standard-library parser, so it never requires the optional dependency.

The CoNLL-U surface maps onto Layers as follows:

  • the FORM column of each token-line becomes a :class:~lairs.records.segmentation.Token, and a sentence's tokens become a :class:~lairs.records.segmentation.Tokenization inside one :class:~lairs.records.segmentation.Segmentation record. Each token's textSpan carries the true UTF-8 byte offsets of its form in the sentence text (the tokens are joined with single spaces).
  • UPOS and XPOS become token-tag :class:~lairs.records.annotation.AnnotationLayer records anchored by token index; LEMMA becomes a token-tag lemma layer; FEATS are carried as per-annotation features.
  • HEAD/DEPREL become a relation (dependency) annotation layer, with each arc carrying headIndex (-1 at the root) and targetIndex.
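The byte-offset rule for textSpan can be illustrated with a short sketch. It assumes half-open (start, end) pairs; the function name token_byte_spans is hypothetical and the real span model may name or bound its fields differently.

```python
def token_byte_spans(forms: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) UTF-8 byte offsets of each form in the joined text."""
    spans = []
    cursor = 0
    for form in forms:
        width = len(form.encode("utf-8"))
        spans.append((cursor, cursor + width))
        cursor += width + 1  # one byte for the single joining space
    return spans


forms = ["Zoë", "naps", "."]
text = " ".join(forms)
spans = token_byte_spans(forms)

# Slicing the encoded text with a span recovers the original form.
recovered = [text.encode("utf-8")[s:e].decode("utf-8") for s, e in spans]
```

Note that "Zoë" occupies four bytes but three characters, so byte offsets and character offsets diverge from the second token onward.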

Only the basic HEAD/DEPREL dependency tree is modelled. The CoNLL-U DEPS column (enhanced Universal Dependencies, which may attach several governors to one token) is not decoded: the relation layer models exactly one head per token. Multiword-token range rows (1-2) and empty-node rows (1.1) are skipped; their surface text is recovered from the per-token forms.
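A rough sketch of the row filtering and arc construction described above. It assumes headIndex is the 0-based index of the governing token (so CoNLL-U HEAD 0 becomes -1) and that every kept row has a numeric HEAD; the function names and the "deprel" key are hypothetical.

```python
def basic_rows(sentence: str) -> list[list[str]]:
    """Keep only basic token rows: integer IDs, no ranges or empty nodes."""
    rows = []
    for line in sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank separators and comment/metadata lines
        columns = line.split("\t")
        if not columns[0].isdigit():
            continue  # "1-2" (multiword range) and "1.1" (empty node)
        rows.append(columns)
    return rows


def dependency_arcs(rows: list[list[str]]) -> list[dict]:
    """One arc per token; HEAD 0 (the root) maps to headIndex -1."""
    return [
        {"headIndex": int(row[6]) - 1, "targetIndex": index, "deprel": row[7]}
        for index, row in enumerate(rows)
    ]


sample = "\n".join([
    "# text = vámonos ya",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
    "3\tya\tya\tADV\t_\t_\t1\tadvmod\t_\t_",
])
rows = basic_rows(sample)
arcs = dependency_arcs(rows)
```

The range row 1-2 is dropped, and the DEPS column (index 8) is never read, which mirrors the one-head-per-token limitation.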

ConlluCodec

A bidirectional CoNLL-U codec.

Decodes CoNLL-U text into a :class:~lairs.integrations.codecs.CorpusFragment carrying an expression record, a segmentation record, token-tag layers for UPOS/XPOS/lemma, and a relation layer for the dependency parse. Encoding reverses the transform.

decode

decode(
    src: str | bytes, *, into: CorpusFragment | None = None
) -> CorpusFragment

Decode CoNLL-U text into a corpus fragment.

Every sentence in the source is decoded. Each sentence contributes its own expression, segmentation, and annotation-layer records, with the local ids and uuids suffixed by the sentence's 0-based index (the first sentence keeps the unsuffixed ids for symmetry with the single-sentence :class:ConlluIso).
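The id-suffixing rule can be pictured as below. This page does not state the actual suffix format, so the "-" separator, the base id "segmentation", and the helper name suffixed are placeholders chosen purely for illustration.

```python
def suffixed(base_id: str, sentence_index: int) -> str:
    """First sentence keeps the base id; later ones append their index."""
    if sentence_index == 0:
        return base_id
    return f"{base_id}-{sentence_index}"  # separator is an assumption


ids = [suffixed("segmentation", index) for index in range(3)]
```

Under these assumptions a single-sentence source yields exactly the ids a single-sentence Iso would produce, which is the symmetry the paragraph above refers to.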

PARAMETER DESCRIPTION
src

The CoNLL-U source, which may hold any number of sentences.

TYPE: str or bytes

into

An existing fragment to extend with the decoded records.

TYPE: CorpusFragment or None DEFAULT: None

RETURNS DESCRIPTION
CorpusFragment

The decoded fragment.

RAISES DESCRIPTION
ModuleNotFoundError

When the optional conllu library is not installed.

encode

encode(records: Iterable[FragmentRecord]) -> str

Encode fragment records into CoNLL-U text.

Records are grouped back into their sentences (by the index suffix the decode side assigned), and each sentence is serialised with the conllu library's TokenList.serialize so the output is conformant CoNLL-U.

PARAMETER DESCRIPTION
records

The records to encode.

TYPE: collections.abc.Iterable of FragmentRecord

RETURNS DESCRIPTION
str

The CoNLL-U representation.

RAISES DESCRIPTION
ModuleNotFoundError

When the optional conllu library is not installed.

ConlluIso

Bases: Iso[_ConlluSentence, CorpusFragment]

An :class:~didactic.api.Iso between a CoNLL-U sentence and a fragment.

The forward direction builds a corpus fragment from a parsed sentence; the backward direction recovers the sentence. Round-trip law fixtures verify that backward(forward(x)) == x on the supported subset (one tokenisation with UPOS/XPOS/lemma tags, morphological features, and a projective dependency tree). This Iso operates over parsed structures, so it does not require the optional conllu library.
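The round-trip law can be illustrated with a toy Iso. Everything here is a stand-in: SentenceSketch, ToyIso, and the dict it produces are hypothetical and far simpler than _ConlluSentence and CorpusFragment; only the forward/backward shape and the law itself come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceSketch:
    """Hypothetical parsed sentence: parallel forms and UPOS tags."""

    forms: tuple[str, ...]
    upos: tuple[str, ...]


class ToyIso:
    """Stand-in with the forward/backward shape described above."""

    def forward(self, a: SentenceSketch) -> dict[str, list[str]]:
        return {"tokens": list(a.forms), "upos": list(a.upos)}

    def backward(self, b: dict[str, list[str]]) -> SentenceSketch:
        return SentenceSketch(tuple(b["tokens"]), tuple(b["upos"]))


def holds_round_trip(iso: ToyIso, sample: SentenceSketch) -> bool:
    """Check the law backward(forward(x)) == x for one sample."""
    return iso.backward(iso.forward(sample)) == sample


sample = SentenceSketch(forms=("Dogs", "bark"), upos=("NOUN", "VERB"))
law_holds = holds_round_trip(ToyIso(), sample)
```

A law fixture would run the same check over many generated samples drawn from the supported subset rather than one hand-written sentence.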

forward

forward(a: _ConlluSentence) -> CorpusFragment

Build a corpus fragment from a parsed sentence.

PARAMETER DESCRIPTION
a

The parsed CoNLL-U sentence.

TYPE: _ConlluSentence

RETURNS DESCRIPTION
CorpusFragment

The fragment of expression, segmentation, and layer records.

backward

backward(b: CorpusFragment) -> _ConlluSentence

Recover a parsed sentence from a corpus fragment.

PARAMETER DESCRIPTION
b

The fragment to recover the sentence from.

TYPE: CorpusFragment

RETURNS DESCRIPTION
_ConlluSentence

The parsed CoNLL-U sentence.

brat

lairs.integrations.codecs.brat

brat standoff format codec.

Converts between brat standoff annotation files and lairs records, binding to the :class:~lairs.integrations.ports.Codec port. brat is a plain-text format, so this codec parses it directly with no third-party dependency even when the lairs[brat] extra is declared.

The brat standoff format pairs a .txt document with a .ann annotation file. The .ann lines this codec understands are:

  • Tn<TAB>TYPE START END<TAB>TEXT - a text-bound entity, where START and END are UTF-8 byte offsets into the document text (the pivot for a span annotation kind anchored by a :class:~lairs.records.defs.Span). A discontinuous entity (TYPE 0 5;8 12) is collapsed to the single enclosing span (see :func:_parse_entity).
  • Rn<TAB>TYPE Arg1:Tx Arg2:Ty - a binary relation between two entities (the pivot for a relation annotation kind).
  • An<TAB>TYPE Tx[ VALUE] - an attribute on an entity (a binary flag when no value is given), carried as annotation features.

Only the T, R, and A line kinds are decoded. Event (E), normalisation (N), equivalence (*), and note (#) lines, and any structurally malformed T/R/A line, are skipped: the decode is lossy for those, and the dropped lines are not reported. A consumer needing those kinds must extend the parser.
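The line handling above can be sketched as a small parser. This is an illustration under assumptions, not the codec's implementation: parse_ann and its tuple layouts are hypothetical, offsets are taken as written, and the real :func:_parse_entity may treat edge cases differently.

```python
def parse_ann(ann: str) -> dict[str, list[tuple]]:
    """Collect T, R, and A lines; silently skip everything else."""
    parsed: dict[str, list[tuple]] = {"T": [], "R": [], "A": []}
    for line in ann.splitlines():
        tag, _, rest = line.partition("\t")
        kind = tag[:1]
        if kind not in parsed or not rest:
            continue  # E, N, *, # lines and lines with no body
        try:
            if kind == "T":
                head, _, text = rest.partition("\t")
                label, _, offsets = head.partition(" ")
                pairs = [tuple(map(int, p.split())) for p in offsets.split(";")]
                # Collapse discontinuous fragments to the enclosing span.
                start = min(s for s, _ in pairs)
                end = max(e for _, e in pairs)
                parsed["T"].append((tag, label, start, end, text))
            elif kind == "R":
                label, arg1, arg2 = rest.split()
                source = arg1.partition(":")[2]
                target = arg2.partition(":")[2]
                parsed["R"].append((tag, label, source, target))
            else:
                label, target, *value = rest.split(" ", 2)
                # A missing value marks a binary flag.
                parsed["A"].append((tag, label, target, value[0] if value else None))
        except ValueError:
            continue  # structurally malformed T/R/A line
    return parsed


ann = "\n".join([
    "T1\tPerson 0 5\tAlice",
    "T2\tLocation 9 14;17 24\tNorth America",
    "R1\tLivesIn Arg1:T1 Arg2:T2",
    "A1\tNegation T1",
    "A2\tConfidence T2 High",
    "E1\tMove:T1 Destination:T2",
    "#1\tAnnotatorNotes T1\tcheck this",
    "T3\tBroken not-a-number\toops",
])
result = parse_ann(ann)
```

The event, note, and malformed lines vanish without any diagnostic, which is the lossy behaviour the paragraph above warns about.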

A brat binary-flag attribute (one with no value) is encoded as an empty feature value and recovered as a binary flag. An empty string is not a legal brat attribute value, so it is a safe flag sentinel, and a valued attribute whose value is literally "true" is preserved verbatim rather than collapsing to a flag (see :func:_attribute_pairs).
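A minimal sketch of the empty-string sentinel, with None standing for "no value given". The two function names are hypothetical and are not the codec's :func:_attribute_pairs.

```python
from typing import Optional


def attribute_to_feature(value: Optional[str]) -> str:
    """Encode a brat attribute value; a bare flag becomes the empty string."""
    return "" if value is None else value


def feature_to_attribute(feature: str) -> Optional[str]:
    """Recover the attribute value; the empty string means a bare flag."""
    return None if feature == "" else feature


flag_round_trip = feature_to_attribute(attribute_to_feature(None))
true_round_trip = feature_to_attribute(attribute_to_feature("true"))
```

Had the flag been encoded as the string "true" instead, a valued attribute that really says "true" would be indistinguishable from a flag on the way back.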

The .txt and .ann are combined into a single source string separated by a sentinel line so the codec round-trips both halves through one :meth:decode/:meth:encode pair. :func:split_standoff recovers the conventional .txt and .ann halves from an encoded source.

BratCodec

A bidirectional brat standoff codec.

Decodes a combined .txt/.ann source into a :class:~lairs.integrations.codecs.CorpusFragment holding one expression record, a span :class:~lairs.records.annotation.AnnotationLayer for the text-bound entities, and (when relations are present) a relation layer. Encoding reverses the transform.

decode

decode(
    src: str | bytes, *, into: CorpusFragment | None = None
) -> CorpusFragment

Decode brat standoff into a corpus fragment.

PARAMETER DESCRIPTION
src

The combined .txt and .ann source, with the two halves separated by the brat sentinel line.

TYPE: str or bytes

into

An existing fragment to extend with the decoded records.

TYPE: CorpusFragment or None DEFAULT: None

RETURNS DESCRIPTION
CorpusFragment

The decoded fragment.

encode

encode(records: Iterable[FragmentRecord]) -> str

Encode fragment records into brat standoff text.

PARAMETER DESCRIPTION
records

The records to encode. The expression record supplies the .txt half and the span/relation layers supply the .ann half.

TYPE: collections.abc.Iterable of FragmentRecord

RETURNS DESCRIPTION
str

The combined .txt and .ann representation.

BratIso

Bases: Iso[_Standoff, CorpusFragment]

An :class:~didactic.api.Iso between a brat standoff and a fragment.

The forward direction builds a corpus fragment from a parsed standoff; the backward direction recovers the standoff. Round-trip law fixtures verify that backward(forward(x)) == x on the supported subset (text-bound entities, binary relations, and attributes).

forward

forward(a: _Standoff) -> CorpusFragment

Build a corpus fragment from a parsed standoff.

PARAMETER DESCRIPTION
a

The parsed brat contents.

TYPE: _Standoff

RETURNS DESCRIPTION
CorpusFragment

The fragment of expression and annotation-layer records.

backward

backward(b: CorpusFragment) -> _Standoff

Recover a parsed standoff from a corpus fragment.

PARAMETER DESCRIPTION
b

The fragment to recover the standoff from.

TYPE: CorpusFragment

RETURNS DESCRIPTION
_Standoff

The parsed brat contents.

split_standoff

split_standoff(source: str | bytes) -> tuple[str, str]

Split an encoded brat source into its conventional .txt and .ann.

:meth:BratCodec.encode joins the document text and the standoff annotation body with a sentinel line so both halves round-trip through one string. This recovers the two halves a real brat corpus stores in separate files: write the first element to *.txt and the second to *.ann.

PARAMETER DESCRIPTION
source

An encoded brat source (the sentinel-joined string :meth:BratCodec.encode returns). A source without the sentinel is treated as a bare .ann body over empty text.

TYPE: str or bytes

RETURNS DESCRIPTION
tuple of (str, str)

The .txt document text and the .ann annotation body.
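The join-and-split behaviour can be sketched as follows. The sentinel text used here is a placeholder because this page does not state the real one, and join_standoff and split_standoff_sketch are hypothetical names rather than the library's functions.

```python
# Placeholder only: the codec's actual sentinel line is not documented here.
SENTINEL = "#### ann ####"


def join_standoff(text: str, ann: str) -> str:
    """Join document text and annotation body around the sentinel line."""
    return f"{text}\n{SENTINEL}\n{ann}"


def split_standoff_sketch(source) -> tuple[str, str]:
    """Split a joined source back into (.txt text, .ann body)."""
    if isinstance(source, bytes):
        source = source.decode("utf-8")
    text, found, ann = source.partition(f"\n{SENTINEL}\n")
    if not found:
        return "", source  # no sentinel: bare .ann body over empty text
    return text, ann


joined = join_standoff("Alice lives in Paris.", "T1\tPerson 0 5\tAlice")
halves = split_standoff_sketch(joined)
bare = split_standoff_sketch(b"T1\tPerson 0 5\tAlice")
```

Each element of the returned pair can then be written to its own file, matching the *.txt and *.ann layout of a brat corpus.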

canonical_standoff

canonical_standoff(standoff: _Standoff) -> _Standoff

Return a standoff in the codec's canonical, round-trippable form.

The brat codec preserves a standoff's text, entity geometry, and relations exactly, but normalises identifiers and groups attributes under their target entity. Entity tags become T1..Tn in declaration order, relation tags become R1..Rn, and attribute tags become A1..An ordered by the entity they decorate. Round-trip law fixtures sample from this canonical subset, on which BratIso.backward(BratIso.forward(x)) == x holds.
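The renumbering can be pictured with plain tuples. The internal _Standoff model is not shown on this page, so the tuple layouts (tag, label, start, end, text), (tag, label, arg1, arg2), and (tag, label, target, value) and the function name canonicalise are assumptions for illustration only.

```python
def canonicalise(entities, relations, attributes):
    """Renumber tags: T1..Tn, R1..Rn, A1..An (attributes grouped by entity)."""
    # Entities keep declaration order and receive fresh T-tags.
    rename = {old: f"T{n}" for n, (old, *_) in enumerate(entities, start=1)}
    new_entities = [(rename[old], *rest) for old, *rest in entities]
    new_relations = [
        (f"R{n}", label, rename[arg1], rename[arg2])
        for n, (_, label, arg1, arg2) in enumerate(relations, start=1)
    ]
    # Stable sort: order attributes by the position of the entity they decorate.
    order = {tag: n for n, tag in enumerate(rename.values())}
    by_entity = sorted(attributes, key=lambda a: order[rename[a[2]]])
    new_attributes = [
        (f"A{n}", label, rename[target], value)
        for n, (_, label, target, value) in enumerate(by_entity, start=1)
    ]
    return new_entities, new_relations, new_attributes


entities = [("T7", "Person", 0, 5, "Alice"), ("T2", "City", 15, 20, "Paris")]
relations = [("R9", "LivesIn", "T7", "T2")]
attributes = [("A4", "Confidence", "T2", "High"), ("A1", "Negation", "T7", None)]
canon = canonicalise(entities, relations, attributes)
```

Text, offsets, labels, and relation endpoints survive unchanged; only the tag numbers and the attribute order differ, which is why the round-trip law is stated over this canonical form.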

This operates on the codec's internal parsed-standoff model, which a caller obtains from :meth:BratIso.backward (or builds directly for tests); it is not part of the decode/encode fragment surface, so it is intentionally not re-exported from the package.

PARAMETER DESCRIPTION
standoff

Any parsed standoff (for example the result of :meth:BratIso.backward).

TYPE: _Standoff

RETURNS DESCRIPTION
_Standoff

The canonicalised standoff.