Format codecs¶
This page documents the shared corpus-fragment models and the CoNLL-U and brat codecs,
which bind to the Codec port. A codec decodes an
external format into a CorpusFragment of records and encodes records
back into that format. For usage, see Guides > Format codecs.
Fragment models¶
lairs.integrations.codecs.CorpusFragment ¶
Bases: Model
A batch of decoded records produced by a codec.
A fragment is the pivot a codec decodes into and encodes from.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| records | The decoded records. |
| source | The originating format name, when known. |
lairs.integrations.codecs.FragmentRecord ¶
Bases: Model
A single decoded record inside a corpus fragment.
The record value is carried as a JSON string so a fragment is independent of any one generated namespace module and round-trips losslessly.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| local_id | The local identifier or AT-URI of the record within the fragment. |
| nsid | The record collection NSID (for example |
| value_json | The record value serialised as a JSON string. |
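The JSON-string carrier described above can be illustrated with a minimal sketch. It uses plain dataclasses as stand-ins for the real `Model` classes (they are not the lairs implementation), reuses only the attribute names listed in the tables, and the NSID `org.example.expression` is a made-up placeholder.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class FragmentRecord:
    """Stand-in for the documented FragmentRecord (not the real class)."""

    local_id: str
    nsid: str
    value_json: str


@dataclass
class CorpusFragment:
    """Stand-in for the documented CorpusFragment (not the real class)."""

    records: list = field(default_factory=list)
    source: Optional[str] = None


def make_record(local_id: str, nsid: str, value: dict) -> FragmentRecord:
    # The value travels as a JSON string, so the fragment needs no
    # knowledge of the typed record class that produced it.
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False)
    return FragmentRecord(local_id=local_id, nsid=nsid, value_json=payload)


value = {"text": "Hello world"}
# "org.example.expression" is a hypothetical NSID used for illustration only.
record = make_record("expression", "org.example.expression", value)
fragment = CorpusFragment(records=[record], source="conllu")

# Decoding the string recovers the original value unchanged.
recovered = json.loads(fragment.records[0].value_json)
```

Because the value is opaque to the fragment, a consumer can defer parsing until it knows which record type the `nsid` refers to.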
CoNLL-U¶
lairs.integrations.codecs.conllu ¶
CoNLL-U format codec.
Converts between CoNLL-U (Universal Dependencies) and lairs records, binding to
the :class:~lairs.integrations.ports.Codec port. The optional conllu
library (the lairs[conllu] extra) is imported lazily inside the codec
methods, with a clear error when it is missing, so importing this module never
pulls the dependency in.
:class:ConlluCodec parses with the conllu library (conllu.parse) and
serialises with its TokenList.serialize, so it inherits that library's
multi-sentence, comment, multiword-token, and sentence-metadata handling. The
:class:ConlluIso works over already-parsed structures with a self-contained
standard-library parser, so it never requires the optional dependency.
The CoNLL-U surface maps onto Layers as follows:
- the FORM column of each token line becomes a :class:~lairs.records.segmentation.Token, and a sentence's tokens become a :class:~lairs.records.segmentation.Tokenization inside one :class:~lairs.records.segmentation.Segmentation record. Each token's textSpan carries the true UTF-8 byte offsets of its form in the sentence text (the tokens are joined with single spaces).
- UPOS and XPOS become token-tag :class:~lairs.records.annotation.AnnotationLayer records anchored by token index; LEMMA becomes a token-tag lemma layer; FEATS are carried as per-annotation features.
- HEAD/DEPREL become a relation (dependency) annotation layer, with each arc carrying headIndex (-1 at the root) and targetIndex.
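The byte-offset rule in the first item can be sketched with the standard library alone. The helper name `token_byte_spans` is hypothetical, and the sketch assumes half-open `(start, end)` spans; the real textSpan shape may differ.

```python
def token_byte_spans(forms):
    """Join forms with single spaces and return (text, UTF-8 byte spans)."""
    text = " ".join(forms)
    spans = []
    cursor = 0
    for form in forms:
        size = len(form.encode("utf-8"))  # bytes, not characters
        spans.append((cursor, cursor + size))
        cursor += size + 1  # skip the single joining space (one byte)
    return text, spans


# "É" occupies two bytes in UTF-8, so later offsets shift by one
# relative to character offsets.
text, spans = token_byte_spans(["Él", "come", "pan"])
raw = text.encode("utf-8")
slices = [raw[start:end].decode("utf-8") for start, end in spans]
```

Slicing the encoded sentence with each span returns the original form, which is the property byte offsets need to hold for non-ASCII text.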
Only the basic HEAD/DEPREL dependency tree is modelled. The CoNLL-U
DEPS column (enhanced Universal Dependencies, which may attach several
governors to one token) is not decoded: the relation layer models exactly one
head per token. Multiword-token range rows (1-2) and empty-node rows
(1.1) are skipped; their surface text is recovered from the per-token forms.
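The row filtering and the basic-tree restriction described above might look like the following standard-library sketch. It is not the codec's parser: the function name `basic_arcs` and the `deprel` key are illustrative, and a 0-based `targetIndex` is an assumption inferred from `headIndex` being -1 at the root.

```python
SAMPLE = "\n".join(
    [
        "# text = vámonos ya",
        "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",  # multiword-token range row
        "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
        "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
        "2.1\tnull\t_\t_\t_\t_\t_\t_\t_\t_",  # empty-node row
        "3\tya\tya\tADV\t_\t_\t1\tadvmod\t_\t_",
    ]
)


def basic_arcs(block):
    """Collect one HEAD/DEPREL arc per ordinary token row."""
    arcs = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # "1-2" ranges and "2.1" empty nodes are skipped
        arcs.append(
            {
                "headIndex": int(cols[6]) - 1,  # HEAD 0 (root) becomes -1
                "targetIndex": int(cols[0]) - 1,
                "deprel": cols[7],
            }
        )
        # cols[8] (DEPS) is deliberately ignored: one head per token.
    return arcs


arcs = basic_arcs(SAMPLE)
```

Each surviving row yields exactly one arc, which is why enhanced dependencies with several governors per token cannot be represented in this layer.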
ConlluCodec ¶
A bidirectional CoNLL-U codec.
Decodes CoNLL-U text into a
:class:~lairs.integrations.codecs.CorpusFragment carrying an expression
record, a segmentation record, token-tag layers for UPOS/XPOS/lemma,
and a relation layer for the dependency parse. Encoding reverses the
transform.
decode ¶
decode(
src: str | bytes, *, into: CorpusFragment | None = None
) -> CorpusFragment
Decode CoNLL-U text into a corpus fragment.
Every sentence in the source is decoded. Each sentence contributes its
own expression, segmentation, and annotation-layer records, with the
local ids and uuids suffixed by the sentence's 0-based index (the first
sentence keeps the unsuffixed ids for symmetry with the single-sentence
:class:ConlluIso).
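As a rough illustration of the suffixing scheme, the sketch below assigns and recovers a sentence index. The `-` separator, the base id `expression`, and both function names are assumptions made for the example; the actual suffix format is not specified here.

```python
import re

# Hypothetical suffix shape: "<base>-<index>". The real separator may differ.
_SUFFIX = re.compile(r"^(?P<base>.+)-(?P<index>\d+)$")


def suffixed_id(base, index):
    """Return the local id for a sentence; sentence 0 stays unsuffixed."""
    return base if index == 0 else f"{base}-{index}"


def sentence_index(local_id):
    """Recover the sentence index, treating an unsuffixed id as sentence 0."""
    match = _SUFFIX.match(local_id)
    return int(match.group("index")) if match else 0


ids = [suffixed_id("expression", i) for i in range(3)]
indices = [sentence_index(local_id) for local_id in ids]
```

Keeping the first sentence unsuffixed means a single-sentence decode yields the same ids as the single-sentence Iso, at the cost of treating "no suffix" as index 0 when grouping records back into sentences.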
| PARAMETER | DESCRIPTION |
|---|---|
| src | The CoNLL-U source, which may hold any number of sentences. |
| into | An existing fragment to extend with the decoded records. |

| RETURNS | DESCRIPTION |
|---|---|
| CorpusFragment | The decoded fragment. |

| RAISES | DESCRIPTION |
|---|---|
| ModuleNotFoundError | When the optional |
encode ¶
encode(records: Iterable[FragmentRecord]) -> str
Encode fragment records into CoNLL-U text.
Records are grouped back into their sentences (by the index suffix the
decode side assigned), and each sentence is serialised with the
conllu library's TokenList.serialize so the output is conformant
CoNLL-U.
| PARAMETER | DESCRIPTION |
|---|---|
| records | The records to encode. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The CoNLL-U representation. |

| RAISES | DESCRIPTION |
|---|---|
| ModuleNotFoundError | When the optional |
ConlluIso ¶
Bases: Iso[_ConlluSentence, CorpusFragment]
An :class:~didactic.api.Iso between a CoNLL-U sentence and a fragment.
The forward direction builds a corpus fragment from a parsed sentence; the
backward direction recovers the sentence. Round-trip law fixtures verify
that backward(forward(x)) == x on the supported subset (one tokenisation
with UPOS/XPOS/lemma tags, morphological features, and a projective
dependency tree). This Iso operates over parsed structures, so it does not
require the optional conllu library.
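The round-trip law, and the reason it is stated only for a supported subset, can be shown with a toy Iso. The `Iso` base class below is a stand-in written for this example (it is not the didactic.api class), and `FormsIso` is a hypothetical pair of conversions between a tuple of token forms and a space-joined string.

```python
from typing import Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


class Iso(Generic[A, B]):
    """Stand-in for an Iso: a pair of inverse conversions."""

    def forward(self, a: A) -> B:
        raise NotImplementedError

    def backward(self, b: B) -> A:
        raise NotImplementedError


class FormsIso(Iso[tuple, str]):
    """Toy Iso between a tuple of token forms and a space-joined string."""

    def forward(self, a):
        return " ".join(a)

    def backward(self, b):
        return tuple(b.split(" "))


def round_trips(iso, x):
    """Check the law backward(forward(x)) == x for one value."""
    return iso.backward(iso.forward(x)) == x


iso = FormsIso()
inside_subset = round_trips(iso, ("The", "cat", "sleeps"))
# A form containing a space falls outside the supported subset:
# it is split into two forms on the way back.
outside_subset = round_trips(iso, ("New York",))
```

A law fixture samples only values from the supported subset, so inputs like the second one are excluded from the guarantee rather than contradicting it.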
forward ¶
forward(a: _ConlluSentence) -> CorpusFragment
Build a corpus fragment from a parsed sentence.
| PARAMETER | DESCRIPTION |
|---|---|
| a | The parsed CoNLL-U sentence. |

| RETURNS | DESCRIPTION |
|---|---|
| CorpusFragment | The fragment of expression, segmentation, and layer records. |
backward ¶
backward(b: CorpusFragment) -> _ConlluSentence
Recover a parsed sentence from a corpus fragment.
| PARAMETER | DESCRIPTION |
|---|---|
| b | The fragment to recover the sentence from. |

| RETURNS | DESCRIPTION |
|---|---|
| _ConlluSentence | The parsed CoNLL-U sentence. |
brat¶
lairs.integrations.codecs.brat ¶
brat standoff format codec.
Converts between brat standoff annotation files and lairs records, binding to
the :class:~lairs.integrations.ports.Codec port. brat is a plain-text format,
so this codec parses it directly with no third-party dependency even when the
lairs[brat] extra is declared.
The brat standoff format pairs a .txt document with a .ann annotation
file. The .ann lines this codec understands are:
- Tn<TAB>TYPE START END<TAB>TEXT: a text-bound entity, where START and END are UTF-8 byte offsets into the document text (the pivot for a span annotation kind anchored by a :class:~lairs.records.defs.Span). A discontinuous entity (TYPE 0 5;8 12) is collapsed to the single enclosing span (see :func:_parse_entity).
- Rn<TAB>TYPE Arg1:Tx Arg2:Ty: a binary relation between two entities (the pivot for a relation annotation kind).
- An<TAB>TYPE Tx[ VALUE]: an attribute on an entity (a binary flag when no value is given), carried as annotation features.
Only the T, R, and A line kinds are decoded. Event (E),
normalisation (N), equivalence (*), and note (#) lines, and any
structurally malformed T/R/A line, are skipped: the decode is lossy
for those, and the dropped lines are not reported. A consumer needing those
kinds must extend the parser.
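The line kinds and the silent-skip behaviour described above can be approximated with a short standard-library parser. This is a sketch written for illustration, not the codec's own parser: `parse_ann` and the tuple shapes it returns are hypothetical.

```python
def parse_ann(ann):
    """Parse T, R, and A lines; silently skip everything else."""
    entities, relations, attributes = {}, {}, {}
    for line in ann.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or not parts[0]:
            continue
        tag, body = parts[0], parts[1]
        try:
            if tag[0] == "T":
                etype, _, offsets = body.partition(" ")
                fragments = [
                    tuple(int(n) for n in frag.split())
                    for frag in offsets.split(";")
                ]
                # A discontinuous entity collapses to its enclosing span.
                start = min(frag[0] for frag in fragments)
                end = max(frag[1] for frag in fragments)
                text = parts[2] if len(parts) > 2 else ""
                entities[tag] = (etype, start, end, text)
            elif tag[0] == "R":
                rtype, arg1, arg2 = body.split(" ")
                relations[tag] = (
                    rtype,
                    arg1.split(":", 1)[1],
                    arg2.split(":", 1)[1],
                )
            elif tag[0] == "A":
                fields = body.split(" ", 2)
                value = fields[2] if len(fields) > 2 else None  # None = flag
                attributes[tag] = (fields[0], fields[1], value)
            # E, N, *, and # lines match no branch and are dropped.
        except (ValueError, IndexError):
            continue  # structurally malformed T/R/A line: dropped, unreported
    return entities, relations, attributes


ANN = "\n".join(
    [
        "T1\tPerson 0 5\tAlice",
        "T2\tOrg 15 19;20 24\tAcme Corp",
        "R1\tWorks_At Arg1:T1 Arg2:T2",
        "A1\tNegated T1",
        "A2\tConfidence T2 High",
        "E1\tHire:T3 Employee:T1",
        "#1\tAnnotatorNotes T1\tcheck this",
        "T9\tBroken five ten\toops",
    ]
)
entities, relations, attributes = parse_ann(ANN)
```

The event, note, and malformed lines vanish without any signal, which is the lossy behaviour a consumer has to account for when it needs those kinds.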
A brat binary-flag attribute (no value) is encoded as an empty feature value
and recovered as a binary flag, so a valued attribute whose value is literally
"true" is preserved verbatim rather than collapsing to a flag (an empty
string is not a legal brat attribute value, so it is a safe flag sentinel; see
:func:_attribute_pairs).
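The flag sentinel can be sketched as a pair of inverse helpers. The function names are hypothetical, and `None` is used here to stand for "no value given"; the internal representation in the codec may differ.

```python
def attribute_to_feature(name, value):
    """Encode an attribute as a feature; None marks a binary flag."""
    return name, "" if value is None else value


def feature_to_attribute(name, feature_value):
    """Decode a feature; the empty string is the binary-flag sentinel."""
    return name, None if feature_value == "" else feature_value


flag = feature_to_attribute(*attribute_to_feature("Negated", None))
# A valued attribute whose value is literally "true" stays a valued attribute.
valued = feature_to_attribute(*attribute_to_feature("Negated", "true"))
```

Using the empty string rather than a word such as "true" as the sentinel is what keeps the two cases distinguishable after a round trip.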
The .txt and .ann are combined into a single source string separated by
a sentinel line so the codec round-trips both halves through one
:meth:decode/:meth:encode pair. :func:split_standoff recovers the
conventional .txt and .ann halves from an encoded source.
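The sentinel-joined layout can be illustrated as follows. The sentinel string below is invented for the example (the real sentinel line is not given here), and `join_source` and `split_source` are hypothetical helpers rather than the package's functions.

```python
# Hypothetical sentinel; the codec's actual sentinel line is not shown here.
SENTINEL = "=== hypothetical-ann-sentinel ==="


def join_source(text, ann):
    """Combine document text and annotation body into one source string."""
    return f"{text}\n{SENTINEL}\n{ann}"


def split_source(source):
    """Recover the (text, ann) halves from a sentinel-joined source."""
    text, found, ann = source.partition(f"\n{SENTINEL}\n")
    if not found:
        raise ValueError("source does not contain the sentinel line")
    return text, ann


document = "Alice works at Acme."
annotations = "T1\tPerson 0 5\tAlice"
combined = join_source(document, annotations)
txt_half, ann_half = split_source(combined)
```

The two recovered halves are what would be written to the `.txt` and `.ann` files of a conventional brat corpus.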
BratCodec ¶
A bidirectional brat standoff codec.
Decodes a combined .txt/.ann source into a
:class:~lairs.integrations.codecs.CorpusFragment holding one expression
record, a span :class:~lairs.records.annotation.AnnotationLayer for the
text-bound entities, and (when relations are present) a relation layer.
Encoding reverses the transform.
decode ¶
decode(
src: str | bytes, *, into: CorpusFragment | None = None
) -> CorpusFragment
Decode brat standoff into a corpus fragment.
| PARAMETER | DESCRIPTION |
|---|---|
| src | The combined |
| into | An existing fragment to extend with the decoded records. |

| RETURNS | DESCRIPTION |
|---|---|
| CorpusFragment | The decoded fragment. |
encode ¶
encode(records: Iterable[FragmentRecord]) -> str
Encode fragment records into brat standoff text.
| PARAMETER | DESCRIPTION |
|---|---|
| records | The records to encode. The expression record supplies the |

| RETURNS | DESCRIPTION |
|---|---|
| str | The combined |
BratIso ¶
Bases: Iso[_Standoff, CorpusFragment]
An :class:~didactic.api.Iso between a brat standoff and a fragment.
The forward direction builds a corpus fragment from a parsed standoff; the
backward direction recovers the standoff. Round-trip law fixtures verify
that backward(forward(x)) == x on the supported subset (text-bound
entities, binary relations, and attributes).
forward ¶
forward(a: _Standoff) -> CorpusFragment
Build a corpus fragment from a parsed standoff.
| PARAMETER | DESCRIPTION |
|---|---|
| a | The parsed brat contents. |

| RETURNS | DESCRIPTION |
|---|---|
| CorpusFragment | The fragment of expression and annotation-layer records. |
backward ¶
backward(b: CorpusFragment) -> _Standoff
Recover a parsed standoff from a corpus fragment.
| PARAMETER | DESCRIPTION |
|---|---|
| b | The fragment to recover the standoff from. |

| RETURNS | DESCRIPTION |
|---|---|
| _Standoff | The parsed brat contents. |
split_standoff ¶
Split an encoded brat source into its conventional .txt and .ann.
:meth:BratCodec.encode joins the document text and the standoff annotation
body with a sentinel line so both halves round-trip through one string. This
recovers the two halves a real brat corpus stores in separate files: write
the first element to *.txt and the second to *.ann.
| PARAMETER | DESCRIPTION |
|---|---|
| source | An encoded brat source (the sentinel-joined string :meth: |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of (str, str) | The |
canonical_standoff ¶
Return a standoff in the codec's canonical, round-trippable form.
The brat codec preserves a standoff's text, entity geometry, and relations
exactly, but normalises identifiers and groups attributes under their
target entity. Entity tags become T1..Tn in declaration order, relation
tags become R1..Rn, and attribute tags become A1..An ordered by the
entity they decorate. Round-trip law fixtures sample from this canonical
subset, on which BratIso.backward(BratIso.forward(x)) == x holds.
This operates on the codec's internal parsed-standoff model, which a caller
obtains from :meth:BratIso.backward (or builds directly for tests); it is
not part of the decode/encode fragment surface, so it is intentionally not
re-exported from the package.
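The renumbering rules above can be sketched over plain tuples. This is an illustration only: the tuple shapes and the name `canonicalise` are assumptions, not the internal parsed-standoff model.

```python
def canonicalise(entities, relations, attributes):
    """Renumber tags to T1..Tn, R1..Rn, A1..An.

    entities:   list of (tag, type, start, end)
    relations:  list of (tag, type, arg1_tag, arg2_tag)
    attributes: list of (tag, type, target_tag, value)
    """
    # Entities keep declaration order; only their tags change.
    entity_ids = {
        old: f"T{i}" for i, (old, *_rest) in enumerate(entities, 1)
    }
    new_entities = [(entity_ids[tag], *rest) for tag, *rest in entities]
    new_relations = [
        (f"R{i}", rtype, entity_ids[arg1], entity_ids[arg2])
        for i, (_tag, rtype, arg1, arg2) in enumerate(relations, 1)
    ]
    # Attributes are ordered by the entity they decorate; sorted() is
    # stable, so attributes on the same entity keep their relative order.
    position = {old: i for i, old in enumerate(entity_ids)}
    ordered = sorted(attributes, key=lambda attr: position[attr[2]])
    new_attributes = [
        (f"A{i}", atype, entity_ids[target], value)
        for i, (_tag, atype, target, value) in enumerate(ordered, 1)
    ]
    return new_entities, new_relations, new_attributes


result = canonicalise(
    [("T7", "Person", 0, 5), ("T3", "Org", 15, 19)],
    [("R9", "Works_At", "T7", "T3")],
    [("A5", "Confidence", "T3", "High"), ("A2", "Negated", "T7", None)],
)
# Canonicalising an already canonical standoff changes nothing.
again = canonicalise(*result)
```

Text, spans, and relation arguments are untouched; only the identifiers and the attribute order are normalised, which is why the operation is idempotent.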
| PARAMETER | DESCRIPTION |
|---|---|
| standoff | Any parsed standoff (for example the result of :meth: |

| RETURNS | DESCRIPTION |
|---|---|
| _Standoff | The canonicalised standoff. |