Dataset discovery¶

Discovery composes identity resolution, the PDS and appview clients, and the panproto store into a discovery surface over the Layers network. It spans three tiers: (1) list a single actor's datasets and repository table of contents; (2) fan out over a seed of actors and resolve cross-repo, ref-anchored link queries; and (3) build a local, searchable index from the firehose and a backfill crawl. The discovery shapes carry over the same record envelopes used to read a PDS; see the ATProto reference.

Single-actor discovery¶

List an actor's datasets and repository table of contents. Resolves a handle or DID through the identity resolver and lists the actor's corpora as summary rows, preferring an appview when one is available.

lairs.discovery.actor ¶

Single-actor discovery: list an actor's datasets and repo table of contents.

Resolves a handle or DID through IdentityResolver and lists the actor's corpora as DatasetSummary rows, preferring an appview when one is available and falling back to direct PDS enumeration. table_of_contents reads a repo's collection inventory through describe_repo without dumping records.

list_datasets ¶

list_datasets(
    actor: str,
    *,
    source: str = "auto",
    appview: str | None = None,
    filters: DatasetFilter | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

List an actor's datasets as summaries.

Resolves actor (handle or DID), lists its corpora through an appview when available (server-side language/domain facets) or direct PDS enumeration otherwise, maps each to a DatasetSummary, and applies the remaining facets client-side.

PARAMETER	DESCRIPTION
`actor`	A handle or DID to list datasets for. TYPE: `str`
`source`	One of `"auto"`, `"pds"`, or `"appview"`. TYPE: `str` DEFAULT: `'auto'`
`appview`	An appview base URL; enables the appview path under `auto`. TYPE: `str or None` DEFAULT: `None`
`filters`	Facet and text filters. TYPE: `DatasetFilter or None` DEFAULT: `None`
`resolver`	An injected identity resolver. TYPE: `IdentityResolver or None` DEFAULT: `None`
`pds_client`	An injected PDS client. TYPE: `PdsClient or None` DEFAULT: `None`
`appview_client`	An injected appview client. TYPE: `AppviewClient or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple of DatasetSummary`	The matching dataset summaries, in source order.

RAISES	DESCRIPTION
`ValueError`	If `source` is unknown, or a required endpoint or client is missing.
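
The source-selection rule described above can be sketched in isolation. This is a minimal illustration, not the library's code: `choose_source` is a hypothetical helper, and treating an injected appview client as enabling the appview path under `auto` is an assumption (the reference only states this for the `appview` URL).

```python
from __future__ import annotations


def choose_source(source: str, appview: str | None, has_appview_client: bool) -> str:
    """Return the concrete backend ("pds" or "appview") for a listing call."""
    if source not in ("auto", "pds", "appview"):
        raise ValueError(f"unknown source: {source!r}")
    # Assumption: either an appview URL or an injected client counts as available.
    appview_available = appview is not None or has_appview_client
    if source == "auto":
        # Prefer the appview when one is available, else enumerate the PDS.
        return "appview" if appview_available else "pds"
    if source == "appview" and not appview_available:
        raise ValueError("source='appview' needs an appview URL or client")
    return source
```

Under this sketch, `auto` never raises for a missing appview; it falls back to the PDS path, while an explicit `"appview"` without an endpoint is a configuration error.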

table_of_contents ¶

table_of_contents(
    actor: str,
    *,
    source: str = "auto",
    counts: bool = False,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
) -> RepoTableOfContents

Read an actor's repository inventory.

Uses describe_repo to list the collections present in the repo without enumerating records. Counts are filled only when counts is set, since counting drains every collection. This path is always PDS-backed; the source argument is accepted for API symmetry and validated.

PARAMETER	DESCRIPTION
`actor`	A handle or DID. TYPE: `str`
`source`	Accepted for symmetry with `list_datasets`; the inventory is PDS-backed. TYPE: `str` DEFAULT: `'auto'`
`counts`	Whether to fill per-collection record counts (drains each collection). TYPE: `bool` DEFAULT: `False`
`resolver`	An injected identity resolver. TYPE: `IdentityResolver or None` DEFAULT: `None`
`pds_client`	An injected PDS client. TYPE: `PdsClient or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`RepoTableOfContents`	The repository inventory.

RAISES	DESCRIPTION
`ValueError`	If `source` is unknown, or no PDS endpoint or client is available.

Federated discovery¶

Network discovery without a central index: given a seed of handles or DIDs, list every actor's datasets and merge them, and answer ontology-anchored queries across the seed.

lairs.discovery.federated ¶

Federated discovery: fan out over a seed of actors.

Network discovery without a central index: given a seed of handles or DIDs (a lab roster, a leaderboard, a curated list), list every actor's datasets and merge them. Per-actor transport and resolution failures are isolated so one unreachable actor does not abort the sweep.

discover_datasets ¶

discover_datasets(
    actors: Sequence[str],
    *,
    source: str = "auto",
    appview: str | None = None,
    filters: DatasetFilter | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

List datasets across a seed of actors, deduplicated by corpus AT-URI.

Each actor is listed through lairs.discovery.actor.list_datasets. A single resolver is shared across the whole seed so identity lookups (handle to DID, and the DID document that carries the PDS endpoint) are cached and the underlying HTTP client is opened once: an injected resolver is reused as is, and when none is given a throwaway resolver is created for the sweep and closed before returning. Duplicate corpora (the same AT-URI seen via more than one actor or source) collapse to the first occurrence. A per-actor transport or resolution failure is skipped, so the sweep is best-effort; a ValueError (an unknown source or a missing endpoint) propagates.

PARAMETER	DESCRIPTION
`actors`	The seed handles or DIDs to search across. TYPE: `collections.abc.Sequence of str`
`source`	One of `"auto"`, `"pds"`, or `"appview"`. TYPE: `str` DEFAULT: `'auto'`
`appview`	An appview base URL; enables the appview path under `auto`. TYPE: `str or None` DEFAULT: `None`
`filters`	Facet and text filters, applied per actor. TYPE: `DatasetFilter or None` DEFAULT: `None`
`resolver`	An injected identity resolver, shared across the seed. TYPE: `IdentityResolver or None` DEFAULT: `None`
`pds_client`	An injected PDS client. TYPE: `PdsClient or None` DEFAULT: `None`
`appview_client`	An injected appview client. TYPE: `AppviewClient or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple of DatasetSummary`	The merged, deduplicated summaries, in seed-then-corpus order.
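
The merge semantics (first occurrence wins, per-actor failures skipped, ValueError propagated) can be sketched with plain dictionaries. This is an illustration only: `merge_seed` and `list_one` are hypothetical names, and using `OSError` and `LookupError` as stand-ins for transport and resolution failures is an assumption, since the reference does not name the exception types.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence


def merge_seed(
    actors: Sequence[str],
    list_one: Callable[[str], list[dict]],
) -> tuple[dict, ...]:
    """Fan out over a seed and merge rows, deduplicated by their "uri" key."""
    seen: set[str] = set()
    merged: list[dict] = []
    for actor in actors:
        try:
            rows = list_one(actor)
        except (OSError, LookupError):
            # Stand-ins for transport/resolution failures: skip this actor.
            # A ValueError is deliberately not caught, so it propagates.
            continue
        for row in rows:
            if row["uri"] not in seen:
                seen.add(row["uri"])
                merged.append(row)  # first occurrence wins
    return tuple(merged)
```

Because the output preserves iteration order, the result comes back in seed-then-corpus order, matching the return description above.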

datasets_using_ontology ¶

datasets_using_ontology(
    ontology_uri: str,
    actors: Sequence[str],
    *,
    source: str = "auto",
    appview: str | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

Find datasets that use a given ontology, across a seed of actors.

The lexicons offer no ontology-to-corpus query, so this fans out over the seed (see discover_datasets) and keeps the corpora whose ontology_refs contain ontology_uri. Cross-repo reach is therefore bounded by the seed; the Tier 3 index resolves this generally.

PARAMETER	DESCRIPTION
`ontology_uri`	The ontology AT-URI to match against each corpus's `ontology_refs`. TYPE: `str`
`actors`	The seed handles or DIDs to search across. TYPE: `collections.abc.Sequence of str`
`source`	One of `"auto"`, `"pds"`, or `"appview"`. TYPE: `str` DEFAULT: `'auto'`
`appview`	An appview base URL. TYPE: `str or None` DEFAULT: `None`
`resolver`	An injected identity resolver. TYPE: `IdentityResolver or None` DEFAULT: `None`
`pds_client`	An injected PDS client. TYPE: `PdsClient or None` DEFAULT: `None`
`appview_client`	An injected appview client. TYPE: `AppviewClient or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple of DatasetSummary`	The datasets in the seed that reference the ontology.
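
The filtering step amounts to a membership test on each summary's ontology_refs. A minimal sketch, assuming a simplified stand-in for DatasetSummary (`MiniSummary` and `using_ontology` are hypothetical names, not part of the library):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MiniSummary:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    uri: str
    ontology_refs: tuple[str, ...] = ()


def using_ontology(
    summaries: tuple[MiniSummary, ...], ontology_uri: str
) -> tuple[MiniSummary, ...]:
    """Keep the summaries whose ontology_refs contain the given AT-URI."""
    return tuple(s for s in summaries if ontology_uri in s.ontology_refs)
```

The match is an exact string comparison on the AT-URI, so a corpus outside the seed, or one that references the ontology under a different URI, is not found.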

Link queries¶

Cross-repo queries anchored on a content reference (a corpus, an eprint) rather than a repository, so an appview that indexes the network can answer them across actors.

lairs.discovery.links ¶

Cross-repo, ref-anchored link queries.

These queries are anchored on a content reference (a corpus, an eprint) rather than a repository, so an appview that indexes the network answers them across every repo: who, anywhere, asserts membership in this corpus, or links this eprint to data. They require an appview endpoint or client, since a PDS can only answer for its own repository.

members_of_corpus ¶

members_of_corpus(
    corpus_uri: str,
    *,
    appview: str | None = None,
    appview_client: AppviewClient | None = None,
    split: str | None = None,
) -> tuple[Membership, ...]

List membership records that point at a corpus, across all repos.

PARAMETER	DESCRIPTION
`corpus_uri`	The corpus AT-URI to find members of. TYPE: `str`
`appview`	An appview base URL. TYPE: `str or None` DEFAULT: `None`
`appview_client`	An injected appview client. TYPE: `AppviewClient or None` DEFAULT: `None`
`split`	Restrict to a dataset split (for example `"train"`). TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple of lairs.records._generated.corpus.Membership`	The membership records asserted for the corpus.

RAISES	DESCRIPTION
`ValueError`	If no appview endpoint or client is available.

datasets_for_eprint ¶

datasets_for_eprint(
    eprint_uri: str,
    *,
    appview: str | None = None,
    appview_client: AppviewClient | None = None,
    data_kind: str | None = None,
) -> tuple[DataLink, ...]

List data links that point at an eprint, across all repos.

PARAMETER	DESCRIPTION
`eprint_uri`	The eprint AT-URI to find data links for. TYPE: `str`
`appview`	An appview base URL. TYPE: `str or None` DEFAULT: `None`
`appview_client`	An injected appview client. TYPE: `AppviewClient or None` DEFAULT: `None`
`data_kind`	Restrict to a data-kind slug. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple of lairs.records._generated.eprint.DataLink`	The data-link records that reference the eprint.

RAISES	DESCRIPTION
`ValueError`	If no appview endpoint or client is available.

The discovery index¶

A searchable index of dataset cards over a panproto Repository. The DiscoveryIndex is a thin behavioral wrapper around the repository, which remains the source of truth.

lairs.discovery.index ¶

The discovery index: dataset cards over a panproto Repository.

DiscoveryIndex is a thin behavioral wrapper around the panproto-backed lairs.store.repository.Repository. The Repository is the source of truth: it stores DatasetCard, SyncCursor, and RepoCrawlState records under the local lairs.index.* namespace, versioned and content-addressed. Re-saving an unchanged card is a no-op at commit time, so dedup and idempotent re-crawl are free, and repo.diff answers "what datasets changed between two snapshots".

CardDiff ¶

Bases: Model

Dataset cards added, removed, or changed between two index revisions.

The members are corpus AT-URIs when the card is resolvable in the current index, falling back to the card's own index URI otherwise (for example a removed card).

ATTRIBUTE	DESCRIPTION
`added`	Corpora whose card appeared between the revisions. TYPE: `tuple of str`
`removed`	Corpora whose card disappeared between the revisions. TYPE: `tuple of str`
`changed`	Corpora whose card content changed between the revisions. TYPE: `tuple of str`

DiscoveryIndex ¶

DiscoveryIndex(repo: Repository)

A panproto-backed store of dataset cards and crawl bookkeeping.

PARAMETER	DESCRIPTION
`repo`	The backing Repository that holds the index records. TYPE: `Repository`

repo `property` ¶

repo: Repository

Return the backing Repository.

RETURNS	DESCRIPTION
`Repository`	The Repository that holds the index records.

init `classmethod` ¶

init(path: Path) -> Self

Create a new index Repository at path.

PARAMETER	DESCRIPTION
`path`	The directory to create the index Repository in. TYPE: `Path`

RETURNS	DESCRIPTION
`DiscoveryIndex`	The new index.

open `classmethod` ¶

open(path: Path) -> Self

Open an existing index Repository at path.

PARAMETER	DESCRIPTION
`path`	The directory of an existing index Repository. TYPE: `Path`

RETURNS	DESCRIPTION
`DiscoveryIndex`	The opened index.

put_card ¶

put_card(card: DatasetCard) -> str

Stage a dataset card, keyed by its deterministic index URI.

PARAMETER	DESCRIPTION
`card`	The card to store. TYPE: `DatasetCard`

RETURNS	DESCRIPTION
`str`	The card's index AT-URI.

get_card ¶

get_card(corpus_uri: str) -> DatasetCard | None

Load the card for a corpus, or None when it is not indexed.

PARAMETER	DESCRIPTION
`corpus_uri`	The corpus AT-URI. TYPE: `str`

RETURNS	DESCRIPTION
`DatasetCard or None`	The stored card, or `None`.

remove_card ¶

remove_card(corpus_uri: str) -> bool

Remove a corpus's card from the index, returning whether one existed.

Stages the card's removal through the backing Repository so the card is absent from cards(), get_card(), and search, and a revision-to-revision diff_cards() reports it in removed once the removal is committed. Removing a card that is not indexed is a no-op.

PARAMETER	DESCRIPTION
`corpus_uri`	The corpus AT-URI whose card to remove. TYPE: `str`

RETURNS	DESCRIPTION
`bool`	`True` if a card was removed, `False` if none was indexed.

cards ¶

cards() -> list[DatasetCard]

Load every dataset card in the index.

RETURNS	DESCRIPTION
`list of DatasetCard`	All stored cards, in index key order.

card_pool ¶

card_pool() -> ModelPool

Load every card into a ModelPool keyed by its index URI.

RETURNS	DESCRIPTION
`ModelPool`	A pool of the index's cards, for cross-reference traversal.

get_cursor ¶

get_cursor(relay: str) -> SyncCursor | None

Load the firehose cursor for a relay, or None.

PARAMETER	DESCRIPTION
`relay`	The firehose endpoint. TYPE: `str`

RETURNS	DESCRIPTION
`SyncCursor or None`	The stored cursor, or `None`.

put_cursor ¶

put_cursor(cursor: SyncCursor) -> None

Stage a firehose cursor.

PARAMETER	DESCRIPTION
`cursor`	The cursor to store. TYPE: `SyncCursor`

get_crawl_state ¶

get_crawl_state(did: str) -> RepoCrawlState | None

Load the crawl state for a repository, or None.

PARAMETER	DESCRIPTION
`did`	The repository DID. TYPE: `str`

RETURNS	DESCRIPTION
`RepoCrawlState or None`	The stored crawl state, or `None`.

put_crawl_state ¶

put_crawl_state(state: RepoCrawlState) -> None

Stage a repository crawl state.

PARAMETER	DESCRIPTION
`state`	The crawl state to store. TYPE: `RepoCrawlState`

commit ¶

commit(message: str) -> str

Commit the staged index records.

PARAMETER	DESCRIPTION
`message`	The commit message. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The new commit revision.

tag ¶

tag(name: str, *, revision: str | None = None) -> None

Tag an index revision (the head by default).

PARAMETER	DESCRIPTION
`name`	The tag name. TYPE: `str`
`revision`	The revision to tag; defaults to the head. TYPE: `str or None` DEFAULT: `None`

head ¶

head() -> str | None

Return the head commit revision, or None when empty.

RETURNS	DESCRIPTION
`str or None`	The head revision.

log ¶

log() -> list[dict[str, JsonValue]]

Return the commit log, newest first.

RETURNS	DESCRIPTION
`list of dict`	The commit log entries.

diff_cards ¶

diff_cards(base: str, head: str) -> CardDiff

Diff dataset cards between two index revisions.

PARAMETER	DESCRIPTION
`base`	The base revision. TYPE: `str`
`head`	The head revision. TYPE: `str`

RETURNS	DESCRIPTION
`CardDiff`	The added, removed, and changed corpora between the revisions.
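
The added/removed/changed split can be illustrated over two snapshots modeled as plain mappings. This is a sketch of the set arithmetic only, not of how the panproto Repository computes a diff; `diff_snapshots` is a hypothetical helper, and representing a revision as a dict from corpus URI to a content hash is an assumption.

```python
from __future__ import annotations


def diff_snapshots(
    base: dict[str, str], head: dict[str, str]
) -> dict[str, tuple[str, ...]]:
    """Compare two {corpus_uri: content_hash} snapshots."""
    added = tuple(sorted(head.keys() - base.keys()))
    removed = tuple(sorted(base.keys() - head.keys()))
    # Present in both revisions, but the content hash differs.
    changed = tuple(
        sorted(uri for uri in base.keys() & head.keys() if base[uri] != head[uri])
    )
    return {"added": added, "removed": removed, "changed": changed}
```

An unchanged card has the same hash in both snapshots and so appears in none of the three groups.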

Index ingest¶

The backfill crawl and firehose tail that populate the index. Both pipelines write through the panproto Repository, recording one dataset card per discovered corpus along with the resumable cursor and per-repo crawl state.

lairs.discovery.ingest ¶

Index ingest pipelines: backfill crawl and firehose tail.

Both pipelines write through the panproto Repository: every discovered corpus becomes a DatasetCard, and the resumable cursor and per-repo crawl state are records too, so progress is versioned. Cards are only re-written when their content changes, preserving content-addressed dedup; bounds and skips are always logged in the returned CrawlReport.

RepoDescriber ¶

Bases: Protocol

A source of repository descriptions (satisfied by PdsClient).

describe_repo ¶

describe_repo(repo: str) -> RepoDescription

Return the description of a repository.

CorpusLister ¶

Bases: Protocol

A source of a repository's records (satisfied by PdsClient).

list_records ¶

list_records(
    repo: str, collection: str
) -> Iterator[RecordEnvelope]

Enumerate a repository's records in a collection.

build_index ¶

build_index(
    index: DiscoveryIndex,
    dids: Iterable[str],
    *,
    describe: RepoDescriber,
    list_corpora: CorpusLister,
    endpoint: str,
    max_repos: int | None = None,
    message: str = "backfill crawl",
) -> CrawlReport

Crawl repositories on one endpoint and index their corpora.

For each DID, describe_repo reveals whether the repo holds the corpus collection; if so, its corpora are listed and turned into cards. Per-repo crawl state is recorded so a re-run resumes, and a max_repos bound is logged rather than silently applied.

PARAMETER	DESCRIPTION
`index`	The index to write into. TYPE: `DiscoveryIndex`
`dids`	The repository DIDs to crawl (for example `PdsClient.list_repos()`). TYPE: `collections.abc.Iterable of str`
`describe`	A repository-description source bound to `endpoint`. TYPE: `RepoDescriber`
`list_corpora`	A record-listing source bound to `endpoint`. TYPE: `CorpusLister`
`endpoint`	The PDS or relay endpoint the repos are read from. TYPE: `str`
`max_repos`	A bound on repositories visited; hitting it is logged. TYPE: `int or None` DEFAULT: `None`
`message`	The commit message for the crawl snapshot. TYPE: `str` DEFAULT: `'backfill crawl'`

RETURNS	DESCRIPTION
`CrawlReport`	Counts and skip reasons for the crawl.
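
The resume and bound behavior can be sketched as a loop over DIDs. This is a simplified illustration, not the library's crawl: `crawl_sketch` is hypothetical, the `finished` set stands in for stored RepoCrawlState records, and logging an already-crawled repo as a skip is an assumption.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable


def crawl_sketch(
    dids: Iterable[str],
    has_corpus: Callable[[str], bool],
    finished: set[str],
    max_repos: int | None = None,
) -> dict:
    """Visit repos once, honoring a logged max_repos bound."""
    repos_seen = 0
    repos_with_corpora = 0
    skipped: list[str] = []
    for did in dids:
        if did in finished:
            skipped.append(f"{did}: already crawled")  # resume: skip finished repos
            continue
        if max_repos is not None and repos_seen >= max_repos:
            skipped.append(f"max_repos={max_repos} reached; stopping")
            break  # the bound is recorded, never applied silently
        repos_seen += 1
        if has_corpus(did):
            repos_with_corpora += 1
        finished.add(did)  # stand-in for writing per-repo crawl state
    return {
        "repos_seen": repos_seen,
        "repos_with_corpora": repos_with_corpora,
        "skipped": tuple(skipped),
    }
```

Running the sketch twice with the same `finished` set visits nothing new on the second pass, which is the property that makes a re-run cheap.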

update_index ¶

update_index(
    index: DiscoveryIndex,
    relay: str,
    *,
    source_endpoint: str | None = None,
    limit: int | None = None,
    commit_every: int = _DEFAULT_COMMIT_EVERY,
) -> CrawlReport

Tail a relay's firehose, refreshing cards for corpus commits.

Resumes from the stored SyncCursor for the relay, indexes each corpus create or update, removes the card for each corpus delete (so the local index does not drift stale), and checkpoints the cursor and commits every commit_every events. limit bounds the events processed, which makes a live tail testable. A delete of a corpus that is not indexed is logged in skipped rather than removed; a removed card is reported in CrawlReport.cards_removed.

PARAMETER	DESCRIPTION
`index`	The index to write into. TYPE: `DiscoveryIndex`
`relay`	The firehose endpoint. TYPE: `str`
`source_endpoint`	The endpoint recorded as each card's provenance; defaults to `relay`. TYPE: `str or None` DEFAULT: `None`
`limit`	A bound on events processed; `None` tails indefinitely. TYPE: `int or None` DEFAULT: `None`
`commit_every`	How many events to process between cursor checkpoints. TYPE: `int` DEFAULT: `_DEFAULT_COMMIT_EVERY`

RETURNS	DESCRIPTION
`CrawlReport`	Counts and skip reasons for the firehose pass.
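
The resume-and-checkpoint cadence can be sketched over a sequence of event numbers. This is an illustration under stated assumptions, not the library's tail loop: `tail_sketch` is hypothetical, events are reduced to their sequence numbers, and the trailing checkpoint written after a partial batch is an assumption.

```python
from __future__ import annotations

from collections.abc import Iterable


def tail_sketch(
    events: Iterable[int],
    start_seq: int,
    commit_every: int,
    limit: int | None = None,
) -> list[int]:
    """Return the sequence numbers at which a cursor checkpoint is written."""
    checkpoints: list[int] = []
    processed = 0
    last_seq = start_seq
    for seq in events:
        if seq <= start_seq:
            continue  # already handled before the stored cursor
        if limit is not None and processed >= limit:
            break  # bounded pass, useful for tests
        last_seq = seq
        processed += 1
        if processed % commit_every == 0:
            checkpoints.append(seq)
    if processed > 0 and processed % commit_every != 0:
        checkpoints.append(last_seq)  # assumption: flush the partial batch
    return checkpoints
```

A small `commit_every` loses less work on interruption at the cost of more commits; a `limit` turns the otherwise unbounded tail into a finite pass.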

Search¶

In-memory search over the discovery index: the primary, dependency-free query path that loads dataset cards, filters them with plain predicates, and ranks the matches.

lairs.discovery.query ¶

In-memory search over the discovery index.

The primary, dependency-free query path: load DatasetCard records and filter them with plain predicates, then rank. Discovery is at dataset scale (thousands of corpora), so a linear scan is fast and an index server is unwarranted; the optional DuckDB accelerator (see accelerator) is only for larger scans.

SearchQuery ¶

Bases: Model

A structured, serializable query over dataset cards.

ATTRIBUTE	DESCRIPTION
`text`	A case-insensitive substring matched against name and description. TYPE: `str or None`
`domain`	A domain slug to match. TYPE: `str or None`
`language`	A language tag matched against the primary or listed languages. TYPE: `str or None`
`license`	A license identifier to match. TYPE: `str or None`
`min_expressions`	Keep cards with at least this many expressions. TYPE: `int or None`
`max_expressions`	Keep cards with at most this many expressions. TYPE: `int or None`
`annotation_metric`	Keep cards declaring this quality metric. TYPE: `str or None`
`min_annotation_rounds`	Keep cards declaring at least this many annotation rounds. TYPE: `int or None`

SearchHit ¶

Bases: Model

A matched dataset card with its ranking score.

ATTRIBUTE	DESCRIPTION
`card`	The matched card. TYPE: `DatasetCard`
`score`	The ranking score; higher ranks first. TYPE: `float`

search ¶

search(
    cards: Iterable[DatasetCard], query: SearchQuery
) -> list[SearchHit]

Filter and rank dataset cards against a query.

PARAMETER	DESCRIPTION
`cards`	The cards to search (for example `DiscoveryIndex.cards()`). TYPE: `collections.abc.Iterable of DatasetCard`
`query`	The query to apply. TYPE: `SearchQuery`

RETURNS	DESCRIPTION
`list of SearchHit`	The matching cards, ranked by score then name.
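
A linear filter-then-rank pass of this kind can be sketched as follows. The scoring weights are an assumption made for illustration (the reference does not specify how the score is computed), and `MiniCard` and `search_sketch` are hypothetical stand-ins for DatasetCard and search.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MiniCard:
    """Hypothetical stand-in with only the fields this sketch filters on."""

    name: str
    description: str = ""
    language: str | None = None


def search_sketch(
    cards: list[MiniCard], text: str | None = None, language: str | None = None
) -> list[tuple[float, MiniCard]]:
    """Filter with plain predicates, then rank by score and name."""
    hits: list[tuple[float, MiniCard]] = []
    for card in cards:
        if language is not None and card.language != language:
            continue
        score = 0.0
        if text is not None:
            needle = text.lower()  # case-insensitive substring match
            in_name = needle in card.name.lower()
            in_desc = needle in card.description.lower()
            if not (in_name or in_desc):
                continue
            # Assumed weighting: a name match counts more than a description match.
            score = 2.0 * in_name + 1.0 * in_desc
        hits.append((score, card))
    # Higher score first; ties broken by name.
    hits.sort(key=lambda hit: (-hit[0], hit[1].name))
    return hits
```

At thousands of cards, a single pass like this touches each card once, which is why no index server is needed for the primary path.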

Query accelerator¶

A rebuildable DuckDB pre-filter over the index. Cards are materialized to Parquet and pre-filtered with SQL, then the matching cards are loaded from the index and ranked by the in-memory scorer, so the result is identical to the plain search.

lairs.discovery.accelerator ¶

DuckDB query accelerator for the discovery index.

A rebuildable, derived view over the panproto index: cards are materialized to Parquet and pre-filtered with DuckDB SQL, then the matching cards are loaded from the index (the source of truth) and ranked by the in-memory scorer, so the result is identical to query.search. The DuckDB pre-filter is a relaxation of the full predicate (it never excludes a true match), and the final search pass applies the exact predicate and ranking. The Parquet is never authoritative and can be rebuilt from the index at any time.

This module imports DuckDB and pyarrow at module top; it is reached explicitly as lairs.discovery.accelerator so that importing lairs does not pull DuckDB into every process.

materialize_cards ¶

materialize_cards(
    index: DiscoveryIndex, out_dir: Path
) -> Path

Write the index's cards to a Parquet view, returning its path.

The view is derived and rebuildable: it is regenerated from index.cards() on each call and is never the source of truth.

PARAMETER	DESCRIPTION
`index`	The index whose cards to materialize. TYPE: `DiscoveryIndex`
`out_dir`	The directory to write the Parquet view into. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	The path of the written Parquet file.

search_accelerated ¶

search_accelerated(
    index: DiscoveryIndex,
    query: SearchQuery,
    *,
    out_dir: Path,
) -> list[SearchHit]

Search the index through the DuckDB-accelerated pre-filter.

Materializes the card view, narrows it with a DuckDB SQL pre-filter, loads the surviving cards from the index, and ranks them with the same scorer as lairs.discovery.query.search, so the result is identical to an in-memory search over every card.

PARAMETER	DESCRIPTION
`index`	The index to search. TYPE: `DiscoveryIndex`
`query`	The query to apply. TYPE: `SearchQuery`
`out_dir`	The directory for the rebuildable Parquet view. TYPE: `Path`

RETURNS	DESCRIPTION
`list of SearchHit`	The matching cards, ranked identically to the in-memory search.
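
Why a relaxed pre-filter cannot change the result can be shown without DuckDB. In this sketch a list comprehension stands in for the SQL pre-filter, and all names (`exact_match`, `relaxed_match`, `search_plain`, `search_prefiltered`) are hypothetical; cards are plain dicts.

```python
from __future__ import annotations


def exact_match(card: dict, query: dict) -> bool:
    """The full predicate: language facet and text substring."""
    lang, text = query.get("language"), query.get("text")
    if lang is not None and lang not in card["languages"]:
        return False
    return text is None or text.lower() in card["name"].lower()


def relaxed_match(card: dict, query: dict) -> bool:
    """A relaxation: checks only the language facet, so it never drops a true match."""
    lang = query.get("language")
    return lang is None or lang in card["languages"]


def search_plain(cards: list[dict], query: dict) -> list[dict]:
    return sorted((c for c in cards if exact_match(c, query)), key=lambda c: c["name"])


def search_prefiltered(cards: list[dict], query: dict) -> list[dict]:
    survivors = [c for c in cards if relaxed_match(c, query)]  # stand-in for SQL
    return search_plain(survivors, query)  # exact predicate and ranking still run
```

Every card that passes `exact_match` also passes `relaxed_match`, so the survivors are a superset of the true matches and the final exact pass yields the same list either way.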

Cards¶

The index record models and the corpus-to-card builder: the DatasetCard stored per discovered corpus and the crawl report that summarizes an ingest run.

lairs.discovery.cards ¶

Index record models and the corpus-to-card builder.

The discovery index stores these dx.Model records in a panproto Repository: a DatasetCard per discovered corpus (a denormalized, searchable summary with provenance and freshness), a SyncCursor per firehose relay, and a RepoCrawlState per crawled repository. These are client-side index bookkeeping under a local lairs.index.* namespace, never published Layers records and never code-generated.

INDEX_DID `module-attribute` ¶

INDEX_DID = 'did:lairs:index'

The sentinel authority for the local discovery index's records.

CARD_NSID `module-attribute` ¶

CARD_NSID = 'lairs.index.datasetCard'

The collection NSID for dataset cards in the local index.

CURSOR_NSID `module-attribute` ¶

CURSOR_NSID = 'lairs.index.syncCursor'

The collection NSID for firehose sync cursors in the local index.

CRAWL_STATE_NSID `module-attribute` ¶

CRAWL_STATE_NSID = 'lairs.index.repoCrawlState'

The collection NSID for per-repo crawl state in the local index.

CardProvenance ¶

Bases: Model

Where a dataset card came from, for trust and refresh.

ATTRIBUTE	DESCRIPTION
`source_did`	The corpus author's DID. TYPE: `str`
`source_endpoint`	The PDS or appview base URL the card was read from. TYPE: `str`
`discovered_via`	How the card entered the index (`"firehose"`, `"crawl"`, `"seed"`). TYPE: `str`
`source_handle`	The author's handle at discovery time, when known. TYPE: `str or None`

CardFreshness ¶

Bases: Model

Firehose and crawl bookkeeping so freshness and resume are first-class.

ATTRIBUTE	DESCRIPTION
`first_seen_at`	When this corpus first entered the index. TYPE: `datetime`
`last_updated_at`	When the card content last changed. TYPE: `datetime`
`last_seen_seq`	The last firehose sequence number that touched the corpus. TYPE: `int or None`
`last_seen_rev`	The last repository commit revision observed for the corpus. TYPE: `str or None`
`record_cid`	The CID of the corpus record at the last refresh. TYPE: `str or None`

DatasetCard ¶

Bases: Model

A searchable, denormalized index entry for one corpus.

ATTRIBUTE	DESCRIPTION
`summary`	The corpus-derived listing projection. TYPE: `DatasetSummary`
`provenance`	Where the card came from. TYPE: `CardProvenance`
`freshness`	First-seen and last-updated bookkeeping. TYPE: `CardFreshness`
`annotation_rounds`	The number of annotation rounds declared, when present. TYPE: `int or None`
`adjudication_method`	The adjudication method slug, when present. TYPE: `str or None`
`redundancy_count`	The declared annotator redundancy, when present. TYPE: `int or None`
`quality_metrics`	The quality-criterion metric slugs declared for the corpus. TYPE: `tuple of str`

SyncCursor ¶

Bases: Model

A resumable firehose position for one relay.

ATTRIBUTE	DESCRIPTION
`relay`	The firehose endpoint this cursor is for. TYPE: `str`
`seq`	The last fully-processed firehose sequence number. TYPE: `int`
`updated_at`	When the cursor was last written. TYPE: `datetime`

RepoCrawlState ¶

Bases: Model

Per-repository crawl bookkeeping so a re-run skips finished repos.

ATTRIBUTE	DESCRIPTION
`did`	The crawled repository DID. TYPE: `str`
`endpoint`	The PDS endpoint the repo was crawled from. TYPE: `str`
`has_layers_corpus`	Whether the repo carried the corpus collection. TYPE: `bool`
`corpora_found`	The number of corpora indexed from the repo. TYPE: `int`
`last_crawled_at`	When the repo was last crawled. TYPE: `datetime`
`repos_cursor`	A `listRepos` pagination checkpoint, when crawling a relay. TYPE: `str or None`

CrawlReport ¶

Bases: Model

A summary of a crawl or firehose pass, logging every skip.

ATTRIBUTE	DESCRIPTION
`repos_seen`	The number of repositories visited. TYPE: `int`
`repos_with_corpora`	The number of repositories that held a corpus collection. TYPE: `int`
`cards_built`	The number of cards built or refreshed. TYPE: `int`
`cards_unchanged`	The number of cards that were already current (dedup hits). TYPE: `int`
`cards_removed`	The number of cards removed in response to a corpus-deletion commit. TYPE: `int`
`skipped`	Human-readable skip reasons, including any bound that was hit. TYPE: `tuple of str`
`revision`	The commit revision the pass produced, when any. TYPE: `str or None`

card_uri ¶

card_uri(corpus_uri: str) -> str

Build the deterministic index AT-URI for a corpus's card.

The same corpus always maps to the same card key, so re-indexing is idempotent and content-addressed dedup falls out for free.

PARAMETER	DESCRIPTION
`corpus_uri`	The corpus AT-URI. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The card's index AT-URI under the local `lairs.index.*` namespace.
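
One way to obtain a deterministic key is to derive the record key from a hash of the corpus URI. The sketch below is an assumption about the scheme, not the library's actual key derivation: `sketch_card_uri` is hypothetical, and the SHA-256 prefix is chosen only for illustration. The two constants mirror the documented INDEX_DID and CARD_NSID values.

```python
import hashlib

INDEX_DID = "did:lairs:index"
CARD_NSID = "lairs.index.datasetCard"


def sketch_card_uri(corpus_uri: str) -> str:
    """Map a corpus AT-URI to a stable card AT-URI (illustrative scheme)."""
    # Assumption: the record key is a truncated SHA-256 of the corpus URI.
    rkey = hashlib.sha256(corpus_uri.encode("utf-8")).hexdigest()[:24]
    return f"at://{INDEX_DID}/{CARD_NSID}/{rkey}"
```

Because the key depends only on the corpus URI, storing the same corpus twice targets the same record, which is what makes re-indexing idempotent.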

card_from_corpus ¶

card_from_corpus(
    corpus_uri: str,
    corpus: Corpus,
    *,
    provenance: CardProvenance,
    freshness: CardFreshness,
) -> DatasetCard

Build a DatasetCard from a discovered corpus and its provenance.

PARAMETER	DESCRIPTION
`corpus_uri`	The corpus AT-URI. TYPE: `str`
`corpus`	The discovered corpus record. TYPE: `Corpus`
`provenance`	Where the corpus was discovered. TYPE: `CardProvenance`
`freshness`	First-seen and last-updated bookkeeping for the card. TYPE: `CardFreshness`

RETURNS	DESCRIPTION
`DatasetCard`	The denormalized index card.

Result models¶

The value types for discovery results: a denormalized corpus summary, a repository table of contents, a collection count, and a facet filter.

lairs.discovery.models ¶

Discovery result models.

dx.Model value types for dataset discovery: a denormalized corpus summary, a repository table of contents, and a facet filter. These are the shapes the discovery API returns and the CLI renders. DatasetSummary is also reused as the corpus-derived core of the Tier 3 index card.

DatasetSummary ¶

Bases: Model

A denormalized corpus card for discovery listings.

A flat, readable projection of a pub.layers.corpus.corpus record plus the actor and source it was found through, so a listing renders one row per dataset without dumping records.

ATTRIBUTE	DESCRIPTION
`uri`	The corpus AT-URI. TYPE: `str`
`did`	The owning repository DID. TYPE: `str`
`name`	The corpus name. TYPE: `str`
`handle`	The owning handle, when it was resolved. TYPE: `str or None`
`description`	The corpus description. TYPE: `str or None`
`domain`	The corpus domain slug. TYPE: `str or None`
`domain_uri`	The AT-URI of the corpus domain definition node. TYPE: `str or None`
`language`	The primary BCP-47 language tag. TYPE: `str or None`
`languages`	All languages represented in the corpus. TYPE: `tuple of str`
`license`	The license identifier. TYPE: `str or None`
`version`	The corpus version label. TYPE: `str or None`
`expression_count`	The number of expressions in the corpus. TYPE: `int or None`
`created_at`	The ISO 8601 creation timestamp. TYPE: `str or None`
`ontology_refs`	The ontology AT-URIs the corpus uses. TYPE: `tuple of str`
`eprint_refs`	The eprint AT-URIs the corpus links. TYPE: `tuple of str`
`has_adjudication`	Whether the corpus declares an adjudication step. TYPE: `bool`
`source_endpoint`	The PDS or appview the summary was read from. TYPE: `str or None`

CollectionCount ¶

Bases: Model

A repository collection NSID with an optional record count.

ATTRIBUTE	DESCRIPTION
`nsid`	The collection NSID. TYPE: `str`
`count`	The number of records in the collection, when counted. TYPE: `int or None`
`is_dataset_like`	Whether the collection holds dataset-shaped records. TYPE: `bool`

RepoTableOfContents ¶

Bases: Model

An actor's repository inventory: identity plus per-collection counts.

ATTRIBUTE	DESCRIPTION
`did`	The repository DID. TYPE: `str`
`handle`	The repository handle, when known. TYPE: `str or None`
`pds_endpoint`	The PDS endpoint the inventory was read from. TYPE: `str or None`
`collections`	The collections present in the repository. TYPE: `tuple of CollectionCount`
`dataset_collections`	The dataset-like collection NSIDs, highlighted for convenience. TYPE: `tuple of str`

DatasetFilter ¶

Bases: Model

A facet and text filter over dataset summaries.

Server-side facets (language, domain) are pushed into listCorpora parameters on the appview path; the rest are applied client-side over the mapped summaries.

ATTRIBUTE	DESCRIPTION
`language`	Keep corpora whose primary or listed languages include this tag. TYPE: `str or None`
`domain`	Keep corpora with this domain slug. TYPE: `str or None`
`license`	Keep corpora with this license identifier. TYPE: `str or None`
`min_expression_count`	Keep corpora with at least this many expressions. TYPE: `int or None`
`max_expression_count`	Keep corpora with at most this many expressions. TYPE: `int or None`
`text`	Keep corpora whose name or description contains this substring. TYPE: `str or None`
`has_adjudication`	Keep corpora that do (or do not) declare an adjudication step. TYPE: `bool or None`
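
The split between server-side and client-side facets can be sketched with dictionaries. This is an illustration, not the library's code: `split_facets` and `matches_client` are hypothetical helpers, filters and summaries are plain dicts, and the case-sensitive text match and the exclusion of corpora with an unknown expression count are assumptions.

```python
from __future__ import annotations

SERVER_FACETS = ("language", "domain")  # pushed into listCorpora on the appview path


def split_facets(filters: dict) -> tuple[dict, dict]:
    """Separate the set facets into (server-side, client-side) groups."""
    active = {k: v for k, v in filters.items() if v is not None}
    server = {k: v for k, v in active.items() if k in SERVER_FACETS}
    client = {k: v for k, v in active.items() if k not in SERVER_FACETS}
    return server, client


def matches_client(summary: dict, client: dict) -> bool:
    """Apply the remaining facets over an already-mapped summary."""
    if "license" in client and summary.get("license") != client["license"]:
        return False
    if "min_expression_count" in client:
        # Assumption: an unknown count does not satisfy a positive lower bound.
        if (summary.get("expression_count") or 0) < client["min_expression_count"]:
            return False
    if "text" in client:
        name = summary.get("name") or ""
        description = summary.get("description") or ""
        if client["text"] not in name and client["text"] not in description:
            return False
    return True
```

On the PDS path there is no server to push facets to, so a caller following this sketch would apply both groups client-side.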

Summaries¶

The corpus-to-summary projection and dataset filtering: projects a generated Corpus record into the flat summary shape, evaluates a filter over a summary, and extracts the server-side facets the appview supports.

lairs.discovery.summary ¶

Corpus-to-summary mapping and dataset filtering.

Projects a generated Corpus record (or a record envelope carrying one) into the flat DatasetSummary discovery shape, evaluates a DatasetFilter over a summary, and extracts the server-side facets listCorpora supports.

summary_from_corpus ¶

summary_from_corpus(
    corpus: Corpus,
    *,
    uri: str,
    did: str,
    handle: str | None = None,
    source_endpoint: str | None = None,
) -> DatasetSummary

Project a corpus record into a DatasetSummary.

PARAMETER	DESCRIPTION
`corpus`	The corpus record to project. TYPE: `Corpus`
`uri`	The corpus AT-URI. TYPE: `str`
`did`	The owning repository DID. TYPE: `str`
`handle`	The owning handle, when known. TYPE: `str or None` DEFAULT: `None`
`source_endpoint`	The PDS or appview the corpus was read from. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DatasetSummary`	The flat discovery summary.

corpus_from_value ¶

corpus_from_value(value: JsonValue) -> Corpus | None

Decode a record value into a Corpus, or None on failure.

The wire-only $type discriminator is dropped before validation, since the generated models do not declare it.

PARAMETER	DESCRIPTION
`value`	The record value to decode. TYPE: `JsonValue`

RETURNS	DESCRIPTION
`Corpus or None`	The decoded corpus, or `None` when the value is not a decodable corpus.
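
The decode-or-`None` behaviour described above can be sketched without the generated models. Here `MiniCorpus` is a hypothetical two-field stand-in for `Corpus`, and `decode_corpus_sketch` is an illustrative name; the real validation is presumably stricter.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MiniCorpus:
    """Hypothetical stand-in for the generated Corpus model."""
    name: str
    description: Optional[str] = None


def decode_corpus_sketch(value: Any) -> Optional[MiniCorpus]:
    """Decode a record value, returning None instead of raising."""
    if not isinstance(value, dict):
        return None
    # Drop the wire-only discriminator, since the model does not declare it.
    fields = {k: v for k, v in value.items() if k != "$type"}
    try:
        corpus = MiniCorpus(**fields)
    except TypeError:
        # Missing required fields or unexpected extra fields.
        return None
    if not isinstance(corpus.name, str):
        return None
    return corpus


ok = decode_corpus_sketch({"$type": "org.example.corpus", "name": "Demo"})
bad = decode_corpus_sketch({"$type": "org.example.corpus"})
print(ok, bad)
```

Returning `None` rather than raising keeps the caller's loop simple: a listing can skip undecodable values without a `try` block around every record.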

summary_from_envelope ¶

summary_from_envelope(
    envelope: RecordEnvelope,
    *,
    did: str | None = None,
    handle: str | None = None,
    source_endpoint: str | None = None,
) -> DatasetSummary | None

Decode a corpus envelope into a DatasetSummary.

Returns None when the envelope is not a corpus record or its value does not validate, so a single bad record does not abort a listing.

PARAMETER	DESCRIPTION
`envelope`	The record envelope to decode. TYPE: `RecordEnvelope`
`did`	The owning DID; derived from the envelope URI when omitted. TYPE: `str or None` DEFAULT: `None`
`handle`	The owning handle, when known. TYPE: `str or None` DEFAULT: `None`
`source_endpoint`	The PDS or appview the envelope was read from. TYPE: `str or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DatasetSummary or None`	The summary, or `None` when the record is not a decodable corpus.
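
A minimal sketch of the two behaviours noted above: deriving the owning DID from the envelope's AT-URI when `did` is omitted, and skipping undecodable records instead of aborting. `Envelope`, `Summary`, and the helper names are hypothetical stand-ins. The check for a string `name` field is a placeholder for real corpus validation.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Envelope:
    """Hypothetical stand-in for RecordEnvelope: an AT-URI plus a value."""
    uri: str
    value: Any


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in for DatasetSummary, reduced to three fields."""
    uri: str
    did: Optional[str]
    name: str


def did_from_at_uri(uri: str) -> Optional[str]:
    """Return the authority of an at:// URI when it is a DID."""
    prefix = "at://"
    if not uri.startswith(prefix):
        return None
    authority = uri[len(prefix):].split("/", 1)[0]
    return authority if authority.startswith("did:") else None


def summarize_sketch(env: Envelope, did: Optional[str] = None) -> Optional[Summary]:
    value = env.value
    # Placeholder validation: a real decoder would validate the full record.
    if not isinstance(value, dict) or not isinstance(value.get("name"), str):
        return None
    return Summary(uri=env.uri, did=did or did_from_at_uri(env.uri), name=value["name"])


def summarize_all(envelopes: Iterable[Envelope]) -> Tuple[Summary, ...]:
    # A bad record yields None and is dropped; the listing continues.
    decoded = (summarize_sketch(env) for env in envelopes)
    return tuple(s for s in decoded if s is not None)


rows = summarize_all([
    Envelope("at://did:plc:abc123/org.example.corpus/one", {"name": "One"}),
    Envelope("at://did:plc:abc123/org.example.corpus/two", {"broken": True}),
])
print(rows)
```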

matches ¶

matches(
    summary: DatasetSummary, flt: DatasetFilter | None
) -> bool

Return whether a summary satisfies a filter.

PARAMETER	DESCRIPTION
`summary`	The summary to test. TYPE: `DatasetSummary`
`flt`	The filter; `None` matches everything. TYPE: `DatasetFilter or None`

RETURNS	DESCRIPTION
`bool`	`True` when the summary passes every set facet.
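
One way to read "passes every set facet" is as a conjunction over the facets that are not `None`. The sketch below illustrates that reading with hypothetical stand-in classes and field names. Two choices are assumptions: a summary with no known expression count fails any count bound, and the text match is case-sensitive.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in for DatasetSummary."""
    name: str
    description: Optional[str] = None
    languages: Tuple[str, ...] = ()
    domain: Optional[str] = None
    license: Optional[str] = None
    expression_count: Optional[int] = None
    has_adjudication: bool = False


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for DatasetFilter; None means 'facet not set'."""
    language: Optional[str] = None
    domain: Optional[str] = None
    license: Optional[str] = None
    min_expression_count: Optional[int] = None
    max_expression_count: Optional[int] = None
    text: Optional[str] = None
    has_adjudication: Optional[bool] = None


def matches_sketch(summary: Summary, flt: Optional[Filter]) -> bool:
    if flt is None:
        return True  # no filter matches everything
    count = summary.expression_count
    checks = []
    if flt.language is not None:
        checks.append(flt.language in summary.languages)
    if flt.domain is not None:
        checks.append(summary.domain == flt.domain)
    if flt.license is not None:
        checks.append(summary.license == flt.license)
    if flt.min_expression_count is not None:
        checks.append(count is not None and count >= flt.min_expression_count)
    if flt.max_expression_count is not None:
        checks.append(count is not None and count <= flt.max_expression_count)
    if flt.text is not None:
        checks.append(flt.text in summary.name or flt.text in (summary.description or ""))
    if flt.has_adjudication is not None:
        checks.append(summary.has_adjudication == flt.has_adjudication)
    return all(checks)  # all([]) is True, so an empty filter also matches


demo = Summary(name="Demo corpus", languages=("en", "de"), domain="news", expression_count=120)
print(matches_sketch(demo, Filter(language="en", min_expression_count=100)))
```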

listcorpora_params ¶

listcorpora_params(
    repo: str, flt: DatasetFilter | None
) -> QueryParams

Build listCorpora query parameters, pushing the server-side facets.

Only language and domain are server-side facets on listCorpora; every other facet is applied client-side over the mapped summaries.

PARAMETER	DESCRIPTION
`repo`	The repository DID or handle to list. TYPE: `str`
`flt`	The filter whose server-side facets to push. TYPE: `DatasetFilter or None`

RETURNS	DESCRIPTION
`QueryParams`	The query parameters for `corpus.listCorpora`.
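
The split between server-side and client-side facets can be illustrated as follows. This is a sketch under stated assumptions: the query keys `repo`, `language`, and `domain` are guesses at the parameter names, a plain `dict` stands in for `QueryParams`, and `Filter` is a hypothetical reduced stand-in for `DatasetFilter`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for DatasetFilter with three of its facets."""
    language: Optional[str] = None
    domain: Optional[str] = None
    license: Optional[str] = None  # client-side only, never pushed


def listcorpora_params_sketch(repo: str, flt: Optional[Filter]) -> Dict[str, str]:
    # Assumed key names; the real query parameter names may differ.
    params = {"repo": repo}
    if flt is not None:
        if flt.language is not None:
            params["language"] = flt.language
        if flt.domain is not None:
            params["domain"] = flt.domain
    return params


params = listcorpora_params_sketch(
    "did:plc:abc123",  # placeholder DID
    Filter(language="en", license="CC-BY-4.0"),
)
print(params)
```

The `license` facet does not appear in the parameters; under this split it would be checked afterwards, client-side, against each mapped summary.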
