Dataset discovery

Discovery composes identity resolution, the PDS and appview clients, and the panproto store into a discovery surface over the Layers network. It spans three tiers: (1) listing a single actor's datasets and repository table of contents; (2) fanning out over a seed of actors and resolving cross-repo, ref-anchored link queries; and (3) building a local, searchable index from the firehose and a backfill crawl. The discovery shapes reuse the record envelopes used to read a PDS; see the ATProto reference.

Single-actor discovery

List an actor's datasets and repository table of contents. Resolves a handle or DID through the identity resolver and lists the actor's corpora as summary rows, preferring an appview when one is available.

lairs.discovery.actor

Single-actor discovery: list an actor's datasets and repo table of contents.

Resolves a handle or DID through IdentityResolver and lists the actor's corpora as DatasetSummary rows, preferring an appview when one is available and falling back to direct PDS enumeration. table_of_contents reads a repo's collection inventory through describe_repo without dumping records.

list_datasets

list_datasets(
    actor: str,
    *,
    source: str = "auto",
    appview: str | None = None,
    filters: DatasetFilter | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

List an actor's datasets as summaries.

Resolves actor (handle or DID), lists its corpora through an appview when available (server-side language/domain facets) or direct PDS enumeration otherwise, maps each to a DatasetSummary, and applies the remaining facets client-side.

PARAMETER DESCRIPTION
actor

A handle or DID to list datasets for.

TYPE: str

source

One of "auto", "pds", or "appview".

TYPE: str DEFAULT: 'auto'

appview

An appview base URL; enables the appview path under auto.

TYPE: str or None DEFAULT: None

filters

Facet and text filters.

TYPE: DatasetFilter or None DEFAULT: None

resolver

An injected identity resolver.

TYPE: IdentityResolver or None DEFAULT: None

pds_client

An injected PDS client.

TYPE: PdsClient or None DEFAULT: None

appview_client

An injected appview client.

TYPE: AppviewClient or None DEFAULT: None

RETURNS DESCRIPTION
tuple of DatasetSummary

The matching dataset summaries, in source order.

RAISES DESCRIPTION
ValueError

If source is unknown, or a required endpoint or client is missing.
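
The source selection described above can be sketched as a small decision function. This is an illustrative stand-in, not the library's code: choose_path and its has_appview and has_pds flags are hypothetical names that stand for "an appview URL or client was given" and "a PDS endpoint or client is available".

```python
def choose_path(source: str, *, has_appview: bool, has_pds: bool) -> str:
    """Pick the listing path for one actor (hypothetical helper)."""
    if source not in ("auto", "pds", "appview"):
        raise ValueError(f"unknown source: {source!r}")
    # "auto" prefers the appview when one is available.
    if source == "appview" or (source == "auto" and has_appview):
        if not has_appview:
            raise ValueError("appview requested but no endpoint or client given")
        return "appview"
    # Otherwise fall back to direct PDS enumeration.
    if not has_pds:
        raise ValueError("no PDS endpoint or client available")
    return "pds"


print(choose_path("auto", has_appview=True, has_pds=True))
print(choose_path("auto", has_appview=False, has_pds=True))
```

Under these assumptions, passing appview (or appview_client) is what switches source="auto" from the PDS path to the appview path.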

table_of_contents

table_of_contents(
    actor: str,
    *,
    source: str = "auto",
    counts: bool = False,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
) -> RepoTableOfContents

Read an actor's repository inventory.

Uses describe_repo to list the collections present in the repo without enumerating records. Counts are filled only when counts is set, since counting drains every collection. This path is always PDS-backed; the source argument is accepted for API symmetry and validated.

PARAMETER DESCRIPTION
actor

A handle or DID.

TYPE: str

source

Accepted for symmetry with list_datasets; the inventory is PDS-backed.

TYPE: str DEFAULT: 'auto'

counts

Whether to fill per-collection record counts (drains each collection).

TYPE: bool DEFAULT: False

resolver

An injected identity resolver.

TYPE: IdentityResolver or None DEFAULT: None

pds_client

An injected PDS client.

TYPE: PdsClient or None DEFAULT: None

RETURNS DESCRIPTION
RepoTableOfContents

The repository inventory.

RAISES DESCRIPTION
ValueError

If source is unknown, or no PDS endpoint or client is available.

Federated discovery

Network discovery without a central index: given a seed of handles or DIDs, list every actor's datasets and merge them, and answer ontology-anchored queries across the seed.

lairs.discovery.federated

Federated discovery: fan out over a seed of actors.

Network discovery without a central index: given a seed of handles or DIDs (a lab roster, a leaderboard, a curated list), list every actor's datasets and merge them. Per-actor transport and resolution failures are isolated so one unreachable actor does not abort the sweep.

discover_datasets

discover_datasets(
    actors: Sequence[str],
    *,
    source: str = "auto",
    appview: str | None = None,
    filters: DatasetFilter | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

List datasets across a seed of actors, deduplicated by corpus AT-URI.

Each actor is listed through lairs.discovery.actor.list_datasets. A single resolver is shared across the whole seed, so identity lookups (handle to DID, and the DID document that carries the PDS endpoint) are cached and the underlying HTTP client is opened once. An injected resolver is reused as is; when none is given, a throwaway resolver is created for the sweep and closed before returning. Duplicate corpora (the same AT-URI seen via more than one actor or source) collapse to the first occurrence. A per-actor transport or resolution failure is skipped, so the sweep is best-effort; a ValueError (an unknown source or a missing endpoint) propagates.

PARAMETER DESCRIPTION
actors

The seed handles or DIDs to search across.

TYPE: collections.abc.Sequence of str

source

One of "auto", "pds", or "appview".

TYPE: str DEFAULT: 'auto'

appview

An appview base URL; enables the appview path under auto.

TYPE: str or None DEFAULT: None

filters

Facet and text filters, applied per actor.

TYPE: DatasetFilter or None DEFAULT: None

resolver

An injected identity resolver, shared across the seed.

TYPE: IdentityResolver or None DEFAULT: None

pds_client

An injected PDS client.

TYPE: PdsClient or None DEFAULT: None

appview_client

An injected appview client.

TYPE: AppviewClient or None DEFAULT: None

RETURNS DESCRIPTION
tuple of DatasetSummary

The merged, deduplicated summaries, in seed-then-corpus order.
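
The merge semantics (first occurrence wins, per-actor failures skipped, ValueError propagated) can be sketched in a few lines. This is a sketch under stated assumptions: ActorUnreachable is a hypothetical stand-in for whatever transport or resolution error the clients raise, and plain AT-URI strings stand in for DatasetSummary rows.

```python
from collections.abc import Callable, Sequence


class ActorUnreachable(Exception):
    """Hypothetical stand-in for a per-actor transport or resolution failure."""


def fan_out(
    actors: Sequence[str],
    list_one: Callable[[str], Sequence[str]],
) -> tuple[str, ...]:
    """Merge per-actor listings, keeping the first occurrence of each URI."""
    seen: set[str] = set()
    merged: list[str] = []
    for actor in actors:
        try:
            uris = list_one(actor)
        except ActorUnreachable:
            continue  # best-effort: one unreachable actor does not abort the sweep
        # A ValueError is deliberately not caught, so it propagates.
        for uri in uris:
            if uri not in seen:
                seen.add(uri)
                merged.append(uri)
    return tuple(merged)
```

The result keeps seed order first and each actor's own listing order second, which matches the seed-then-corpus order described above.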

datasets_using_ontology

datasets_using_ontology(
    ontology_uri: str,
    actors: Sequence[str],
    *,
    source: str = "auto",
    appview: str | None = None,
    resolver: IdentityResolver | None = None,
    pds_client: PdsClient | None = None,
    appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]

Find datasets that use a given ontology, across a seed of actors.

The lexicons offer no ontology-to-corpus query, so this fans out over the seed (see discover_datasets) and keeps the corpora whose ontology_refs contain ontology_uri. Cross-repo reach is therefore bounded by the seed; the Tier 3 index resolves this generally.

PARAMETER DESCRIPTION
ontology_uri

The ontology AT-URI to match against each corpus's ontology_refs.

TYPE: str

actors

The seed handles or DIDs to search across.

TYPE: collections.abc.Sequence of str

source

One of "auto", "pds", or "appview".

TYPE: str DEFAULT: 'auto'

appview

An appview base URL.

TYPE: str or None DEFAULT: None

resolver

An injected identity resolver.

TYPE: IdentityResolver or None DEFAULT: None

pds_client

An injected PDS client.

TYPE: PdsClient or None DEFAULT: None

appview_client

An injected appview client.

TYPE: AppviewClient or None DEFAULT: None

RETURNS DESCRIPTION
tuple of DatasetSummary

The datasets in the seed that reference the ontology.
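
The ontology match itself is a membership test over each summary's ontology_refs. A minimal sketch, using a hypothetical two-field Summary dataclass in place of DatasetSummary:

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    uri: str
    ontology_refs: tuple[str, ...] = ()


def using_ontology(
    summaries: Iterable[Summary], ontology_uri: str
) -> tuple[Summary, ...]:
    """Keep the summaries whose ontology_refs contain ontology_uri."""
    return tuple(s for s in summaries if ontology_uri in s.ontology_refs)
```

Because the input is whatever the seed fan-out returned, a corpus held by an actor outside the seed is never seen by this filter.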

Cross-repo, ref-anchored link queries are anchored on a content reference (a corpus, an eprint) rather than a repository, so an appview that indexes the network can answer them across actors.

Cross-repo, ref-anchored link queries.

These queries are anchored on a content reference (a corpus, an eprint) rather than a repository, so an appview that indexes the network answers them across every repo: who, anywhere, asserts membership in this corpus, or links this eprint to data. They require an appview endpoint or client, since a PDS can only answer for its own repository.

members_of_corpus

members_of_corpus(
    corpus_uri: str,
    *,
    appview: str | None = None,
    appview_client: AppviewClient | None = None,
    split: str | None = None,
) -> tuple[Membership, ...]

List membership records that point at a corpus, across all repos.

PARAMETER DESCRIPTION
corpus_uri

The corpus AT-URI to find members of.

TYPE: str

appview

An appview base URL.

TYPE: str or None DEFAULT: None

appview_client

An injected appview client.

TYPE: AppviewClient or None DEFAULT: None

split

Restrict to a dataset split (for example "train").

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
tuple of lairs.records._generated.corpus.Membership

The membership records asserted for the corpus.

RAISES DESCRIPTION
ValueError

If no appview endpoint or client is available.

datasets_for_eprint

datasets_for_eprint(
    eprint_uri: str,
    *,
    appview: str | None = None,
    appview_client: AppviewClient | None = None,
    data_kind: str | None = None,
) -> tuple[DataLink, ...]

List data links that point at an eprint, across all repos.

PARAMETER DESCRIPTION
eprint_uri

The eprint AT-URI to find data links for.

TYPE: str

appview

An appview base URL.

TYPE: str or None DEFAULT: None

appview_client

An injected appview client.

TYPE: AppviewClient or None DEFAULT: None

data_kind

Restrict to a data-kind slug.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
tuple of lairs.records._generated.eprint.DataLink

The data-link records that reference the eprint.

RAISES DESCRIPTION
ValueError

If no appview endpoint or client is available.

The discovery index

A searchable index of dataset cards over a panproto Repository. The DiscoveryIndex is a thin behavioral wrapper around the repository, which remains the source of truth.

lairs.discovery.index

The discovery index: dataset cards over a panproto Repository.

DiscoveryIndex is a thin behavioral wrapper around the panproto-backed lairs.store.repository.Repository. The Repository is the source of truth: it stores DatasetCard, SyncCursor, and RepoCrawlState records under the local lairs.index.* namespace, versioned and content-addressed. Re-saving an unchanged card is a no-op at commit time, so dedup and idempotent re-crawl are free, and repo.diff answers "what datasets changed between two snapshots".

CardDiff

Bases: Model

Dataset cards added, removed, or changed between two index revisions.

The members are corpus AT-URIs when the card is resolvable in the current index, falling back to the card's own index URI otherwise (for example a removed card).

ATTRIBUTE DESCRIPTION
added

Corpora whose card appeared between the revisions.

TYPE: tuple of str

removed

Corpora whose card disappeared between the revisions.

TYPE: tuple of str

changed

Corpora whose card content changed between the revisions.

TYPE: tuple of str

DiscoveryIndex

DiscoveryIndex(repo: Repository)

A panproto-backed store of dataset cards and crawl bookkeeping.

PARAMETER DESCRIPTION
repo

The backing Repository that holds the index records.

TYPE: Repository

repo property

repo: Repository

Return the backing Repository.

RETURNS DESCRIPTION
Repository

The Repository that holds the index records.

init classmethod

init(path: Path) -> Self

Create a new index Repository at path.

PARAMETER DESCRIPTION
path

The directory to create the index Repository in.

TYPE: Path

RETURNS DESCRIPTION
DiscoveryIndex

The new index.

open classmethod

open(path: Path) -> Self

Open an existing index Repository at path.

PARAMETER DESCRIPTION
path

The directory of an existing index Repository.

TYPE: Path

RETURNS DESCRIPTION
DiscoveryIndex

The opened index.

put_card

put_card(card: DatasetCard) -> str

Stage a dataset card, keyed by its deterministic index URI.

PARAMETER DESCRIPTION
card

The card to store.

TYPE: DatasetCard

RETURNS DESCRIPTION
str

The card's index AT-URI.

get_card

get_card(corpus_uri: str) -> DatasetCard | None

Load the card for a corpus, or None when it is not indexed.

PARAMETER DESCRIPTION
corpus_uri

The corpus AT-URI.

TYPE: str

RETURNS DESCRIPTION
DatasetCard or None

The stored card, or None.

remove_card

remove_card(corpus_uri: str) -> bool

Remove a corpus's card from the index, returning whether one existed.

Stages the card's removal through the backing Repository so the card is absent from cards, get_card, and search, and a revision-to-revision diff_cards reports it in removed once the removal is committed. Removing a card that is not indexed is a no-op.

PARAMETER DESCRIPTION
corpus_uri

The corpus AT-URI whose card to remove.

TYPE: str

RETURNS DESCRIPTION
bool

True if a card was removed, False if none was indexed.

cards

cards() -> list[DatasetCard]

Load every dataset card in the index.

RETURNS DESCRIPTION
list of DatasetCard

All stored cards, in index key order.

card_pool

card_pool() -> ModelPool

Load every card into a ModelPool keyed by its index URI.

RETURNS DESCRIPTION
ModelPool

A pool of the index's cards, for cross-reference traversal.

get_cursor

get_cursor(relay: str) -> SyncCursor | None

Load the firehose cursor for a relay, or None.

PARAMETER DESCRIPTION
relay

The firehose endpoint.

TYPE: str

RETURNS DESCRIPTION
SyncCursor or None

The stored cursor, or None.

put_cursor

put_cursor(cursor: SyncCursor) -> None

Stage a firehose cursor.

PARAMETER DESCRIPTION
cursor

The cursor to store.

TYPE: SyncCursor

get_crawl_state

get_crawl_state(did: str) -> RepoCrawlState | None

Load the crawl state for a repository, or None.

PARAMETER DESCRIPTION
did

The repository DID.

TYPE: str

RETURNS DESCRIPTION
RepoCrawlState or None

The stored crawl state, or None.

put_crawl_state

put_crawl_state(state: RepoCrawlState) -> None

Stage a repository crawl state.

PARAMETER DESCRIPTION
state

The crawl state to store.

TYPE: RepoCrawlState

commit

commit(message: str) -> str

Commit the staged index records.

PARAMETER DESCRIPTION
message

The commit message.

TYPE: str

RETURNS DESCRIPTION
str

The new commit revision.

tag

tag(name: str, *, revision: str | None = None) -> None

Tag an index revision (the head by default).

PARAMETER DESCRIPTION
name

The tag name.

TYPE: str

revision

The revision to tag; defaults to the head.

TYPE: str or None DEFAULT: None

head

head() -> str | None

Return the head commit revision, or None when empty.

RETURNS DESCRIPTION
str or None

The head revision.

log

log() -> list[dict[str, JsonValue]]

Return the commit log, newest first.

RETURNS DESCRIPTION
list of dict

The commit log entries.

diff_cards

diff_cards(base: str, head: str) -> CardDiff

Diff dataset cards between two index revisions.

PARAMETER DESCRIPTION
base

The base revision.

TYPE: str

head

The head revision.

TYPE: str

RETURNS DESCRIPTION
CardDiff

The added, removed, and changed corpora between the revisions.
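
The three buckets of a card diff can be illustrated over two plain snapshots. This is only a sketch of the idea, not how the Repository computes it: each snapshot is assumed to be a dict from card key to a content digest, and the sorted ordering of each bucket is an assumption made here for determinism.

```python
def diff_snapshots(
    base: dict[str, str], head: dict[str, str]
) -> dict[str, tuple[str, ...]]:
    """Classify card keys as added, removed, or changed between two snapshots."""
    added = tuple(sorted(k for k in head if k not in base))
    removed = tuple(sorted(k for k in base if k not in head))
    # A key present on both sides counts as changed only if its digest differs.
    changed = tuple(sorted(k for k in head if k in base and head[k] != base[k]))
    return {"added": added, "removed": removed, "changed": changed}


base = {"card-1": "digest-a", "card-2": "digest-b", "card-3": "digest-c"}
head = {"card-1": "digest-a", "card-2": "digest-x", "card-4": "digest-d"}
print(diff_snapshots(base, head))
```

An unchanged card has the same digest on both sides and lands in no bucket, which is why re-saving identical content does not show up in a diff.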

Index ingest

The backfill crawl and firehose tail that populate the index. Both pipelines write through the panproto Repository, recording one dataset card per discovered corpus along with the resumable cursor and per-repo crawl state.

lairs.discovery.ingest

Index ingest pipelines: backfill crawl and firehose tail.

Both pipelines write through the panproto Repository: every discovered corpus becomes a DatasetCard, and the resumable cursor and per-repo crawl state are records too, so progress is versioned. Cards are only re-written when their content changes, preserving content-addressed dedup; bounds and skips are always logged in the returned CrawlReport.

RepoDescriber

Bases: Protocol

A source of repository descriptions (satisfied by PdsClient).

describe_repo

describe_repo(repo: str) -> RepoDescription

Return the description of a repository.

CorpusLister

Bases: Protocol

A source of a repository's records (satisfied by PdsClient).

list_records

list_records(
    repo: str, collection: str
) -> Iterator[RecordEnvelope]

Enumerate a repository's records in a collection.

build_index

build_index(
    index: DiscoveryIndex,
    dids: Iterable[str],
    *,
    describe: RepoDescriber,
    list_corpora: CorpusLister,
    endpoint: str,
    max_repos: int | None = None,
    message: str = "backfill crawl",
) -> CrawlReport

Crawl repositories on one endpoint and index their corpora.

For each DID, describe_repo reveals whether the repo holds the corpus collection; if so, its corpora are listed and turned into cards. Per-repo crawl state is recorded so a re-run resumes, and a max_repos bound is logged rather than silently applied.

PARAMETER DESCRIPTION
index

The index to write into.

TYPE: DiscoveryIndex

dids

The repository DIDs to crawl (for example PdsClient.list_repos()).

TYPE: collections.abc.Iterable of str

describe

A repository-description source bound to endpoint.

TYPE: RepoDescriber

list_corpora

A record-listing source bound to endpoint.

TYPE: CorpusLister

endpoint

The PDS or relay endpoint the repos are read from.

TYPE: str

max_repos

A bound on repositories visited; hitting it is logged.

TYPE: int or None DEFAULT: None

message

The commit message for the crawl snapshot.

TYPE: str DEFAULT: 'backfill crawl'

RETURNS DESCRIPTION
CrawlReport

Counts and skip reasons for the crawl.

update_index

update_index(
    index: DiscoveryIndex,
    relay: str,
    *,
    source_endpoint: str | None = None,
    limit: int | None = None,
    commit_every: int = _DEFAULT_COMMIT_EVERY,
) -> CrawlReport

Tail a relay's firehose, refreshing cards for corpus commits.

Resumes from the stored SyncCursor for the relay, indexes each corpus create or update, removes the card for each corpus delete (so the local index does not drift stale), and checkpoints the cursor and commits every commit_every events. limit bounds the events processed, which makes a live tail testable. A delete of a corpus that is not indexed is logged in skipped rather than removed; a removed card is reported in CrawlReport.cards_removed.

PARAMETER DESCRIPTION
index

The index to write into.

TYPE: DiscoveryIndex

relay

The firehose endpoint.

TYPE: str

source_endpoint

The endpoint recorded as each card's provenance; defaults to relay.

TYPE: str or None DEFAULT: None

limit

A bound on events processed; None tails indefinitely.

TYPE: int or None DEFAULT: None

commit_every

How many events to process between cursor checkpoints.

TYPE: int DEFAULT: _DEFAULT_COMMIT_EVERY

RETURNS DESCRIPTION
CrawlReport

Counts and skip reasons for the firehose pass.

In-memory search over the discovery index: the primary, dependency-free query path that loads dataset cards, filters them with plain predicates, and ranks the matches.

lairs.discovery.query

In-memory search over the discovery index.

The primary, dependency-free query path: load DatasetCard records and filter them with plain predicates, then rank. Discovery is at dataset scale (thousands of corpora), so a linear scan is fast and an index server is unwarranted; the optional DuckDB accelerator (see accelerator) is only for larger scans.

SearchQuery

Bases: Model

A structured, serializable query over dataset cards.

ATTRIBUTE DESCRIPTION
text

A case-insensitive substring matched against name and description.

TYPE: str or None

domain

A domain slug to match.

TYPE: str or None

language

A language tag matched against the primary or listed languages.

TYPE: str or None

license

A license identifier to match.

TYPE: str or None

min_expressions

Keep cards with at least this many expressions.

TYPE: int or None

max_expressions

Keep cards with at most this many expressions.

TYPE: int or None

annotation_metric

Keep cards declaring this quality metric.

TYPE: str or None

min_annotation_rounds

Keep cards declaring at least this many annotation rounds.

TYPE: int or None

SearchHit

Bases: Model

A matched dataset card with its ranking score.

ATTRIBUTE DESCRIPTION
card

The matched card.

TYPE: DatasetCard

score

The ranking score; higher ranks first.

TYPE: float

search

search(
    cards: Iterable[DatasetCard], query: SearchQuery
) -> list[SearchHit]

Filter and rank dataset cards against a query.

PARAMETER DESCRIPTION
cards

The cards to search (for example DiscoveryIndex.cards()).

TYPE: collections.abc.Iterable of DatasetCard

query

The query to apply.

TYPE: SearchQuery

RETURNS DESCRIPTION
list of SearchHit

The matching cards, ranked by score then name.

Query accelerator

A rebuildable DuckDB pre-filter over the index. Cards are materialized to Parquet and pre-filtered with SQL, then the matching cards are loaded from the index and ranked by the in-memory scorer, so the result is identical to the plain search.

lairs.discovery.accelerator

DuckDB query accelerator for the discovery index.

A rebuildable, derived view over the panproto index: cards are materialized to Parquet and pre-filtered with DuckDB SQL, then the matching cards are loaded from the index (the source of truth) and ranked by the in-memory scorer, so the result is identical to query.search. The DuckDB pre-filter is a relaxation of the full predicate (it never excludes a true match), and the final search pass applies the exact predicate and ranking. The Parquet is never authoritative and can be rebuilt from the index at any time.

This module imports DuckDB and pyarrow at module top; it is reached explicitly as lairs.discovery.accelerator so that importing lairs does not pull DuckDB into every process.

materialize_cards

materialize_cards(
    index: DiscoveryIndex, out_dir: Path
) -> Path

Write the index's cards to a Parquet view, returning its path.

The view is derived and rebuildable: it is regenerated from index.cards() on each call and is never the source of truth.

PARAMETER DESCRIPTION
index

The index whose cards to materialize.

TYPE: DiscoveryIndex

out_dir

The directory to write the Parquet view into.

TYPE: Path

RETURNS DESCRIPTION
Path

The path of the written Parquet file.

search_accelerated

search_accelerated(
    index: DiscoveryIndex,
    query: SearchQuery,
    *,
    out_dir: Path,
) -> list[SearchHit]

Search the index through the DuckDB-accelerated pre-filter.

Materializes the card view, narrows it with a DuckDB SQL pre-filter, loads the surviving cards from the index, and ranks them with the same scorer as lairs.discovery.query.search, so the result is identical to an in-memory search over every card.

PARAMETER DESCRIPTION
index

The index to search.

TYPE: DiscoveryIndex

query

The query to apply.

TYPE: SearchQuery

out_dir

The directory for the rebuildable Parquet view.

TYPE: Path

RETURNS DESCRIPTION
list of SearchHit

The matching cards, ranked identically to the in-memory search.
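
The "relaxed pre-filter, then exact predicate" pattern can be demonstrated without DuckDB. In the sketch below the standard-library sqlite3 module stands in for DuckDB and an in-memory table stands in for the Parquet view; the card fields and function names are hypothetical. The SQL checks only one facet, so it may keep false positives but never drops a true match, and the exact predicate then runs over the survivors.

```python
import sqlite3

# Hypothetical cards keyed by corpus URI; this dict plays the role of the index.
CARDS = {
    "at://a/1": {"domain": "news", "expressions": 500},
    "at://a/2": {"domain": "news", "expressions": 50},
    "at://b/1": {"domain": "legal", "expressions": 900},
}


def exact(card: dict, domain: str, min_expressions: int) -> bool:
    """The full predicate, as the in-memory search would apply it."""
    return card["domain"] == domain and card["expressions"] >= min_expressions


def prefilter(domain: str) -> list[str]:
    """Relaxed SQL pre-filter: checks the domain facet only."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cards (uri TEXT, domain TEXT)")
    con.executemany(
        "INSERT INTO cards VALUES (?, ?)",
        [(uri, card["domain"]) for uri, card in CARDS.items()],
    )
    rows = con.execute("SELECT uri FROM cards WHERE domain = ?", (domain,)).fetchall()
    con.close()
    return [uri for (uri,) in rows]


def search_plain(domain: str, min_expressions: int) -> list[str]:
    return sorted(u for u, c in CARDS.items() if exact(c, domain, min_expressions))


def search_prefiltered(domain: str, min_expressions: int) -> list[str]:
    # Survivors are reloaded from the source of truth, then checked exactly.
    return sorted(
        u for u in prefilter(domain) if exact(CARDS[u], domain, min_expressions)
    )
```

Because the final pass reuses the exact predicate, the two search functions agree for any query, whatever the pre-filter lets through.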

Cards

The index record models and the corpus-to-card builder: the DatasetCard stored per discovered corpus and the crawl report that summarizes an ingest run.

lairs.discovery.cards

Index record models and the corpus-to-card builder.

The discovery index stores these dx.Model records in a panproto Repository: a DatasetCard per discovered corpus (a denormalized, searchable summary with provenance and freshness), a SyncCursor per firehose relay, and a RepoCrawlState per crawled repository. These are client-side index bookkeeping under a local lairs.index.* namespace, never published Layers records and never code-generated.

INDEX_DID module-attribute

INDEX_DID = 'did:lairs:index'

The sentinel authority for the local discovery index's records.

CARD_NSID module-attribute

CARD_NSID = 'lairs.index.datasetCard'

The collection NSID for dataset cards in the local index.

CURSOR_NSID module-attribute

CURSOR_NSID = 'lairs.index.syncCursor'

The collection NSID for firehose sync cursors in the local index.

CRAWL_STATE_NSID module-attribute

CRAWL_STATE_NSID = 'lairs.index.repoCrawlState'

The collection NSID for per-repo crawl state in the local index.

CardProvenance

Bases: Model

Where a dataset card came from, for trust and refresh.

ATTRIBUTE DESCRIPTION
source_did

The corpus author's DID.

TYPE: str

source_endpoint

The PDS or appview base URL the card was read from.

TYPE: str

discovered_via

How the card entered the index ("firehose", "crawl", "seed").

TYPE: str

source_handle

The author's handle at discovery time, when known.

TYPE: str or None

CardFreshness

Bases: Model

Firehose and crawl bookkeeping so freshness and resume are first-class.

ATTRIBUTE DESCRIPTION
first_seen_at

When this corpus first entered the index.

TYPE: datetime

last_updated_at

When the card content last changed.

TYPE: datetime

last_seen_seq

The last firehose sequence number that touched the corpus.

TYPE: int or None

last_seen_rev

The last repository commit revision observed for the corpus.

TYPE: str or None

record_cid

The CID of the corpus record at the last refresh.

TYPE: str or None

DatasetCard

Bases: Model

A searchable, denormalized index entry for one corpus.

ATTRIBUTE DESCRIPTION
summary

The corpus-derived listing projection.

TYPE: DatasetSummary

provenance

Where the card came from.

TYPE: CardProvenance

freshness

First-seen and last-updated bookkeeping.

TYPE: CardFreshness

annotation_rounds

The number of annotation rounds declared, when present.

TYPE: int or None

adjudication_method

The adjudication method slug, when present.

TYPE: str or None

redundancy_count

The declared annotator redundancy, when present.

TYPE: int or None

quality_metrics

The quality-criterion metric slugs declared for the corpus.

TYPE: tuple of str

SyncCursor

Bases: Model

A resumable firehose position for one relay.

ATTRIBUTE DESCRIPTION
relay

The firehose endpoint this cursor is for.

TYPE: str

seq

The last fully-processed firehose sequence number.

TYPE: int

updated_at

When the cursor was last written.

TYPE: datetime

RepoCrawlState

Bases: Model

Per-repository crawl bookkeeping so a re-run skips finished repos.

ATTRIBUTE DESCRIPTION
did

The crawled repository DID.

TYPE: str

endpoint

The PDS endpoint the repo was crawled from.

TYPE: str

has_layers_corpus

Whether the repo carried the corpus collection.

TYPE: bool

corpora_found

The number of corpora indexed from the repo.

TYPE: int

last_crawled_at

When the repo was last crawled.

TYPE: datetime

repos_cursor

A listRepos pagination checkpoint, when crawling a relay.

TYPE: str or None

CrawlReport

Bases: Model

A summary of a crawl or firehose pass, logging every skip.

ATTRIBUTE DESCRIPTION
repos_seen

The number of repositories visited.

TYPE: int

repos_with_corpora

The number of repositories that held a corpus collection.

TYPE: int

cards_built

The number of cards built or refreshed.

TYPE: int

cards_unchanged

The number of cards that were already current (dedup hits).

TYPE: int

cards_removed

The number of cards removed in response to a corpus-deletion commit.

TYPE: int

skipped

Human-readable skip reasons, including any bound that was hit.

TYPE: tuple of str

revision

The commit revision the pass produced, when any.

TYPE: str or None

card_uri

card_uri(corpus_uri: str) -> str

Build the deterministic index AT-URI for a corpus's card.

The same corpus always maps to the same card key, so re-indexing is idempotent and content-addressed dedup falls out for free.

PARAMETER DESCRIPTION
corpus_uri

The corpus AT-URI.

TYPE: str

RETURNS DESCRIPTION
str

The card's index AT-URI under the local lairs.index.* namespace.
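
One way to obtain a deterministic card key is to derive the record key from a hash of the corpus AT-URI. The sketch below is an assumption for illustration only: INDEX_DID and CARD_NSID are the documented constants, but the SHA-256 record-key scheme and the sketch_card_uri name are hypothetical and may differ from what card_uri actually does.

```python
import hashlib

INDEX_DID = "did:lairs:index"
CARD_NSID = "lairs.index.datasetCard"


def sketch_card_uri(corpus_uri: str) -> str:
    """Map a corpus AT-URI to a stable index AT-URI (assumed rkey scheme)."""
    # Hashing keeps the key a fixed length and free of URI punctuation.
    rkey = hashlib.sha256(corpus_uri.encode("utf-8")).hexdigest()[:24]
    return f"at://{INDEX_DID}/{CARD_NSID}/{rkey}"


print(sketch_card_uri("at://did:plc:example/pub.layers.corpus.corpus/demo"))
```

Since the key depends only on the corpus URI, storing the same corpus twice addresses the same record, which is the property that makes re-indexing idempotent.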

card_from_corpus

card_from_corpus(
    corpus_uri: str,
    corpus: Corpus,
    *,
    provenance: CardProvenance,
    freshness: CardFreshness,
) -> DatasetCard

Build a DatasetCard from a discovered corpus and its provenance.

PARAMETER DESCRIPTION
corpus_uri

The corpus AT-URI.

TYPE: str

corpus

The discovered corpus record.

TYPE: Corpus

provenance

Where the corpus was discovered.

TYPE: CardProvenance

freshness

First-seen and last-updated bookkeeping for the card.

TYPE: CardFreshness

RETURNS DESCRIPTION
DatasetCard

The denormalized index card.

Result models

The value types for discovery results: a denormalized corpus summary, a repository table of contents, a collection count, and a facet filter.

lairs.discovery.models

Discovery result models.

dx.Model value types for dataset discovery: a denormalized corpus summary, a repository table of contents, and a facet filter. These are the shapes the discovery API returns and the CLI renders. DatasetSummary is also reused as the corpus-derived core of the Tier 3 index card.

DatasetSummary

Bases: Model

A denormalized corpus summary for discovery listings.

A flat, readable projection of a pub.layers.corpus.corpus record plus the actor and source it was found through, so a listing renders one row per dataset without dumping records.

ATTRIBUTE DESCRIPTION
uri

The corpus AT-URI.

TYPE: str

did

The owning repository DID.

TYPE: str

name

The corpus name.

TYPE: str

handle

The owning handle, when it was resolved.

TYPE: str or None

description

The corpus description.

TYPE: str or None

domain

The corpus domain slug.

TYPE: str or None

domain_uri

The AT-URI of the corpus domain definition node.

TYPE: str or None

language

The primary BCP-47 language tag.

TYPE: str or None

languages

All languages represented in the corpus.

TYPE: tuple of str

license

The license identifier.

TYPE: str or None

version

The corpus version label.

TYPE: str or None

expression_count

The number of expressions in the corpus.

TYPE: int or None

created_at

The ISO 8601 creation timestamp.

TYPE: str or None

ontology_refs

The ontology AT-URIs the corpus uses.

TYPE: tuple of str

eprint_refs

The eprint AT-URIs the corpus links.

TYPE: tuple of str

has_adjudication

Whether the corpus declares an adjudication step.

TYPE: bool

source_endpoint

The PDS or appview the summary was read from.

TYPE: str or None

CollectionCount

Bases: Model

A repository collection NSID with an optional record count.

ATTRIBUTE DESCRIPTION
nsid

The collection NSID.

TYPE: str

count

The number of records in the collection, when counted.

TYPE: int or None

is_dataset_like

Whether the collection holds dataset-shaped records.

TYPE: bool

RepoTableOfContents

Bases: Model

An actor's repository inventory: identity plus per-collection counts.

ATTRIBUTE DESCRIPTION
did

The repository DID.

TYPE: str

handle

The repository handle, when known.

TYPE: str or None

pds_endpoint

The PDS endpoint the inventory was read from.

TYPE: str or None

collections

The collections present in the repository.

TYPE: tuple of CollectionCount

dataset_collections

The dataset-like collection NSIDs, highlighted for convenience.

TYPE: tuple of str

DatasetFilter

Bases: Model

A facet and text filter over dataset summaries.

Server-side facets (language, domain) are pushed into listCorpora parameters on the appview path; the rest are applied client-side over the mapped summaries.

ATTRIBUTE DESCRIPTION
language

Keep corpora whose primary or listed languages include this tag.

TYPE: str or None

domain

Keep corpora with this domain slug.

TYPE: str or None

license

Keep corpora with this license identifier.

TYPE: str or None

min_expression_count

Keep corpora with at least this many expressions.

TYPE: int or None

max_expression_count

Keep corpora with at most this many expressions.

TYPE: int or None

text

Keep corpora whose name or description contains this substring.

TYPE: str or None

has_adjudication

Keep corpora that do (or do not) declare an adjudication step.

TYPE: bool or None

Summaries

The corpus-to-summary projection and dataset filtering: projects a generated Corpus record into the flat summary shape, evaluates a filter over a summary, and extracts the server-side facets the appview supports.

lairs.discovery.summary

Corpus-to-summary mapping and dataset filtering.

Projects a generated Corpus record (or a record envelope carrying one) into the flat DatasetSummary discovery shape, evaluates a DatasetFilter over a summary, and extracts the server-side facets listCorpora supports.

summary_from_corpus

summary_from_corpus(
    corpus: Corpus,
    *,
    uri: str,
    did: str,
    handle: str | None = None,
    source_endpoint: str | None = None,
) -> DatasetSummary

Project a corpus record into a DatasetSummary.

PARAMETER DESCRIPTION
corpus

The corpus record to project.

TYPE: Corpus

uri

The corpus AT-URI.

TYPE: str

did

The owning repository DID.

TYPE: str

handle

The owning handle, when known.

TYPE: str or None DEFAULT: None

source_endpoint

The PDS or appview the corpus was read from.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
DatasetSummary

The flat discovery summary.

corpus_from_value

corpus_from_value(value: JsonValue) -> Corpus | None

Decode a record value into a Corpus, or None on failure.

The wire-only $type discriminator is dropped before validation, since the generated models do not declare it.

PARAMETER DESCRIPTION
value

The record value to decode.

TYPE: JsonValue

RETURNS DESCRIPTION
Corpus or None

The decoded corpus, or None when the value is not a decodable corpus.

summary_from_envelope

summary_from_envelope(
    envelope: RecordEnvelope,
    *,
    did: str | None = None,
    handle: str | None = None,
    source_endpoint: str | None = None,
) -> DatasetSummary | None

Decode a corpus envelope into a DatasetSummary.

Returns None when the envelope is not a corpus record or its value does not validate, so a single bad record does not abort a listing.

PARAMETER DESCRIPTION
envelope

The record envelope to decode.

TYPE: RecordEnvelope

did

The owning DID; derived from the envelope URI when omitted.

TYPE: str or None DEFAULT: None

handle

The owning handle, when known.

TYPE: str or None DEFAULT: None

source_endpoint

The PDS or appview the envelope was read from.

TYPE: str or None DEFAULT: None

RETURNS DESCRIPTION
DatasetSummary or None

The summary, or None when the record is not a decodable corpus.

matches

matches(
    summary: DatasetSummary, flt: DatasetFilter | None
) -> bool

Return whether a summary satisfies a filter.

PARAMETER DESCRIPTION
summary

The summary to test.

TYPE: DatasetSummary

flt

The filter; None matches everything.

TYPE: DatasetFilter or None

RETURNS DESCRIPTION
bool

True when the summary passes every set facet.

listcorpora_params

listcorpora_params(
    repo: str, flt: DatasetFilter | None
) -> QueryParams

Build listCorpora query parameters, pushing the server-side facets.

Only language and domain are server-side facets on listCorpora; every other facet is applied client-side over the mapped summaries.

PARAMETER DESCRIPTION
repo

The repository DID or handle to list.

TYPE: str

flt

The filter whose server-side facets to push.

TYPE: DatasetFilter or None

RETURNS DESCRIPTION
QueryParams

The query parameters for corpus.listCorpora.