Dataset discovery¶
Discovery composes identity resolution, the PDS and appview clients, and the panproto store into a discovery surface over the Layers network. It spans three tiers: (1) list a single actor's datasets and repository table of contents; (2) fan out over a seed of actors and resolve cross-repo, ref-anchored link queries; and (3) build a local, searchable index from the firehose and a backfill crawl. The discovery shapes carry over the same record envelopes used to read a PDS; see the ATProto reference.
Single-actor discovery¶
List an actor's datasets and repository table of contents. Resolves a handle or DID through the identity resolver and lists the actor's corpora as summary rows, preferring an appview when one is available.
lairs.discovery.actor ¶
Single-actor discovery: list an actor's datasets and repo table of contents.
Resolves a handle or DID through IdentityResolver and lists the actor's
corpora as DatasetSummary rows, preferring an appview when one is available
and falling back to direct PDS enumeration. table_of_contents reads a repo's
collection inventory through describe_repo without dumping records.
list_datasets ¶
list_datasets(
actor: str,
*,
source: str = "auto",
appview: str | None = None,
filters: DatasetFilter | None = None,
resolver: IdentityResolver | None = None,
pds_client: PdsClient | None = None,
appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]
List an actor's datasets as summaries.
Resolves actor (handle or DID), lists its corpora through an appview when
available (server-side language/domain facets) or direct PDS
enumeration otherwise, maps each to a DatasetSummary, and applies the
remaining facets client-side.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| actor | str | A handle or DID to list datasets for. |
| source | str | One of |
| appview | str or None | An appview base URL; enables the appview path under |
| filters | DatasetFilter or None | Facet and text filters. |
| resolver | IdentityResolver or None | An injected identity resolver. |
| pds_client | PdsClient or None | An injected PDS client. |
| appview_client | AppviewClient or None | An injected appview client. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of DatasetSummary | The matching dataset summaries, in source order. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If |
table_of_contents ¶
table_of_contents(
actor: str,
*,
source: str = "auto",
counts: bool = False,
resolver: IdentityResolver | None = None,
pds_client: PdsClient | None = None,
) -> RepoTableOfContents
Read an actor's repository inventory.
Uses describe_repo to list the collections present in the repo without
enumerating records. Counts are filled only when counts is set, since
counting drains every collection. This path is always PDS-backed; the
source argument is accepted for API symmetry and validated.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| actor | str | A handle or DID. |
| source | str | Accepted for symmetry with |
| counts | bool | Whether to fill per-collection record counts (drains each collection). |
| resolver | IdentityResolver or None | An injected identity resolver. |
| pds_client | PdsClient or None | An injected PDS client. |

| RETURNS | DESCRIPTION |
|---|---|
| RepoTableOfContents | The repository inventory. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If |
Federated discovery¶
Network discovery without a central index: given a seed of handles or DIDs, list every actor's datasets and merge them, and answer ontology-anchored queries across the seed.
lairs.discovery.federated ¶
Federated discovery: fan out over a seed of actors.
Network discovery without a central index: given a seed of handles or DIDs (a lab roster, a leaderboard, a curated list), list every actor's datasets and merge them. Per-actor transport and resolution failures are isolated so one unreachable actor does not abort the sweep.
discover_datasets ¶
discover_datasets(
actors: Sequence[str],
*,
source: str = "auto",
appview: str | None = None,
filters: DatasetFilter | None = None,
resolver: IdentityResolver | None = None,
pds_client: PdsClient | None = None,
appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]
List datasets across a seed of actors, deduplicated by corpus AT-URI.
Each actor is listed through lairs.discovery.actor.list_datasets.
A single resolver is shared across the whole seed, so identity lookups (handle
to DID, and the DID document that carries the PDS endpoint) are cached and the
underlying HTTP client is opened once. An injected resolver is reused as is;
when none is given, a throwaway resolver is created for the sweep and
closed before returning. Duplicate corpora (the same AT-URI seen via more than
one actor or source) collapse to the first occurrence. A per-actor transport
or resolution failure is skipped, so the sweep is best-effort; a ValueError
(an unknown source or a missing endpoint) propagates.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| actors | Sequence[str] | The seed handles or DIDs to search across. |
| source | str | One of |
| appview | str or None | An appview base URL; enables the appview path under |
| filters | DatasetFilter or None | Facet and text filters, applied per actor. |
| resolver | IdentityResolver or None | An injected identity resolver, shared across the seed. |
| pds_client | PdsClient or None | An injected PDS client. |
| appview_client | AppviewClient or None | An injected appview client. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of DatasetSummary | The merged, deduplicated summaries, in seed-then-corpus order. |
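The merge behavior described above (per-actor failures skipped, duplicates collapsed to the first occurrence) can be sketched without the library. This is an illustration only: `Summary`, `fan_out`, and `fake_lister` are hypothetical stand-ins, and treating `ConnectionError` as the "transport failure" is an assumption, not the library's actual exception type.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in for DatasetSummary (only the fields used here)."""
    uri: str
    name: str


def fan_out(
    actors: Sequence[str],
    list_one: Callable[[str], Iterable[Summary]],
) -> tuple[Summary, ...]:
    """List each actor, skip per-actor transport failures, dedupe by URI."""
    seen: set[str] = set()
    merged: list[Summary] = []
    for actor in actors:
        try:
            rows = list(list_one(actor))
        except ConnectionError:
            # Best-effort: one unreachable actor does not abort the sweep.
            # A ValueError is deliberately not caught, so it propagates.
            continue
        for row in rows:
            if row.uri in seen:
                continue  # the first occurrence wins
            seen.add(row.uri)
            merged.append(row)
    return tuple(merged)


def fake_lister(actor: str) -> list[Summary]:
    """Hypothetical per-actor lister backed by a fixed table."""
    if actor == "down.example":
        raise ConnectionError("unreachable")
    table = {
        "alice.example": [Summary("at://did:example:a/c/1", "A1")],
        "bob.example": [
            Summary("at://did:example:a/c/1", "A1 (mirror)"),
            Summary("at://did:example:b/c/2", "B2"),
        ],
    }
    return table.get(actor, [])


result = fan_out(["alice.example", "down.example", "bob.example"], fake_lister)
```

Because results are appended in seed order and then in each actor's own order, the output keeps the seed-then-corpus ordering the return value describes.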
datasets_using_ontology ¶
datasets_using_ontology(
ontology_uri: str,
actors: Sequence[str],
*,
source: str = "auto",
appview: str | None = None,
resolver: IdentityResolver | None = None,
pds_client: PdsClient | None = None,
appview_client: AppviewClient | None = None,
) -> tuple[DatasetSummary, ...]
Find datasets that use a given ontology, across a seed of actors.
The lexicons offer no ontology-to-corpus query, so this fans out over the
seed (see discover_datasets) and keeps the corpora whose
ontology_refs contain ontology_uri. Cross-repo reach is therefore
bounded by the seed; the Tier 3 index answers the same question without that bound.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| ontology_uri | str | The ontology AT-URI to match against each corpus's |
| actors | Sequence[str] | The seed handles or DIDs to search across. |
| source | str | One of |
| appview | str or None | An appview base URL. |
| resolver | IdentityResolver or None | An injected identity resolver. |
| pds_client | PdsClient or None | An injected PDS client. |
| appview_client | AppviewClient or None | An injected appview client. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of DatasetSummary | The datasets in the seed that reference the ontology. |
Link queries¶
Cross-repo queries anchored on a content reference (a corpus, an eprint) rather than on a repository, so an appview that indexes the network can answer them across actors.
lairs.discovery.links ¶
Cross-repo, ref-anchored link queries.
These queries are anchored on a content reference (a corpus, an eprint) rather than a repository, so an appview that indexes the network answers them across every repo: who, anywhere, asserts membership in this corpus, or links this eprint to data. They require an appview endpoint or client, since a PDS can only answer for its own repository.
members_of_corpus ¶
members_of_corpus(
corpus_uri: str,
*,
appview: str | None = None,
appview_client: AppviewClient | None = None,
split: str | None = None,
) -> tuple[Membership, ...]
List membership records that point at a corpus, across all repos.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| corpus_uri | str | The corpus AT-URI to find members of. |
| appview | str or None | An appview base URL. |
| appview_client | AppviewClient or None | An injected appview client. |
| split | str or None | Restrict to a dataset split (for example |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of lairs.records._generated.corpus.Membership | The membership records asserted for the corpus. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no appview endpoint or client is available. |
datasets_for_eprint ¶
datasets_for_eprint(
eprint_uri: str,
*,
appview: str | None = None,
appview_client: AppviewClient | None = None,
data_kind: str | None = None,
) -> tuple[DataLink, ...]
List data links that point at an eprint, across all repos.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| eprint_uri | str | The eprint AT-URI to find data links for. |
| appview | str or None | An appview base URL. |
| appview_client | AppviewClient or None | An injected appview client. |
| data_kind | str or None | Restrict to a data-kind slug. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple of lairs.records._generated.eprint.DataLink | The data-link records that reference the eprint. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no appview endpoint or client is available. |
The discovery index¶
A searchable index of dataset cards over a panproto Repository. The
DiscoveryIndex is a thin behavioral wrapper around the repository,
which remains the source of truth.
lairs.discovery.index ¶
The discovery index: dataset cards over a panproto Repository.
DiscoveryIndex is a thin behavioral wrapper around the panproto-backed
lairs.store.repository.Repository. The Repository is the source of truth: it
stores DatasetCard, SyncCursor, and RepoCrawlState records under the
local lairs.index.* namespace, versioned and content-addressed. Re-saving an
unchanged card is a no-op at commit time, so dedup and idempotent re-crawl are
free, and repo.diff answers "what datasets changed between two snapshots".
CardDiff ¶
Bases: Model
Dataset cards added, removed, or changed between two index revisions.
The members are corpus AT-URIs when the card is resolvable in the current index, falling back to the card's own index URI otherwise (for example a removed card).
| ATTRIBUTE | DESCRIPTION |
|---|---|
| added | Corpora whose card appeared between the revisions. |
| removed | Corpora whose card disappeared between the revisions. |
| changed | Corpora whose card content changed between the revisions. |
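The added/removed/changed split can be illustrated with two snapshots keyed by corpus URI and compared by content hash. This is a minimal sketch of the idea, not the panproto repo.diff implementation; `content_id` and `diff_snapshots` are hypothetical helpers, and hashing canonical JSON is an assumed stand-in for the store's content addressing.

```python
import hashlib
import json


def content_id(card: dict) -> str:
    """Hash a card's canonical JSON so equal content yields an equal id."""
    canonical = json.dumps(card, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(
    old: dict[str, dict], new: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare two {corpus_uri: card} snapshots by content id."""
    old_ids = {uri: content_id(card) for uri, card in old.items()}
    new_ids = {uri: content_id(card) for uri, card in new.items()}
    shared = old_ids.keys() & new_ids.keys()
    return {
        "added": sorted(new_ids.keys() - old_ids.keys()),
        "removed": sorted(old_ids.keys() - new_ids.keys()),
        "changed": sorted(u for u in shared if old_ids[u] != new_ids[u]),
    }


old = {
    "at://a/1": {"name": "A"},
    "at://b/2": {"name": "B"},
    "at://d/4": {"name": "D"},
}
new = {
    "at://a/1": {"name": "A"},       # unchanged: same content id
    "at://b/2": {"name": "B v2"},    # changed content
    "at://c/3": {"name": "C"},       # newly added
}
diff = diff_snapshots(old, new)
```

An unchanged card hashes to the same id in both snapshots, which is why re-saving it produces no entry in any of the three lists.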
DiscoveryIndex ¶
DiscoveryIndex(repo: Repository)
A panproto-backed store of dataset cards and crawl bookkeeping.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| repo | Repository | The backing Repository that holds the index records. |
repo property ¶
repo: Repository
Return the backing Repository.
| RETURNS | DESCRIPTION |
|---|---|
| Repository | The Repository that holds the index records. |
init classmethod ¶
Create a new index Repository at path.
| PARAMETER | DESCRIPTION |
|---|---|
| path | The directory to create the index Repository in. |

| RETURNS | DESCRIPTION |
|---|---|
| DiscoveryIndex | The new index. |
open classmethod ¶
Open an existing index Repository at path.
| PARAMETER | DESCRIPTION |
|---|---|
| path | The directory of an existing index Repository. |

| RETURNS | DESCRIPTION |
|---|---|
| DiscoveryIndex | The opened index. |
put_card ¶
put_card(card: DatasetCard) -> str
Stage a dataset card, keyed by its deterministic index URI.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| card | DatasetCard | The card to store. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The card's index AT-URI. |
get_card ¶
get_card(corpus_uri: str) -> DatasetCard | None
Load the card for a corpus, or None when it is not indexed.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| corpus_uri | str | The corpus AT-URI. |

| RETURNS | DESCRIPTION |
|---|---|
| DatasetCard or None | The stored card, or |
remove_card ¶
Remove a corpus's card from the index, returning whether one existed.
Stages the card's removal through the backing Repository so the card is
absent from cards, get_card, and search, and a
revision-to-revision diff_cards reports it in removed once the
removal is committed. Removing a card that is not indexed is a no-op.
| PARAMETER | DESCRIPTION |
|---|---|
| corpus_uri | The corpus AT-URI whose card to remove. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | Whether a card was indexed for the corpus. |
cards ¶
cards() -> list[DatasetCard]
Load every dataset card in the index.
| RETURNS | DESCRIPTION |
|---|---|
| list of DatasetCard | All stored cards, in index key order. |
card_pool ¶
card_pool() -> ModelPool
Load every card into a ModelPool keyed by its index URI.
| RETURNS | DESCRIPTION |
|---|---|
| ModelPool | A pool of the index's cards, for cross-reference traversal. |
get_cursor ¶
get_cursor(relay: str) -> SyncCursor | None
Load the firehose cursor for a relay, or None.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| relay | str | The firehose endpoint. |

| RETURNS | DESCRIPTION |
|---|---|
| SyncCursor or None | The stored cursor, or |
put_cursor ¶
put_cursor(cursor: SyncCursor) -> None
Stage a firehose cursor.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| cursor | SyncCursor | The cursor to store. |
get_crawl_state ¶
get_crawl_state(did: str) -> RepoCrawlState | None
Load the crawl state for a repository, or None.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| did | str | The repository DID. |

| RETURNS | DESCRIPTION |
|---|---|
| RepoCrawlState or None | The stored crawl state, or |
put_crawl_state ¶
put_crawl_state(state: RepoCrawlState) -> None
Stage a repository crawl state.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| state | RepoCrawlState | The crawl state to store. |
commit ¶
Commit the staged index records.
| PARAMETER | DESCRIPTION |
|---|---|
| message | The commit message. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The new commit revision. |
tag ¶
Tag an index revision (the head by default).
| PARAMETER | DESCRIPTION |
|---|---|
| name | The tag name. |
| revision | The revision to tag; defaults to the head. |
head ¶
Return the head commit revision, or None when empty.
| RETURNS | DESCRIPTION |
|---|---|
| str or None | The head revision. |
Index ingest¶
The backfill crawl and firehose tail that populate the index. Both pipelines write through the panproto Repository, recording one dataset card per discovered corpus along with the resumable cursor and per-repo crawl state.
lairs.discovery.ingest ¶
Index ingest pipelines: backfill crawl and firehose tail.
Both pipelines write through the panproto Repository: every discovered corpus
becomes a DatasetCard, and the resumable cursor and per-repo crawl state are
records too, so progress is versioned. Cards are only re-written when their
content changes, preserving content-addressed dedup; bounds and skips are always
logged in the returned CrawlReport.
RepoDescriber ¶
Bases: Protocol
A source of repository descriptions (satisfied by PdsClient).
CorpusLister ¶
Bases: Protocol
A source of a repository's records (satisfied by PdsClient).
list_records ¶
list_records(
repo: str, collection: str
) -> Iterator[RecordEnvelope]
Enumerate a repository's records in a collection.
build_index ¶
build_index(
index: DiscoveryIndex,
dids: Iterable[str],
*,
describe: RepoDescriber,
list_corpora: CorpusLister,
endpoint: str,
max_repos: int | None = None,
message: str = "backfill crawl",
) -> CrawlReport
Crawl repositories on one endpoint and index their corpora.
For each DID, describe_repo reveals whether the repo holds the corpus
collection; if so, its corpora are listed and turned into cards. Per-repo
crawl state is recorded so a re-run resumes, and a max_repos bound is
logged rather than silently applied.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| index | DiscoveryIndex | The index to write into. |
| dids | Iterable[str] | The repository DIDs to crawl (for example |
| describe | RepoDescriber | A repository-description source bound to |
| list_corpora | CorpusLister | A record-listing source bound to |
| endpoint | str | The PDS or relay endpoint the repos are read from. |
| max_repos | int or None | A bound on repositories visited; hitting it is logged. |
| message | str | The commit message for the crawl snapshot. |

| RETURNS | DESCRIPTION |
|---|---|
| CrawlReport | Counts and skip reasons for the crawl. |
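A bounded, resumable crawl loop of the kind described above might look like the following. This is a sketch under assumptions, not the build_index source: `Report`, `crawl`, the `done` set standing in for stored crawl state, and the exact skip wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Report:
    """Hypothetical, reduced stand-in for CrawlReport."""
    repos_seen: int = 0
    repos_with_corpora: int = 0
    skipped: list[str] = field(default_factory=list)


def crawl(
    dids: Iterable[str],
    has_corpus: Callable[[str], bool],
    done: set[str],
    max_repos: Optional[int] = None,
) -> Report:
    """Visit repos once each, logging the bound and every skip."""
    report = Report()
    for did in dids:
        if max_repos is not None and report.repos_seen >= max_repos:
            # The bound is recorded in the report, never applied silently.
            report.skipped.append(f"max_repos={max_repos} reached before {did}")
            break
        if did in done:
            report.skipped.append(f"{did}: already crawled")
            continue
        report.repos_seen += 1
        if has_corpus(did):
            report.repos_with_corpora += 1
        done.add(did)  # per-repo state lets a re-run resume
    return report


finished = {"did:example:1"}
seed = [f"did:example:{n}" for n in (1, 2, 3, 4)]
first = crawl(seed, lambda d: d.endswith("2"), finished, max_repos=2)
second = crawl(seed, lambda d: d.endswith("2"), finished)
```

The second call visits only the repository the bound cut off, since the other three are already recorded as finished.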
update_index ¶
update_index(
index: DiscoveryIndex,
relay: str,
*,
source_endpoint: str | None = None,
limit: int | None = None,
commit_every: int = _DEFAULT_COMMIT_EVERY,
) -> CrawlReport
Tail a relay's firehose, refreshing cards for corpus commits.
Resumes from the stored SyncCursor for the relay, indexes each corpus
create or update, removes the card for each corpus delete (so the local index
does not drift stale), and checkpoints the cursor and commits every
commit_every events. limit bounds the events processed, which makes
a live tail testable. A delete of a corpus that is not indexed is logged in
skipped rather than removed; a removed card is reported in
CrawlReport.cards_removed.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| index | DiscoveryIndex | The index to write into. |
| relay | str | The firehose endpoint. |
| source_endpoint | str or None | The endpoint recorded as each card's provenance; defaults to |
| limit | int or None | A bound on events processed; |
| commit_every | int | How many events to process between cursor checkpoints. |

| RETURNS | DESCRIPTION |
|---|---|
| CrawlReport | Counts and skip reasons for the firehose pass. |
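The resume, limit, and checkpoint behavior can be sketched over an in-memory event list. This is an illustration, not the update_index source: `tail` is a hypothetical helper, events are assumed to be `(seq, payload)` pairs in ascending order, and writing a final checkpoint for a partial batch is an assumption.

```python
from typing import Iterable, Optional


def tail(
    events: Iterable[tuple[int, str]],
    start_seq: int,
    *,
    limit: Optional[int] = None,
    commit_every: int = 3,
) -> tuple[int, list[int]]:
    """Process events after start_seq; return (last_seq, checkpoints)."""
    cursor = start_seq
    checkpoints: list[int] = []
    processed = 0
    for seq, _payload in events:
        if seq <= cursor:
            continue  # already covered by the stored cursor
        if limit is not None and processed >= limit:
            break  # the bound makes a live tail finite and testable
        cursor = seq
        processed += 1
        if processed % commit_every == 0:
            checkpoints.append(cursor)  # checkpoint the cursor and commit
    if processed % commit_every != 0:
        checkpoints.append(cursor)  # assumed: flush the partial batch
    return cursor, checkpoints


stream = [(seq, "commit") for seq in range(1, 11)]
last_seq, marks = tail(stream, 2, limit=5, commit_every=2)
```

Resuming from `last_seq` on the next call skips everything already processed, so a restarted tail does not re-index earlier events.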
Search¶
In-memory search over the discovery index: the primary, dependency-free query path that loads dataset cards, filters them with plain predicates, and ranks the matches.
lairs.discovery.query ¶
In-memory search over the discovery index.
The primary, dependency-free query path: load DatasetCard records and filter
them with plain predicates, then rank. Discovery is at dataset scale (thousands
of corpora), so a linear scan is fast and an index server is unwarranted; the
optional DuckDB accelerator (see accelerator) is only for larger scans.
SearchQuery ¶
Bases: Model
A structured, serializable query over dataset cards.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| text | A case-insensitive substring matched against name and description. |
| domain | A domain slug to match. |
| language | A language tag matched against the primary or listed languages. |
| license | A license identifier to match. |
| min_expressions | Keep cards with at least this many expressions. |
| max_expressions | Keep cards with at most this many expressions. |
| annotation_metric | Keep cards declaring this quality metric. |
| min_annotation_rounds | Keep cards declaring at least this many annotation rounds. |
SearchHit ¶
Bases: Model
A matched dataset card with its ranking score.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| card | The matched card. |
| score | The ranking score; higher ranks first. |
search ¶
search(
cards: Iterable[DatasetCard], query: SearchQuery
) -> list[SearchHit]
Filter and rank dataset cards against a query.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| cards | Iterable[DatasetCard] | The cards to search (for example |
| query | SearchQuery | The query to apply. |

| RETURNS | DESCRIPTION |
|---|---|
| list of SearchHit | The matching cards, ranked by score then name. |
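A linear filter-then-rank pass of this shape is easy to sketch. The example below is illustrative only: `Card`, `score`, and `search_cards` are hypothetical, and the scoring weights are invented for the sketch since the library's scorer is not documented here. Only the case-insensitive substring match and the score-then-name ordering come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """Hypothetical, reduced stand-in for DatasetCard."""
    name: str
    description: str
    expressions: int


def score(card: Card, text: str) -> float:
    """Hypothetical scorer: a name hit outranks a description-only hit."""
    needle = text.casefold()
    in_name = needle in card.name.casefold()
    in_desc = needle in card.description.casefold()
    return 2.0 * in_name + 1.0 * in_desc


def search_cards(
    cards: list[Card], text: str, min_expressions: int = 0
) -> list[tuple[float, Card]]:
    """Filter with plain predicates, then rank by score and name."""
    hits = []
    for card in cards:
        if card.expressions < min_expressions:
            continue
        card_score = score(card, text)
        if card_score > 0:
            hits.append((card_score, card))
    # Higher score first; name ascending breaks ties.
    hits.sort(key=lambda hit: (-hit[0], hit[1].name))
    return hits


catalog = [
    Card("Zulu Treebank", "syntax", 10),
    Card("Alpha", "a treebank of alpha", 50),
    Card("Beta Treebank", "treebank", 5),
    Card("Delta Treebank", "syntax", 20),
    Card("Gamma", "nothing relevant", 100),
]
ranked = [card.name for _, card in search_cards(catalog, "TREEBANK", 10)]
```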
Query accelerator¶
A rebuildable DuckDB pre-filter over the index. Cards are materialized to Parquet and pre-filtered with SQL, then the matching cards are loaded from the index and ranked by the in-memory scorer, so the result is identical to the plain search.
lairs.discovery.accelerator ¶
DuckDB query accelerator for the discovery index.
A rebuildable, derived view over the panproto index: cards are materialized to
Parquet and pre-filtered with DuckDB SQL, then the matching cards are loaded
from the index (the source of truth) and ranked by the in-memory scorer, so the
result is identical to query.search. The DuckDB pre-filter is a relaxation
of the full predicate (it never excludes a true match), and the final
search pass applies the exact predicate and ranking. The Parquet is never
authoritative and can be rebuilt from the index at any time.
This module imports DuckDB and pyarrow at module top; it is reached explicitly
as lairs.discovery.accelerator so that importing lairs does not pull
DuckDB into every process.
materialize_cards ¶
materialize_cards(
index: DiscoveryIndex, out_dir: Path
) -> Path
Write the index's cards to a Parquet view, returning its path.
The view is derived and rebuildable: it is regenerated from index.cards()
on each call and is never the source of truth.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| index | DiscoveryIndex | The index whose cards to materialize. |
| out_dir | Path | The directory to write the Parquet view into. |

| RETURNS | DESCRIPTION |
|---|---|
| Path | The path of the written Parquet file. |
search_accelerated ¶
search_accelerated(
index: DiscoveryIndex,
query: SearchQuery,
*,
out_dir: Path,
) -> list[SearchHit]
Search the index through the DuckDB-accelerated pre-filter.
Materializes the card view, narrows it with a DuckDB SQL pre-filter, loads
the surviving cards from the index, and ranks them with the same scorer as
lairs.discovery.query.search, so the result is identical to an
in-memory search over every card.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| index | DiscoveryIndex | The index to search. |
| query | SearchQuery | The query to apply. |
| out_dir | Path | The directory for the rebuildable Parquet view. |

| RETURNS | DESCRIPTION |
|---|---|
| list of SearchHit | The matching cards, ranked identically to the in-memory search. |
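The reason a relaxed pre-filter cannot change the result is worth seeing concretely: if the pre-filter never rejects a true match, re-applying the exact predicate to the survivors yields the same set as applying it to everything. The sketch below shows this in plain Python rather than DuckDB SQL; `exact`, `prefilter`, and both search helpers are hypothetical.

```python
def exact(card: dict, query: dict) -> bool:
    """The full predicate: license equality plus a text match."""
    text = query["text"].casefold()
    haystack = (card["name"] + " " + card["description"]).casefold()
    return card["license"] == query["license"] and text in haystack


def prefilter(card: dict, query: dict) -> bool:
    """A relaxation: checks the license only, so it keeps every true match."""
    return card["license"] == query["license"]


def search_plain(cards: list[dict], query: dict) -> list[str]:
    """Exact predicate over every card."""
    return sorted(c["name"] for c in cards if exact(c, query))


def search_prefiltered(cards: list[dict], query: dict) -> list[str]:
    """Cheap pre-filter first, exact predicate over the survivors."""
    survivors = [c for c in cards if prefilter(c, query)]
    return sorted(c["name"] for c in survivors if exact(c, query))


cards = [
    {"name": "Alpha", "description": "news treebank", "license": "cc0"},
    {"name": "Beta", "description": "speech", "license": "cc0"},
    {"name": "Gamma", "description": "news treebank", "license": "mit"},
]
query = {"text": "Treebank", "license": "cc0"}
plain = search_plain(cards, query)
accelerated = search_prefiltered(cards, query)
```

The pre-filter may let false positives through ("Beta" here), which is harmless because the exact pass removes them.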
Cards¶
The index record models and the corpus-to-card builder: the DatasetCard
stored per discovered corpus and the crawl report that summarizes an
ingest run.
lairs.discovery.cards ¶
Index record models and the corpus-to-card builder.
The discovery index stores these dx.Model records in a panproto Repository:
a DatasetCard per discovered corpus (a denormalized, searchable summary with
provenance and freshness), a SyncCursor per firehose relay, and a
RepoCrawlState per crawled repository. These are client-side index
bookkeeping under a local lairs.index.* namespace, never published Layers
records and never code-generated.
INDEX_DID module-attribute ¶
The sentinel authority for the local discovery index's records.
CARD_NSID module-attribute ¶
The collection NSID for dataset cards in the local index.
CURSOR_NSID module-attribute ¶
The collection NSID for firehose sync cursors in the local index.
CRAWL_STATE_NSID module-attribute ¶
The collection NSID for per-repo crawl state in the local index.
CardProvenance ¶
Bases: Model
Where a dataset card came from, for trust and refresh.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| source_did | The corpus author's DID. |
| source_endpoint | The PDS or appview base URL the card was read from. |
| discovered_via | How the card entered the index ( |
| source_handle | The author's handle at discovery time, when known. |
CardFreshness ¶
Bases: Model
Firehose and crawl bookkeeping so freshness and resume are first-class.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| first_seen_at | When this corpus first entered the index. |
| last_updated_at | When the card content last changed. |
| last_seen_seq | The last firehose sequence number that touched the corpus. |
| last_seen_rev | The last repository commit revision observed for the corpus. |
| record_cid | The CID of the corpus record at the last refresh. |
DatasetCard ¶
Bases: Model
A searchable, denormalized index entry for one corpus.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| summary | The corpus-derived listing projection. |
| provenance | Where the card came from. |
| freshness | First-seen and last-updated bookkeeping. |
| annotation_rounds | The number of annotation rounds declared, when present. |
| adjudication_method | The adjudication method slug, when present. |
| redundancy_count | The declared annotator redundancy, when present. |
| quality_metrics | The quality-criterion metric slugs declared for the corpus. |
SyncCursor ¶
Bases: Model
A resumable firehose position for one relay.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| relay | The firehose endpoint this cursor is for. |
| seq | The last fully-processed firehose sequence number. |
| updated_at | When the cursor was last written. |
RepoCrawlState ¶
Bases: Model
Per-repository crawl bookkeeping so a re-run skips finished repos.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| did | The crawled repository DID. |
| endpoint | The PDS endpoint the repo was crawled from. |
| has_layers_corpus | Whether the repo carried the corpus collection. |
| corpora_found | The number of corpora indexed from the repo. |
| last_crawled_at | When the repo was last crawled. |
| repos_cursor | A |
CrawlReport ¶
Bases: Model
A summary of a crawl or firehose pass, logging every skip.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| repos_seen | The number of repositories visited. |
| repos_with_corpora | The number of repositories that held a corpus collection. |
| cards_built | The number of cards built or refreshed. |
| cards_unchanged | The number of cards that were already current (dedup hits). |
| cards_removed | The number of cards removed in response to a corpus-deletion commit. |
| skipped | Human-readable skip reasons, including any bound that was hit. |
| revision | The commit revision the pass produced, when any. |
card_uri ¶
Build the deterministic index AT-URI for a corpus's card.
The same corpus always maps to the same card key, so re-indexing is idempotent and content-addressed dedup falls out for free.
| PARAMETER | DESCRIPTION |
|---|---|
| corpus_uri | The corpus AT-URI. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The card's index AT-URI under the local |
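To see why a deterministic key makes re-indexing idempotent, consider a key derived purely from the corpus URI. The sketch below is not the library's scheme: the authority and collection values are placeholders (the real INDEX_DID and CARD_NSID values are not shown on this page), and hashing the URI into a record key is an assumption chosen for illustration.

```python
import hashlib

# Placeholder values for illustration; NOT the library's INDEX_DID / CARD_NSID.
SKETCH_AUTHORITY = "did:example:local-index"
SKETCH_COLLECTION = "example.index.card"


def sketch_card_uri(corpus_uri: str) -> str:
    """Derive a stable card key from the corpus URI alone."""
    digest = hashlib.sha256(corpus_uri.encode("utf-8")).hexdigest()
    record_key = digest[:24]  # shortened for readability
    return f"at://{SKETCH_AUTHORITY}/{SKETCH_COLLECTION}/{record_key}"


store: dict[str, dict] = {}
corpus = "at://did:example:alice/pub.layers.corpus.corpus/abc"

# Indexing the same corpus twice writes to the same key.
store[sketch_card_uri(corpus)] = {"name": "Treebank"}
store[sketch_card_uri(corpus)] = {"name": "Treebank"}
```

Because the key depends on nothing but the corpus URI, a second crawl overwrites the same slot instead of creating a duplicate card.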
card_from_corpus ¶
card_from_corpus(
corpus_uri: str,
corpus: Corpus,
*,
provenance: CardProvenance,
freshness: CardFreshness,
) -> DatasetCard
Build a DatasetCard from a discovered corpus and its provenance.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| corpus_uri | str | The corpus AT-URI. |
| corpus | Corpus | The discovered corpus record. |
| provenance | CardProvenance | Where the corpus was discovered. |
| freshness | CardFreshness | First-seen and last-updated bookkeeping for the card. |

| RETURNS | DESCRIPTION |
|---|---|
| DatasetCard | The denormalized index card. |
Result models¶
The value types for discovery results: a denormalized corpus summary, a repository table of contents, a collection count, and a facet filter.
lairs.discovery.models ¶
Discovery result models.
dx.Model value types for dataset discovery: a denormalized corpus summary, a
repository table of contents, and a facet filter. These are the shapes the
discovery API returns and the CLI renders. DatasetSummary is also reused as
the corpus-derived core of the Tier 3 index card.
DatasetSummary ¶
Bases: Model
A denormalized corpus card for discovery listings.
A flat, readable projection of a pub.layers.corpus.corpus record plus the
actor and source it was found through, so a listing renders one row per
dataset without dumping records.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| uri | The corpus AT-URI. |
| did | The owning repository DID. |
| name | The corpus name. |
| handle | The owning handle, when it was resolved. |
| description | The corpus description. |
| domain | The corpus domain slug. |
| domain_uri | The AT-URI of the corpus domain definition node. |
| language | The primary BCP-47 language tag. |
| languages | All languages represented in the corpus. |
| license | The license identifier. |
| version | The corpus version label. |
| expression_count | The number of expressions in the corpus. |
| created_at | The ISO 8601 creation timestamp. |
| ontology_refs | The ontology AT-URIs the corpus uses. |
| eprint_refs | The eprint AT-URIs the corpus links. |
| has_adjudication | Whether the corpus declares an adjudication step. |
| source_endpoint | The PDS or appview the summary was read from. |
CollectionCount ¶
Bases: Model
A repository collection NSID with an optional record count.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| nsid | The collection NSID. |
| count | The number of records in the collection, when counted. |
| is_dataset_like | Whether the collection holds dataset-shaped records. |
RepoTableOfContents ¶
Bases: Model
An actor's repository inventory: identity plus per-collection counts.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| did | The repository DID. |
| handle | The repository handle, when known. |
| pds_endpoint | The PDS endpoint the inventory was read from. |
| collections | The collections present in the repository. |
| dataset_collections | The dataset-like collection NSIDs, highlighted for convenience. |
DatasetFilter ¶
Bases: Model
A facet and text filter over dataset summaries.
Server-side facets (language, domain) are pushed into listCorpora
parameters on the appview path; the rest are applied client-side over the
mapped summaries.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| language | Keep corpora whose primary or listed languages include this tag. |
| domain | Keep corpora with this domain slug. |
| license | Keep corpora with this license identifier. |
| min_expression_count | Keep corpora with at least this many expressions. |
| max_expression_count | Keep corpora with at most this many expressions. |
| text | Keep corpora whose name or description contains this substring. |
| has_adjudication | Keep corpora that do (or do not) declare an adjudication step. |
Summaries¶
The corpus-to-summary projection and dataset filtering: projects a
generated Corpus record into the flat summary shape, evaluates a filter
over a summary, and extracts the server-side facets the appview supports.
lairs.discovery.summary ¶
Corpus-to-summary mapping and dataset filtering.
Projects a generated Corpus record (or a record envelope carrying one) into
the flat DatasetSummary discovery shape, evaluates a DatasetFilter over a
summary, and extracts the server-side facets listCorpora supports.
summary_from_corpus ¶
summary_from_corpus(
corpus: Corpus,
*,
uri: str,
did: str,
handle: str | None = None,
source_endpoint: str | None = None,
) -> DatasetSummary
Project a corpus record into a DatasetSummary.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| corpus | Corpus | The corpus record to project. |
| uri | str | The corpus AT-URI. |
| did | str | The owning repository DID. |
| handle | str or None | The owning handle, when known. |
| source_endpoint | str or None | The PDS or appview the corpus was read from. |

| RETURNS | DESCRIPTION |
|---|---|
| DatasetSummary | The flat discovery summary. |
corpus_from_value ¶
Decode a record value into a Corpus, or None on failure.
The wire-only $type discriminator is dropped before validation, since
the generated models do not declare it.
| PARAMETER | DESCRIPTION |
|---|---|
| value | The record value to decode. |

| RETURNS | DESCRIPTION |
|---|---|
| Corpus or None | The decoded corpus, or |
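Dropping the wire-only discriminator before validation, and returning None instead of raising, can be sketched with a plain dataclass. `MiniCorpus` and `decode` are hypothetical stand-ins for the generated Corpus model and corpus_from_value; the real model's fields and validation errors will differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class MiniCorpus:
    """Hypothetical stand-in for the generated Corpus model."""
    name: str
    language: str = "und"


def decode(value: Any) -> Optional[MiniCorpus]:
    """Strip "$type", then build the model; None on any mismatch."""
    if not isinstance(value, Mapping):
        return None
    # The model does not declare "$type", so passing it through would fail.
    payload = {key: val for key, val in value.items() if key != "$type"}
    try:
        return MiniCorpus(**payload)
    except TypeError:
        # Missing required fields or unexpected keys: treat as undecodable.
        return None


good = decode({"$type": "pub.layers.corpus.corpus", "name": "Treebank"})
missing_name = decode({"$type": "pub.layers.corpus.corpus"})
unknown_field = decode({"name": "Treebank", "bogus": 1})
not_a_mapping = decode(None)
```

Returning None rather than raising is what lets a listing skip one malformed record and continue.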
summary_from_envelope ¶
summary_from_envelope(
envelope: RecordEnvelope,
*,
did: str | None = None,
handle: str | None = None,
source_endpoint: str | None = None,
) -> DatasetSummary | None
Decode a corpus envelope into a DatasetSummary.
Returns None when the envelope is not a corpus record or its value does
not validate, so a single bad record does not abort a listing.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| envelope | RecordEnvelope | The record envelope to decode. |
| did | str or None | The owning DID; derived from the envelope URI when omitted. |
| handle | str or None | The owning handle, when known. |
| source_endpoint | str or None | The PDS or appview the envelope was read from. |

| RETURNS | DESCRIPTION |
|---|---|
| DatasetSummary or None | The summary, or |
matches ¶
matches(
summary: DatasetSummary, flt: DatasetFilter | None
) -> bool
Return whether a summary satisfies a filter.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| summary | DatasetSummary | The summary to test. |
| flt | DatasetFilter or None | The filter; |

| RETURNS | DESCRIPTION |
|---|---|
| bool | Whether the summary satisfies the filter. |
listcorpora_params ¶
listcorpora_params(
repo: str, flt: DatasetFilter | None
) -> QueryParams
Build listCorpora query parameters, pushing the server-side facets.
Only language and domain are server-side facets on listCorpora;
every other facet is applied client-side over the mapped summaries.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| repo | str | The repository DID or handle to list. |
| flt | DatasetFilter or None | The filter whose server-side facets to push. |

| RETURNS | DESCRIPTION |
|---|---|
| QueryParams | The query parameters for |
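The server-side/client-side split can be sketched as a function that forwards only the two pushable facets. This is an illustration under assumptions: `Filter` and `server_params` are hypothetical, and the wire parameter names ("repo", "language", "domain") are guesses rather than the documented listCorpora parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Filter:
    """Hypothetical, reduced stand-in for DatasetFilter."""
    language: Optional[str] = None
    domain: Optional[str] = None
    license: Optional[str] = None
    text: Optional[str] = None


def server_params(repo: str, flt: Optional[Filter]) -> dict[str, str]:
    """Forward only language and domain; other facets stay client-side."""
    params = {"repo": repo}
    if flt is None:
        return params
    if flt.language is not None:
        params["language"] = flt.language
    if flt.domain is not None:
        params["domain"] = flt.domain
    return params


pushed = server_params(
    "did:example:alice",
    Filter(language="en", license="cc0", text="treebank"),
)
unfiltered = server_params("did:example:alice", None)
```

The license and text facets are absent from `pushed`, so a caller would still apply them over the mapped summaries after the listing returns.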