Evaluation

What Origin's published retrieval numbers mean, how they are generated, and what they do not claim.

Qi-Xuan Lu · Updated Jun 1, 2026 · 6 min read

At a glance

The public numbers are retrieval metrics, not end-to-end answer quality claims.

The current README snapshot reports 93.6% Recall@5 on LongMemEval oracle and 70.0% Recall@5 on LoCoMo locomo10.

Current snapshot

Origin's README publishes a compact retrieval snapshot for the shipped hybrid retrieval path. The snapshot uses BGE-Base-EN-v1.5-Q embeddings, FTS5 full-text search, Reciprocal Rank Fusion (RRF) to merge the rankings, and, when enabled, the latest shipped local rerank path.
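For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal Python sketch of the general technique, not Origin's implementation. The function name, the example memory IDs, and the constant k = 60 (the value commonly used in the RRF literature) are assumptions for illustration; the daemon's actual constant and tie-breaking rules are not stated on this page.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists into one list using RRF.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1. k = 60 is an assumed, conventional default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked outputs from an embedding search and a keyword search.
vector_hits = ["m1", "m2", "m3"]
keyword_hits = ["m3", "m1", "m4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Items that rank well in both lists (here m1 and m3) rise above items that appear in only one, which is the property that makes RRF a common choice for combining lexical and embedding search without calibrating their raw scores.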

The numbers below are the public README snapshot from the daemon repository. They should be read as retrieval-only metrics over the stated fixtures.

README snapshot

Benchmark                         Recall@5   MRR     NDCG@10
LongMemEval oracle, 500 Q          93.6%     0.857   0.883
LoCoMo locomo10                    70.0%     0.647   0.684

What is measured

The headline metrics are Recall@5, MRR, and NDCG@10. They measure whether the retrieval layer surfaces relevant context near the top of the result set.

This is useful because Origin's job is to bring the right memories, pages, decisions, and graph context into the next agent session. It does not prove that a downstream model will always answer correctly.
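The three headline metrics can be sketched for a single query as follows. This is a minimal illustration using standard textbook definitions with binary relevance, not the harness code. It assumes result IDs are unique within a ranking, and it uses the fractional form of recall; whether Origin's harness uses fractional or any-hit recall is not stated on this page.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A benchmark-level figure is the mean of the per-query value across all questions in the fixture; MRR is by definition the mean of the per-query reciprocal rank.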

What is not measured

The published table is not a full product-quality score, a guarantee of answer quality, or a latency benchmark. It is also not a cross-product claim unless the comparison page explicitly states the protocol.

Single-run snapshots are useful for guiding development. Any public claim of improvement should be regenerated under the current schema, and headline claims should also be backed by repeated runs or a documented methodology.

Where the harness lives

The eval harness lives in crates/origin-core/src/eval. The workflow docs live in docs/eval in the Origin repository.

Slow GPU-backed or API-backed evals are run manually. Normal CI skips heavy model benchmarks because hosted runners do not provide the right hardware, secrets, or cost profile.

LoCoMo and LongMemEval use Recall@5, MRR, and NDCG@10 headline fields.
The README updater reads a local metrics JSON and writes the tracked README snapshot.
Raw local baseline artifacts stay outside git; the repository keeps the methodology and curated snapshot.
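To make the README update step concrete, here is a rough Python sketch of turning a local metrics JSON into snapshot table rows. The JSON layout (a `benchmarks` array with `name`, `recall_at_5`, `mrr`, and `ndcg_at_10` keys) and the function name are hypothetical; the real updater's schema and output format are defined in the Origin repository and may differ.

```python
import json


def render_snapshot(metrics_json: str) -> str:
    """Render a plain-text snapshot table from a metrics JSON string.

    The schema used here is an assumption made for illustration only.
    """
    data = json.loads(metrics_json)
    rows = [f"{'Benchmark':<34}{'Recall@5':<11}{'MRR':<8}NDCG@10"]
    for bench in data["benchmarks"]:
        # Recall is stored as a fraction and shown as a percentage.
        recall = f"{bench['recall_at_5'] * 100:.1f}%"
        rows.append(
            f"{bench['name']:<34}{recall:<11}"
            f"{bench['mrr']:<8.3f}{bench['ndcg_at_10']:.3f}"
        )
    return "\n".join(rows)
```

Keeping the rendering step a pure function of the metrics file means the tracked snapshot can be regenerated from the curated numbers without committing the raw baseline artifacts.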

How to rerun

Clone the repository, follow the eval docs, and run the appropriate ignored eval harness command for the benchmark you want to reproduce. Expect slow runs when local models or judges are involved.

When comparing two retrieval modes, regenerate both sides under the same schema and fixture revision. Cross-schema comparisons are treated as invalid by the repository's eval discipline.
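The comparison rule above can be expressed as a small guard. The sketch below is illustrative only: the `EvalRun` type, its field names (`schema_version`, `fixture_revision`), and the function name are hypothetical, not identifiers from the Origin harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical record of one eval run; field names are assumptions."""
    mode: str
    schema_version: str
    fixture_revision: str
    recall_at_5: float


def recall_delta(baseline: EvalRun, candidate: EvalRun) -> float:
    """Return candidate minus baseline Recall@5, refusing mismatched runs."""
    if baseline.schema_version != candidate.schema_version:
        raise ValueError("cross-schema comparison is invalid; regenerate both runs")
    if baseline.fixture_revision != candidate.fixture_revision:
        raise ValueError("fixture revisions differ; regenerate both runs")
    return candidate.recall_at_5 - baseline.recall_at_5
```

Failing loudly on a mismatch, rather than printing a delta with a warning, keeps an invalid comparison from ending up in a changelog or README by accident.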

