Evaluation
What Origin's published retrieval numbers mean, how they are generated, and what they do not claim.
At a glance
01
The public numbers are retrieval metrics, not end-to-end answer quality claims.
02
The current README snapshot reports 93.6% Recall@5 on LongMemEval oracle and 70.0% Recall@5 on LoCoMo locomo10.
01
Current snapshot
Origin's README publishes a compact retrieval snapshot for the shipped hybrid retrieval path. The snapshot uses BGE-Base-EN-v1.5-Q embeddings, FTS5 full-text search, Reciprocal Rank Fusion, and, when enabled, the most recently shipped local rerank path.
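As a rough illustration of how Reciprocal Rank Fusion combines an embedding ranking with a keyword ranking, here is a minimal sketch. It is not Origin's implementation: the function name, the memory IDs, and the constant k=60 (a common default for this technique) are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into a single ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["m3", "m1", "m7"]   # hypothetical embedding-search order
keyword_hits = ["m1", "m9", "m3"]  # hypothetical FTS5 order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Items that rank well in both lists rise to the top, and items found by only one retriever still stay in the fused result.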
The numbers below are the public README snapshot from the daemon repository. They should be read as retrieval-only metrics over the stated fixtures.
README snapshot
| Benchmark | Recall@5 | MRR | NDCG@10 |
| --- | --- | --- | --- |
| LongMemEval oracle, 500 Q | 93.6% | 0.857 | 0.883 |
| LoCoMo locomo10 | 70.0% | 0.647 | 0.68402 |
02
What is measured
The headline metrics are Recall@5, MRR, and NDCG@10. They measure whether the retrieval layer surfaces relevant context near the top of the result set.
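For readers unfamiliar with these metrics, the sketch below computes them for a single query using binary relevance. It assumes Recall@5 is the fraction of relevant items found in the top 5; some harnesses instead count a hit when any relevant item appears, and this page does not say which definition Origin's harness uses. All function names and sample IDs are illustrative.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values):
    """MRR is the mean of reciprocal_rank over all queries."""
    return sum(values) / len(values) if values else 0.0


ranked = ["a", "b", "c", "d", "e", "f"]  # hypothetical retrieval order
relevant = {"b", "f"}                    # hypothetical gold labels
print(recall_at_k(ranked, relevant))     # one of two relevant items is in the top 5
print(reciprocal_rank(ranked, relevant)) # first relevant item is at rank 2
print(ndcg_at_k(ranked, relevant))
```

Each benchmark-level figure is then the mean of the per-query value across all questions in the fixture.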
This matters because Origin's job is to bring the right memories, pages, decisions, and graph context into the next agent session. Strong retrieval scores do not prove that a downstream model will always answer correctly.
03
What is not measured
The published table is not a full product-quality score, a guarantee of answer quality, or a latency benchmark. It is also not a cross-product claim unless the comparison page explicitly states the protocol.
Single-run snapshots are useful for development direction. Public claims that compare improvements should be regenerated under the current schema and, for headline claims, backed by repeated runs or documented methodology.
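One simple way to back a headline claim with repeated runs is to report the mean and sample standard deviation of the metric across runs. The sketch below assumes nothing about Origin's harness; the function name is hypothetical and the input values are placeholders, not measured results.

```python
import statistics


def summarize_runs(values):
    """Summarize one metric measured across repeated eval runs."""
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"runs": len(values), "mean": statistics.mean(values), "stdev": stdev}


# Placeholder Recall@5 values from three hypothetical runs.
summary = summarize_runs([0.70, 0.72, 0.68])
print(summary)
```

An improvement smaller than the run-to-run spread is hard to distinguish from noise, which is why a single run is best treated as a direction rather than a claim.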
04
Where the harness lives
The eval harness lives in crates/origin-core/src/eval. The workflow docs live in docs/eval in the Origin repository.
Slow GPU-backed or API-backed evals are run manually. Normal CI skips heavy model benchmarks because hosted runners do not provide the right hardware, secrets, or cost profile.
- LoCoMo and LongMemEval use Recall@5, MRR, and NDCG@10 headline fields.
- The README updater reads a local metrics JSON and writes the tracked README snapshot.
- Raw local baseline artifacts stay outside git; the repository keeps the methodology and curated snapshot.
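The updater step in the list above can be pictured as a small transform from metrics JSON to a table row. The sketch below is illustrative only: the JSON field names and the function are assumptions, not the repository's actual schema or code. The sample values are the LongMemEval row from the README snapshot.

```python
import json

# Hypothetical metrics file contents; the field names are assumptions.
METRICS_JSON = """
{
  "benchmark": "LongMemEval oracle, 500 Q",
  "recall_at_5": 0.936,
  "mrr": 0.857,
  "ndcg_at_10": 0.883
}
"""


def format_snapshot_row(metrics):
    """Render one benchmark's metrics as a pipe-separated snapshot row."""
    recall = f"{metrics['recall_at_5'] * 100:.1f}%"
    return (
        f"{metrics['benchmark']} | {recall} | "
        f"{metrics['mrr']:.3f} | {metrics['ndcg_at_10']:.3f}"
    )


row = format_snapshot_row(json.loads(METRICS_JSON))
print(row)
```

Keeping the formatting in one function means the tracked snapshot always uses the same precision for every benchmark.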
05
How to rerun
Clone the repository, follow the eval docs, and run the appropriate ignored eval harness command for the benchmark you want to reproduce. Expect slow runs when local models or judges are involved.
When comparing two retrieval modes, regenerate both sides under the same schema and fixture revision. Cross-schema comparisons are treated as invalid by the repository's eval discipline.
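The same-schema rule can be enforced mechanically before any delta is reported. The following sketch shows one way to do that; the record type, its field names, and the sample values are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical record of one eval run; all field names are assumptions."""
    mode: str
    schema_version: str
    fixture_revision: str
    recall_at_5: float


def recall_delta(baseline, candidate):
    """Return candidate minus baseline Recall@5, refusing mismatched runs."""
    same_schema = baseline.schema_version == candidate.schema_version
    same_fixture = baseline.fixture_revision == candidate.fixture_revision
    if not (same_schema and same_fixture):
        raise ValueError("runs differ in schema or fixture revision; regenerate both")
    return candidate.recall_at_5 - baseline.recall_at_5


# Placeholder runs; the values are illustrative, not measured results.
base = EvalRun("fusion-only", "v2", "rev-a", 0.60)
cand = EvalRun("fusion+rerank", "v2", "rev-a", 0.65)
stale = EvalRun("fusion+rerank", "v1", "rev-a", 0.65)
print(round(recall_delta(base, cand), 2))
```

Raising an error instead of returning a number makes it hard to publish a cross-schema comparison by accident.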