Public Benchmark Suite
Overview
The prx benchmark suite measures search quality (NDCG@10) across 200 labeled queries on 8 public repositories. It’s designed to be reproducible, honest, and runnable by anyone.
- 200 queries across 8 repos
- 6 languages: Python, Rust, TypeScript, Java, Go
- 3 size tiers: small (< 500 files), medium (500-10K files), large (> 10K files)
- All repos pinned by commit SHA
- Ground truth in
benchmarks/repos/
Running the benchmark
# Run against the standard dataset
prx bench-ndcg benchmarks/dataset.json
# Human-readable output
prx bench-ndcg benchmarks/dataset.json --plain
The benchmark loads the index once and runs all queries against cached data. A 50-query suite runs in 0.23 seconds.
Dataset format
The dataset is a JSON file with labeled queries:
{
"repo": "pallets/flask",
"commit": "abc123...",
"queries": [
{
"query": "request context handling",
"relevant_files": [
"src/flask/ctx.py",
"src/flask/globals.py"
],
"query_type": "semantic"
}
]
}
Each query has a set of ground-truth relevant files. NDCG@10 measures how well prx ranks those files in the top 10 results.
Interpreting results
The output reports NDCG@10 per repo and overall, broken down by search mode:
{
"repo": "flask",
"queries": 25,
"ndcg10": 0.710,
"symbol_ndcg10": 0.805,
"semantic_ndcg10": 0.662,
"misses": 0
}
ndcg10: hybrid search (the default)symbol_ndcg10: literal/symbol search onlysemantic_ndcg10: semantic search onlymisses: queries where no relevant file appeared in the top 10
A miss means the relevant file wasn’t in the top 10 at all. Misses are the most actionable signal for improving search quality.
v0.5.7 results
| Repo | Language | Size | Files | NDCG@10 | Misses |
|---|---|---|---|---|---|
| Flask | Python | small | 259 | 0.710 | 0 |
| ripgrep | Rust | small | 239 | 0.493 | 4 |
| fastify | TypeScript | small | 417 | 0.432 | 5 |
| cargo | Rust | medium | 2,815 | 0.379 | 7 |
| kafka | Java | medium | 7,231 | 0.354 | 11 |
| django | Python | medium | 5,690 | 0.262 | 9 |
| terraform | Go | large | 5,323 | 0.287 | 9 |
| vscode | TypeScript | large | 14,643 | 0.208 | 16 |
Overall average: 0.391. Symbol search average: 0.681.
CI regression gate
The benchmark suite runs in CI on every release. A regression in NDCG@10 of more than 0.02 on any repo blocks the release.
To run the CI check locally:
prx bench-ndcg benchmarks/dataset.json --threshold 0.02
Returns exit code 0 if no regression, exit code 1 if any repo regressed beyond the threshold.
Adding queries
To add queries to the dataset, add entries to the relevant repo’s query list in benchmarks/repos/<repo>/queries.json. Each query needs:
- A natural language query string
- A list of ground-truth relevant files (relative paths)
- A query type (
semantic,symbol, orarchitecture)
Ground truth is determined by human judgment: which files would a developer actually want to find for this query?
See also: Search Quality, Indexing Performance