Search Quality

What NDCG@10 means

NDCG (Normalized Discounted Cumulative Gain) at rank 10 measures how well a search system ranks relevant results in the top 10 positions. A score of 1.0 means the top 10 matches the ideal ordering, with the most relevant results first. A score of 0.0 means no relevant results appear in the top 10.

For code search, a query like “authentication middleware” has a set of ground-truth relevant files. NDCG@10 measures whether those files appear near the top of prx’s results.

The metric is standard in information retrieval research. It discounts each result's contribution by its rank, so a relevant file at position 8 earns less credit than the same file at position 1.
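
The rank discount can be made concrete with a short sketch. The Python below assumes the common formulation with linear gain and a log2 rank discount; the source does not say which NDCG variant the prx benchmark uses, and the function names and file paths here are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (log2 rank discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: result ids in the order the search system returned them.
    relevance:  dict mapping each ground-truth id to its relevance grade.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal


# Hypothetical ground truth for the query "authentication middleware".
truth = {"auth/middleware.py": 1.0, "auth/session.py": 1.0}

perfect = ndcg_at_k(["auth/middleware.py", "auth/session.py", "README.md"], truth)
slipped = ndcg_at_k(["auth/middleware.py", "README.md", "auth/session.py"], truth)
missing = ndcg_at_k(["README.md", "setup.py"], truth)
```

With this formulation, moving the second relevant file from position 2 to position 3 lowers the score from 1.0 to roughly 0.92, and a result list with no relevant files scores 0.0.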

Benchmark results (v0.5.7)

The benchmark uses 200 labeled queries across 8 public repositories, 6 languages, and 3 size tiers. All repos are pinned by commit SHA. Ground truth is in benchmarks/repos/.

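| Repo | Language | Files | NDCG@10 | Symbol | Semantic |
|---|---|---|---|---|---|
| Flask | Python | 259 | 0.710 | 0.805 | 0.662 |
| ripgrep | Rust | 239 | 0.493 | 0.810 | 0.356 |
| fastify | TypeScript | 417 | 0.432 | 0.822 | 0.321 |
| cargo | Rust | 2,815 | 0.379 | 0.705 | 0.285 |
| kafka | Java | 7,231 | 0.354 | 0.934 | 0.191 |
| django | Python | 5,690 | 0.262 | 0.495 | 0.211 |
| terraform | Go | 5,323 | 0.287 | 0.238 | 0.319 |
| vscode | TypeScript | 14,643 | 0.208 | 0.639 | 0.080 |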

Summary by size tier:

| Tier | Avg NDCG@10 |
|---|---|
| Small (< 500 files) | 0.545 |
| Medium (500-10K files) | 0.332 |
| Large (> 10K files) | 0.248 |
| Overall | 0.391 |
| Symbol search avg | 0.681 |
| Semantic search avg | 0.303 |
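
The last three summary rows are consistent with an unweighted mean over the eight per-repo rows. The sketch below reproduces that arithmetic from the published figures; treating every repo equally is an assumption about the aggregation, and the tier rows are not recomputed here because the source does not say how they are weighted.

```python
from statistics import mean

# (repo, NDCG@10, symbol, semantic), copied from the per-repo results above.
ROWS = [
    ("Flask", 0.710, 0.805, 0.662),
    ("ripgrep", 0.493, 0.810, 0.356),
    ("fastify", 0.432, 0.822, 0.321),
    ("cargo", 0.379, 0.705, 0.285),
    ("kafka", 0.354, 0.934, 0.191),
    ("django", 0.262, 0.495, 0.211),
    ("terraform", 0.287, 0.238, 0.319),
    ("vscode", 0.208, 0.639, 0.080),
]


def column_means(rows):
    """Unweighted per-repo means, rounded to three decimals like the summary."""
    overall = mean(r[1] for r in rows)
    symbol = mean(r[2] for r in rows)
    semantic = mean(r[3] for r in rows)
    return round(overall, 3), round(symbol, 3), round(semantic, 3)


overall_avg, symbol_avg, semantic_avg = column_means(ROWS)
```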

Symbol vs semantic analysis

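Symbol search is strong on average (0.681) and does not degrade systematically with codebase size: scores range from 0.238 on terraform to 0.934 on kafka, and six of the eight repos score above 0.6. When you search for a known identifier, function name, or type name, prx usually finds it.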

Semantic search degrades at scale. The embedded 32M-parameter model (potion-retrieval-32M) works well on codebases under ~3K files. On larger codebases, the embedding space becomes crowded and relevance scores compress, so relevant and irrelevant files become harder to separate. The vscode semantic score (0.080) shows this limitation most clearly.

The hybrid search combines both: symbol search anchors precision, semantic search adds recall for natural language queries. The combined NDCG@10 is consistently better than either alone.
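
One generic way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is only an illustration of the idea of combining symbol and semantic results; the source does not describe how prx fuses them, so the technique, the constant k=60, and the file paths are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into a single ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    that rank well in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "authentication middleware".
symbol_hits = ["auth/middleware.py", "auth/tokens.py"]
semantic_hits = ["docs/login.md", "auth/middleware.py", "auth/session.py"]

fused = reciprocal_rank_fusion([symbol_hits, semantic_hits])
```

In this example the file found by both searches ranks first, while files found by only one search keep their relative order behind it.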

Known limitations

Semantic search at scale. The embedded 32M-parameter model is optimized for speed and binary size, not maximum retrieval quality. On codebases with 10K+ files, semantic search quality drops significantly. For large repos, use --literal for known identifiers and rely on symbol search.

Architecture queries on large repos. The architecture_ndcg10 scores in the benchmark data show 0.000 for kafka, django, and vscode. High-level architectural queries (“where is the plugin system?”) are hard for any embedding model on large codebases.

Import graph coverage. Import extraction covers 10 language families via tree-sitter AST queries. Languages outside this set don’t get proximity boosting. The graph is also a best-effort extraction: dynamic imports, conditional imports, and generated code may not be captured.
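
The best-effort caveat can be illustrated with static import extraction on a Python file. The sketch below uses Python's standard ast module as a stand-in for tree-sitter queries; it is not prx's extractor, and the sample module names are hypothetical.

```python
import ast


def static_imports(source):
    """Return the module names that appear in import statements."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


SAMPLE = '''
import os
from app.auth import middleware

try:
    import ujson as json
except ImportError:
    import json

import importlib
plugin = importlib.import_module("plugins." + os.environ.get("PLUGIN", "default"))
'''

found = static_imports(SAMPLE)
```

The static pass sees both branches of the try/except without knowing which one runs, and it never sees the plugin module, because that name is only built at runtime.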

Planned improvements

Code-specific model tiers are planned for v0.6.0. A larger model (or a model fine-tuned on code) would improve semantic search quality on large codebases without changing the binary’s offline/no-server design.

These are honest numbers on codebases we didn’t write and don’t tune for. The benchmark dataset and methodology are public so you can verify them independently.

See also: Public Benchmark Suite, Indexing Performance