agent-bench
LIVE · FASTAPI + K8S CORPORA · 3 PROVIDERS

Production RAG, benchmarked honestly — including the model-size floor where agentic retrieval breaks down.

A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.

Built by Jane Yeung · Munich · Open to AI/ML roles in Germany

API models
1.00
OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5, 27/27 correct citations.
Self-hosted · 7B
0.14
Mistral-7B on 8K context — agentic retrieval can't recover from a weak first pass.
R@5 0.83–0.86 across 4 configs
27 FastAPI + 6 K8s questions
2 corpora · FastAPI · Kubernetes
6.6× cost delta · custom vs LangChain (Anthropic)
Try the demo · Source on GitHub

Live pipeline

Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.
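For orientation, here is a minimal sketch of how a staged run like this can be traced. The stage names mirror the schematic below; the function names and signatures are illustrative assumptions, not the project's actual API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageTrace:
    name: str
    latency_ms: float


@dataclass
class PipelineRun:
    stages: list = field(default_factory=list)

    def run_stage(self, name, fn, *args, **kwargs):
        # Time each stage so the trace can report per-stage latency.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.stages.append(StageTrace(name, (time.perf_counter() - start) * 1000))
        return result


# Hypothetical usage, one call per stage in the schematic:
# run = PipelineRun()
# ok      = run.run_stage("injection_check", check_injection, question)
# chunks  = run.run_stage("retrieval", hybrid_retrieve, question, top_k=20)
# top5    = run.run_stage("reranking", rerank, question, chunks, top_n=5)
# answer  = run.run_stage("llm_synthesis", tool_call_loop, question, top5, max_iter=3)
# verdict = run.run_stage("output_validation", validate_output, answer)
```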

Provider
Corpus
running on OpenAI · FastAPI corpus
session · local-dev demo data · open live demo ↗ · idle
Pick an example chip above — or type a question. Press Enter to send.
Pipeline idle · schematic
injection_check
regex + classifier, tiered
~3ms
retrieval
FAISS + BM25 + RRF, top-20
~40ms
reranking
cross-encoder, top-5
~60ms
llm_synthesis
tool-calling loop · max 3 iter
~800ms
output_validation
post-stream · monitored, not gated
~12ms
latency · tokens · cost
Retrieval waiting
The top-5 reranked chunks land here, with RRF-normalized scores.
Security · 3 layers
Mapped against the OWASP LLM Top 10 (2025) — named residual risks for LLM01, scope limits for LLM02 → SECURITY.md ↗
Injection
regex + classifier
PII redact
context only
Output
monitored
Try a guardrail
5 of 10 OWASP demoable · 3 infrastructure-layer · 2 out of scope · SECURITY.md has the full mapping

Three findings

27 FastAPI + 6 K8s · custom + langchain · 3 providers
01 / orchestration

Retrieval dominates orchestration.

custom · oai 0.83
langchain · oai 0.86
custom · anth 0.84
langchain · anth 0.84
max spread 0.03

R@5 spans only 0.03 across all four orchestrator × provider configs ({custom, LangChain} × {OpenAI, Anthropic}), all running the same retrieval stack. The orchestration layer is interchangeable; FAISS + BM25 + RRF + cross-encoder is what matters.
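
As a reference point, here is a minimal sketch of the fusion step that carries most of that result: reciprocal-rank fusion over the FAISS and BM25 rankings. The function name and the constant k=60 are illustrative assumptions, not the project's code.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs into one.

    rankings: list of ranked ID lists, e.g. [faiss_ids, bm25_ids].
    Each chunk scores 1 / (k + rank) per list it appears in; the
    constant k damps the influence of any single ranker.
    """
    scores = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]


# Hypothetical usage: fuse the dense and sparse rankings, then hand the
# top-20 candidates to the cross-encoder reranker, which keeps the top 5.
# candidates = reciprocal_rank_fusion([faiss_ranking, bm25_ranking], top_n=20)
```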

comparison_custom_vs_langchain.md ↗
02 / cost

LangChain's Anthropic adapter carries a 6.6× cost tax.

custom $0.0007
langchain $0.0046

Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter, which re-sends an extra system prompt and the tool schemas on every iteration of the loop.
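
A back-of-the-envelope model shows why per-iteration prompt overhead multiplies directly into per-question cost. The numbers in the comment are placeholders for shape only, not measured values.

```python
def run_cost(per_iter_prompt_toks, per_iter_output_toks, price_in, price_out):
    """Total cost of a tool-calling run: sum over iterations of
    prompt tokens * input price plus output tokens * output price."""
    prompt_total = sum(per_iter_prompt_toks)
    output_total = sum(per_iter_output_toks)
    return prompt_total * price_in + output_total * price_out


# Placeholder illustration: with three iterations, every extra block the
# adapter rebuilds into the prompt (verbose system preamble, full tool
# schemas) is billed three times, so a modest per-iteration overhead can
# easily produce a several-fold gap at identical model and retrieval.
```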

docs/provider_comparison.md ↗
03 / model-size floor

There's a model-size floor for agentic retrieval — and a 7B model falls off it.

gpt-4o-mini 1.00
haiku-4-5 1.00
mistral-7B · citation 0.14
mistral-7B · R@5 0.05
Three of the four bars are citation accuracy. The rightmost shows Mistral-7B's R@5 (0.05) on the same axis — both retrieval and citation collapse together.

Not because the model is bad: the 8K context window forces top_k=3 and a single retrieval iteration, so the pipeline can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability.
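
A rough budget makes the constraint concrete. All token counts below are assumptions for illustration, not measured values.

```python
def max_chunks(context_window, prompt_overhead, chunk_toks, reserve_for_answer):
    """How many retrieved chunks fit once the fixed prompt and the
    answer budget are set aside."""
    return max(0, (context_window - prompt_overhead - reserve_for_answer) // chunk_toks)


# Assumed numbers, for shape only: an 8K window with ~2K of system prompt,
# tool schemas and question, ~1.5K-token chunks, and ~1K reserved for the
# answer leaves room for roughly 3 chunks, with no budget left for a second
# retrieval iteration to recover from a weak first pass.
print(max_chunks(8192, 2000, 1500, 1000))  # -> 3
```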

docs/provider_comparison.md ↗

Request log

cached — previous session · 6 queries
# · Question · Provider · Injection · Chunks · Reranked · PII · Output · Iter · Tokens · Latency · Cost
queries 6 · avg latency 984ms · total tokens 14,220 · total cost $0.0081 · blocked 1