agent-bench
LIVE · FASTAPI + K8S CORPORA · 3 PROVIDERS

Production RAG, benchmarked honestly — including the model-size floor where agentic retrieval breaks down.

A custom tool-calling orchestrator and a LangChain baseline, evaluated on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins — it's where both fail.

Built by Jane Yeung · Munich · Open to AI/ML roles in Germany

API models
1.00
OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5, 27/27 correct citations.
Self-hosted · 7B
0.14
Mistral-7B on 8K context — agentic retrieval can't recover from a weak first pass.
R@5 0.83–0.86 across 4 configs
27 FastAPI + 6 K8s questions
2 corpora · FastAPI · Kubernetes
6.6× cost delta · custom vs LangChain (Anthropic)
Try the demo · Source on GitHub

Live pipeline

Ask a question. Watch every stage — injection check, hybrid retrieval, rerank, iterative tool-calls, LLM synthesis, output validation — with real latencies and token counts.
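For orientation, here is a minimal sketch of how a staged run like this can be traced. The stage names mirror the schematic below; the function names and signatures are illustrative assumptions, not the project's actual API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageTrace:
    name: str
    latency_ms: float


@dataclass
class PipelineRun:
    stages: list = field(default_factory=list)

    def run_stage(self, name, fn, *args, **kwargs):
        # Time each stage so the trace can report per-stage latency.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.stages.append(StageTrace(name, (time.perf_counter() - start) * 1000))
        return result


# Hypothetical usage, one call per stage in the schematic:
# run = PipelineRun()
# ok      = run.run_stage("injection_check", check_injection, question)
# chunks  = run.run_stage("retrieval", hybrid_retrieve, question, top_k=20)
# top5    = run.run_stage("reranking", rerank, question, chunks, top_n=5)
# answer  = run.run_stage("llm_synthesis", tool_call_loop, question, top5, max_iter=3)
# verdict = run.run_stage("output_validation", validate_output, answer)
```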

Provider
Corpus
running on OpenAI · FastAPI corpus
session · local-dev demo data · open live demo ↗ · idle
Pick an example chip above — or type a question. Press Enter to send.
Pipeline idle · schematic
injection_check
regex + classifier, tiered
~3ms
retrieval
FAISS + BM25 + RRF, top-20
~40ms
reranking
cross-encoder, top-5
~60ms
llm_synthesis
tool-calling loop · max 3 iter
~800ms
output_validation
post-stream · monitored, not gated
~12ms
latency · tokens · cost
Retrieval waiting
The top-5 reranked chunks land here, with RRF-normalized scores.
Security · 3 layers
Mapped against the OWASP LLM Top 10 (2025) — named residual risks for LLM01, scope limits for LLM02 → SECURITY.md ↗
Injection
regex + classifier
PII redact
context only
Output
monitored
Try a guardrail
5 of 10 OWASP demoable · 3 infrastructure-layer · 2 out of scope · SECURITY.md has the full mapping

Three findings

27 FastAPI + 6 K8s · custom + langchain · 3 providers
01 / orchestration

Retrieval dominates orchestration.

custom · oai 0.83
langchain · oai 0.86
custom · anth 0.84
langchain · anth 0.84
max spread 0.03

R@5 spans only 0.03 across all four orchestrator × provider configs ({custom, LangChain} × {OpenAI, Anthropic}), all running the same retrieval stack. The orchestration layer is interchangeable; FAISS + BM25 + RRF + cross-encoder is what matters.
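
As a reference point, here is a minimal sketch of the fusion step that carries most of that result: reciprocal-rank fusion over the FAISS and BM25 rankings. The function name and the constant k=60 are illustrative assumptions, not the project's code.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs into one.

    rankings: list of ranked ID lists, e.g. [faiss_ids, bm25_ids].
    Each chunk scores 1 / (k + rank) per list it appears in; the
    constant k damps the influence of any single ranker.
    """
    scores = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]


# Hypothetical usage: fuse the dense and sparse rankings, then hand the
# top-20 candidates to the cross-encoder reranker, which keeps the top 5.
# candidates = reciprocal_rank_fusion([faiss_ranking, bm25_ranking], top_n=20)
```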

comparison_custom_vs_langchain.md ↗
02 / cost

LangChain's Anthropic adapter carries a 6.6× cost tax.

custom $0.0007
langchain $0.0046

Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter, which re-sends an extra system prompt and the tool schemas on every iteration of the loop.
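
A back-of-the-envelope model shows why per-iteration prompt overhead multiplies directly into per-question cost. The numbers in the comment are placeholders for shape only, not measured values.

```python
def run_cost(per_iter_prompt_toks, per_iter_output_toks, price_in, price_out):
    """Total cost of a tool-calling run: sum over iterations of
    prompt tokens * input price plus output tokens * output price."""
    prompt_total = sum(per_iter_prompt_toks)
    output_total = sum(per_iter_output_toks)
    return prompt_total * price_in + output_total * price_out


# Placeholder illustration: with three iterations, every extra block the
# adapter rebuilds into the prompt (verbose system preamble, full tool
# schemas) is billed three times, so a modest per-iteration overhead can
# easily produce a several-fold gap at identical model and retrieval.
```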

docs/provider_comparison.md ↗
03 / model-size floor

There's a model-size floor for agentic retrieval — and a 7B model falls off it.

gpt-4o-mini 1.00
haiku-4-5 1.00
mistral-7B · citation 0.14
mistral-7B · R@5 0.05
Three of the four bars are citation accuracy. The rightmost shows Mistral-7B's R@5 (0.05) on the same axis — both retrieval and citation collapse together.

Not because the model is bad: the 8K context window forces top_k=3 and a single retrieval iteration, so the pipeline can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability.
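
A rough budget makes the constraint concrete. All token counts below are assumptions for illustration, not measured values.

```python
def max_chunks(context_window, prompt_overhead, chunk_toks, reserve_for_answer):
    """How many retrieved chunks fit once the fixed prompt and the
    answer budget are set aside."""
    return max(0, (context_window - prompt_overhead - reserve_for_answer) // chunk_toks)


# Assumed numbers, for shape only: an 8K window with ~2K of system prompt,
# tool schemas and question, ~1.5K-token chunks, and ~1K reserved for the
# answer leaves room for roughly 3 chunks, with no budget left for a second
# retrieval iteration to recover from a weak first pass.
print(max_chunks(8192, 2000, 1500, 1000))  # -> 3
```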

docs/provider_comparison.md ↗

Request log

cached — previous session · 6 queries
# · Question · Provider · Injection · Chunks · Reranked · PII · Output · Iter · Tokens · Latency · Cost
queries 6 · avg latency 984ms · total tokens 14,220 · total cost $0.0081 · blocked 1