Two pipelines evaluated head-to-head: a custom tool-calling orchestrator and a LangChain baseline, run on the same 27-question FastAPI golden set (plus a 6-question Kubernetes set) across OpenAI, Anthropic, and a self-hosted Mistral-7B. Every stage is instrumented. The interesting finding isn't which pipeline wins; it's where both fail.
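For concreteness, here is a minimal sketch of how R@5 over a golden set like this is typically computed. `GoldenItem` and the `(chunk_id, score)` return shape of `retrieve` are illustrative assumptions, not this project's actual API:

```python
# Minimal R@5 sketch: for each golden question, did any of the top-5
# retrieved chunks match a labeled relevant chunk? Names are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenItem:
    question: str
    relevant_chunk_ids: set[str]  # labeled ground truth


def recall_at_k(
    golden: list[GoldenItem],
    retrieve: Callable[[str], list[tuple[str, float]]],  # (chunk_id, score) pairs
    k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = 0
    for item in golden:
        top_ids = {chunk_id for chunk_id, _score in retrieve(item.question)[:k]}
        if top_ids & item.relevant_chunk_ids:
            hits += 1
    return hits / len(golden)
```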
Ask a question. Watch every stage (injection check, hybrid retrieval, rerank, iterative tool calls, LLM synthesis, output validation) with real latencies and token counts.
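A minimal sketch of what per-stage instrumentation can look like: each stage is wrapped so its wall-clock latency (and, for LLM stages, token usage) lands in a trace list. The `instrument` wrapper and trace schema are hypothetical names, not this project's actual interface:

```python
# Illustrative per-stage instrumentation: record latency for every stage,
# and token counts where the stage result carries provider usage data.
import time
from typing import Any, Callable


def instrument(trace: list[dict], name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        trace.append({
            "stage": name,  # e.g. "injection_check", "hybrid_retrieval", "rerank"
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            # token counts come from the provider response when the stage is an LLM call
            "tokens": getattr(result, "usage", None),
        })
        return result
    return wrapped
```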
R@5 varies by only 0.03 across all four {custom, LangChain} × {OpenAI, Anthropic} configurations with identical retrieval stacks. The orchestration layer is interchangeable; the FAISS + BM25 + RRF + cross-encoder retrieval stack is what matters.
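The fusion step in that stack is small enough to sketch inline. The sketch below assumes FAISS (dense) and BM25 (sparse) each return a ranked list of chunk ids, which reciprocal rank fusion merges before the cross-encoder rerank; `rrf_fuse` is an illustrative name, and k=60 is the constant from the original RRF paper:

```python
# Reciprocal rank fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)).
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Merge ranked chunk-id lists into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)[:top_n]


# fused = rrf_fuse([faiss_ids, bm25_ids])  # the cross-encoder then reranks `fused`
```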
See [comparison_custom_vs_langchain.md](comparison_custom_vs_langchain.md). Same model (claude-haiku-4-5), same retrieval, same 27-question FastAPI set. The multiplier comes from LangChain's prompt construction in the Anthropic tool-calling adapter: the system prompt and tool schemas are re-sent on every iteration.
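A back-of-envelope sketch of how that kind of multiplier arises; every number below is assumed for illustration, not measured from this project:

```python
# Hypothetical token accounting: if an adapter re-sends the system prompt
# and tool schemas on every tool-call iteration, input tokens grow linearly
# in that fixed overhead instead of paying it once.
SYSTEM_AND_SCHEMAS = 1_500   # tokens re-sent each iteration (assumed)
PER_ITER_PAYLOAD = 800       # question + retrieved chunks + tool results (assumed)
ITERATIONS = 4

resend_total = ITERATIONS * (SYSTEM_AND_SCHEMAS + PER_ITER_PAYLOAD)   # 9,200
send_once_total = SYSTEM_AND_SCHEMAS + ITERATIONS * PER_ITER_PAYLOAD  # 4,700
print(f"multiplier ~ {resend_total / send_once_total:.1f}x")          # ~2.0x
```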
See [docs/provider_comparison.md](docs/provider_comparison.md). Not because the model is bad, but because the 8K context forces top_k=3 and single-iteration retrieval that can't recover from a weak first pass. This is a context-window + iteration-budget effect, not a claim about Mistral-7B's general capability. The chart above isolates the failure: both layers (retrieval R@5 and citation accuracy) collapse together.
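The arithmetic behind the top_k=3 constraint, sketched with assumed numbers (prompt overhead, chunk size, and answer budget below are illustrative, not measured values from this project):

```python
# Context-budget sketch: how many retrieved chunks fit in an 8K window
# after fixed prompt overhead and a reserved answer budget.
CONTEXT_WINDOW = 8_192        # Mistral-7B context (tokens)
PROMPT_OVERHEAD = 1_200       # system prompt + question + tool schemas (assumed)
CHUNK_TOKENS = 1_500          # average retrieved chunk size (assumed)
ANSWER_BUDGET = 1_024         # tokens reserved for the model's answer (assumed)

top_k = (CONTEXT_WINDOW - PROMPT_OVERHEAD - ANSWER_BUDGET) // CHUNK_TOKENS
print(top_k)  # 3 -- one shot at retrieval, no room left to iterate
```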
| # | Question | Provider | Injection | Chunks | Reranked | PII | Output | Iter | Tokens | Latency | Cost |
|---|----------|----------|-----------|--------|----------|-----|--------|------|--------|---------|------|