
Blog

Evaluating RAG: Metrics for Every Stage of a Production RAG System

Part 1 of the Production RAG series

A RAG system with broken filters can run for months before anyone notices. The pipeline returns answers, the latency dashboards stay green, and the only sign something is wrong is that the answers themselves are subtly wrong. "Subtly wrong" doesn't page anyone.

Better logs won't catch this. Evaluation will, but only if it covers each stage of the pipeline with its own metric. This article is the reference I wish I'd had when I was figuring out which metrics actually matter.
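To make "its own metric per stage" concrete, here is what the retrieval-stage check can look like in isolation: recall@k over labelled relevant chunks. This is a generic sketch, not code from the article, and the document IDs are made up.

```python
# Retrieval-stage metric: recall@k over labelled relevant chunk IDs.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Example: 2 of the 3 relevant chunks made it into the top 5 -> 0.67.
print(recall_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2", "d1", "d8"}, k=5))
```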

The Guardians — Why Agent Security Is Not LLM Safety

Part 4 of the Engineering the Agentic Stack series

In 2024 we shipped guardrails. NeMo Guardrails, Bedrock Guardrails, and a handful of similar products wrapped the input and output of a model call and asked one question: is the model producing the right thing? Toxic output, PII leak, jailbreak, off-topic. Filter, redact, refuse. The threat was easy to see because there were only two places to look: input and output.

Then we gave the model a tool loop, a filesystem, a shell, a Model Context Protocol (MCP) registry, and the authority to act. The threat model changed underneath us and most of the 2024 guardrails didn't notice. Six serious incidents in eighteen months (EchoLeak, the Amazon Q Developer extension compromise, the Azure MCP Server disclosure, Claude Code CVE-2025-59536, the axios 1.14.1 remote-access trojan, and the Trivy Actions tag hijack; each is walked through below) were not addressable by a better output filter. The output was fine. The system was compromised.

The Definitive Guide to NER in 2026: Encoders, LLMs, and the 3-Tier Production Architecture

Two years ago, picking an NER approach meant choosing between speed (encoder models) and accuracy (LLMs). That trade-off is gone. A 300M-parameter GLiNER model matches the zero-shot accuracy of a 13B UniNER while running 100x faster. A newer bi-encoder variant scales to millions of entity types with a 130x throughput advantage over the original cross-encoder. The production pattern that emerged: use LLMs to label data, fine-tune compact encoders, deploy with ONNX or Rust.

I built the companion repo and benchmarked every major approach. Encoders won the production NER market. LLMs are still essential, but as training data generators rather than inference engines. This guide covers the papers, benchmarks, and deployment patterns behind that shift.

Companion repo: ner-field-guide — runnable demos for GLiNER, ONNX export, LLM-as-teacher pipeline, and structured extraction with Instructor.
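For a feel of what the encoder side looks like in practice, here is a generic zero-shot GLiNER call. It is not lifted from the companion repo, and the checkpoint name and threshold are illustrative choices rather than recommendations.

```python
# Zero-shot NER with GLiNER (pip install gliner); checkpoint and threshold are illustrative.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = "Nvidia announced the H200 GPU at SC23 in Denver."
labels = ["organization", "product", "event", "location"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```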

The Hands — Tool Ergonomics and the Agent-Computer Interface

Part 3 of the Engineering the Agentic Stack series

Part 1 covered reasoning loops, Part 2 covered memory. This post is about tools: how agents interact with the world, and why tool design matters more than most teams realize.

The tool market shifted in 2025–2026. MCP became the de facto standard, but production teams keep running into its security gaps and token overhead. The other approach picking up adoption is agents that write and execute code instead of calling JSON-defined tools. Anthropic measured a 98.7% token reduction on a representative workflow, and the CodeAct paper reports 20% higher task success rates.

This post covers all five tool modalities, the ACI design principles that govern them, and practical patterns I apply in the Market Analyst Agent.
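To make the JSON-tool versus code-execution contrast concrete, here is what the two modalities look like side by side. Both snippets are made up for illustration; the schema and the get_price helper are hypothetical, not the Market Analyst Agent's actual tools.

```python
# JSON-defined tool: the model emits arguments, the runtime executes the call.
price_tool_schema = {
    "name": "get_stock_price",
    "description": "Return the latest closing price for a ticker symbol.",
    "parameters": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}

# Code-execution alternative: the agent writes a short script instead, composing
# lookups and control flow in one step (get_price is a hypothetical helper).
agent_script = """
prices = {t: get_price(t) for t in ["NVDA", "AMD", "INTC"]}
print(max(prices, key=prices.get))
"""
```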

The LLM Engineering Field Guide: 45 Concepts Every Practitioner Needs

Production LLM systems pull from GPU hardware, systems engineering, and ML theory all at the same time. The same handful of concepts shows up whether you are tuning TTFT for a chatbot or configuring DeepSpeed ZeRO for a fine-tuning run. This guide collects them in one place.

TL;DR: 45 concepts in eight parts: hardware, inference fundamentals, inference optimizations, model architecture, training and alignment, scaling and deployment, applications, and production operations. Each entry has the definition, why it matters, the numbers, and links to the concepts it connects to. Data is from 2024 through early 2026, with sources.

This guide assumes familiarity with basic ML (backpropagation, gradient descent, softmax) and some systems knowledge (memory hierarchies, networking basics).

The Definitive Guide to OCR in 2026: From Pipelines to VLMs

The model that tops the most popular OCR benchmark ranks near-last when real users judge quality. The field is moving fast enough that the benchmarks haven't caught up — and OCR matters more than it used to, because every RAG pipeline and document assistant has to push text through it first.

Vision-Language Models (VLMs) lead text extraction on complex documents, with 3–4× lower Character Error Rate (CER) than traditional engines on noisy scans, receipts, and distorted text. On the OCR Arena leaderboard in early 2026, Gemini 3 Flash sits at the top, followed by Gemini 3 Pro, Claude Opus 4.6, and GPT-5.2. General-purpose frontier models are now beating dedicated OCR systems on real documents. Open-source models like dots.ocr (1.7B params, 100+ languages) and Qwen3-VL (2B–235B) get you most of the way there at near zero inference cost.
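Character Error Rate is worth pinning down, since it is the number those comparisons hang on: the edit distance between the predicted text and the reference, divided by the reference length. Here is a minimal self-contained sketch (plain dynamic programming, example strings made up):

```python
def cer(reference: str, prediction: str) -> float:
    """Character Error Rate: edits to turn reference into prediction, over reference length."""
    m, n = len(reference), len(prediction)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == prediction[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / max(m, 1)

print(cer("Invoice #4021", "Invoice #4O21"))  # one substituted character -> ~0.077
```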

Traditional engines are still the fastest and cheapest option for clean printed documents, and nothing wins in every scenario. This guide covers the benchmarks, the models, the metrics, and how to actually deploy this in production.

Companion repo: The OCR Gauntlet — a single-notebook demo comparing 5 document types across 3 model tiers (Tesseract → dots.ocr → Gemini 3 Flash) to expose the quality cliffs and cost trade-offs.

Benchmarking Modern Data Processing Engines

For years, Pandas handled in-memory tabular work and Apache Spark handled the distributed kind. That split worked fine when the data was structured.

Modern workloads aren't always structured anymore. Image, audio, and video pipelines now sit alongside the tabular ones, and once GPU inference enters the picture, the JVM's garbage collection and Python's GIL stop being annoyances and start being the bottleneck. That's a big part of why a new generation of engines built on Rust and Apache Arrow has taken off.

I've moved most of my single-node pipelines to Polars over the last year, and I wanted to see how it actually stacks up against the rest of the new stack. So I benchmarked Polars, DataFusion, Daft, Ray Data, and Spark on real datasets — NYC taxi trips for the tabular side, Food-101 images for the multimodal side.

All of the code lives in the engine-comparison-demo repository. Clone it and run the benchmarks on your own hardware — your numbers will be different from mine, and that's the point.
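If you haven't touched Polars yet, the query below gives a feel for the lazy API the tabular benchmarks lean on. It is an illustrative sketch rather than code from the repo; the file name is a placeholder and the columns are the usual yellow-taxi ones.

```python
import polars as pl

# Lazy scan: nothing is read until .collect(), so the optimizer sees the whole query.
lazy = (
    pl.scan_parquet("yellow_tripdata_2024-01.parquet")
      .filter(pl.col("trip_distance") > 0)
      .group_by("passenger_count")
      .agg(
          pl.len().alias("trips"),
          pl.col("total_amount").mean().alias("avg_fare"),
      )
      .sort("trips", descending=True)
)

print(lazy.collect())
```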

The Cortex: Architecting Memory for AI Agents

Part 2 of the Engineering the Agentic Stack series

State is what separates a chatbot from an agent. Without memory, every interaction starts from zero. The agent cannot pause and resume, cannot learn from past sessions, cannot personalize. Part 1 covered the cognitive engine that decides how an agent thinks. This post is about the infrastructure that determines what it remembers.

I'll walk through the memory architecture of the Market Analyst Agent, showing how hot and cold memory layers work together for checkpointing, pause/resume, and cross-session learning. Then I'll cover a third tier — document memory — that doesn't get much theoretical treatment but turns up everywhere in practice.
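As a preview of the hot tier, here is the smallest LangGraph checkpointing example I can write. It is generic LangGraph rather than the agent's actual code, the node is a placeholder, and the checkpointer class may be named differently depending on your langgraph version.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    notes: list[str]

def analyze(state: State) -> State:
    # Placeholder node; the real agent calls tools and an LLM here.
    return {"notes": state["notes"] + ["analysis step ran"]}

builder = StateGraph(State)
builder.add_node("analyze", analyze)
builder.add_edge(START, "analyze")
builder.add_edge("analyze", END)

# Every step is checkpointed under the thread_id, which is what makes pause/resume
# possible. Swap MemorySaver for a SQLite/Postgres saver to survive restarts.
graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "session-42"}}

graph.invoke({"notes": []}, config)
print(graph.get_state(config).values)  # the persisted state for this thread
```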

Building a Modern Search Ranking Stack: From Embeddings to LLM-Powered Relevance

Search stopped being a string-matching problem a while ago. When someone types "wireless headphones" into a product search engine, they expect more than items containing those two words. They want the best result given semantic relevance, product quality, user preferences, and availability. The gap between what BM25 returns and what users actually want has changed how search systems get built.

This post walks through a modern search ranking stack: a multi-stage pipeline combining sparse lexical retrieval, dense semantic embeddings, reciprocal rank fusion, cross-encoder reranking, and LLM listwise ranking. I built a working demo that benchmarks each stage on the Amazon ESCI product search dataset, so every layer's contribution shows up in real numbers.
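One of those layers, reciprocal rank fusion, is small enough to show in full. This is a generic sketch with made-up document IDs; k=60 is the constant the original RRF paper popularized.

```python
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) across rankers."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["p3", "p1", "p7", "p4"]   # sparse lexical ranking
dense_hits = ["p1", "p9", "p3", "p2"]  # dense embedding ranking
print(rrf([bm25_hits, dense_hits]))    # p1 and p3 rise to the top
```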

The Cognitive Engine: Choosing the Right Reasoning Loop

Part 1 of the Engineering the Agentic Stack series

Building useful AI agents is mostly system design now, not prompt engineering. The biggest single decision is how the agent thinks: the reasoning loop you put at the center of it.

This post walks through three loops worth knowing (ReAct, ReWOO, Plan-and-Execute) and how to pick between them. The running example is a Market Analyst Agent I built in LangGraph, with the full code on GitHub.
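To show the shape of the simplest of the three, here is a stripped-down ReAct loop with a scripted stand-in for the model. It is an illustration of the pattern, not the Market Analyst Agent's implementation; the tool and the price are made up.

```python
# ReAct in miniature: reason, act, observe, repeat until the model answers.
import json

TOOLS = {"lookup_price": lambda ticker: {"ticker": ticker, "price": 187.3}}

def fake_llm(messages: list[dict]) -> dict:
    """Stand-in for a real model call: first asks for a tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "lookup_price", "args": {"ticker": "NVDA"}}
    return {"final_answer": "NVDA last traded at 187.3."}

def react(question: str, llm=fake_llm, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = llm(messages)                               # reason: pick an action or answer
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["tool"]](**step["args"])  # act
        messages.append({"role": "tool", "content": json.dumps(observation)})  # observe
    return "Stopped: step budget exhausted."

print(react("What is NVDA trading at?"))
```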