The Guardians — Why Agent Security Is Not LLM Safety
Part 4 of the Engineering the Agentic Stack series
In 2024 we shipped guardrails. NeMo Guardrails, Bedrock Guardrails, and a handful of similar products wrapped the input and output of a model call and asked one question: is the model producing the right thing? Toxic output, PII leak, jailbreak, off-topic. Filter, redact, refuse. The threat was easy to see because there were only two places to look: input and output.
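That input/output pattern can be sketched in a few lines. Everything here is illustrative (the pattern names, the regexes, the `guarded_call` wrapper), not any vendor's actual API; the point is that the only interception points are the prompt going in and the text coming out:

```python
import re

# Naive jailbreak check on the input side; illustrative only.
BLOCKED_INPUT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
# Redact SSN-shaped strings on the output side; illustrative only.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def guarded_call(model, prompt: str) -> str:
    """Wrap a single model call: filter the input, redact the output."""
    # Input rail: refuse obvious jailbreak attempts.
    for pattern in BLOCKED_INPUT_PATTERNS:
        if pattern.search(prompt):
            return "[refused]"
    output = model(prompt)
    # Output rail: redact PII-shaped strings before returning.
    return PII_PATTERN.sub("[redacted]", output)
```

Two choke points, two checks. Nothing in this shape can see a tool call, a file write, or a shell command, which is exactly the gap the rest of this post is about.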
Then we gave the model a tool loop, a filesystem, a shell, a Model Context Protocol (MCP) registry, and the authority to act. The threat model changed underneath us, and most of the 2024 guardrails didn't notice. Six serious incidents in eighteen months (EchoLeak, the Amazon Q Developer extension compromise, the Azure MCP Server disclosure, Claude Code CVE-2025-59536, the axios 1.14.1 remote-access trojan, and the Trivy Actions tag hijack, each walked through below) could not have been prevented by a better output filter. The output was fine. The system was compromised.