
AI Engineering

The Definitive Guide to NER in 2026: Encoders, LLMs, and the 3-Tier Production Architecture

Two years ago, picking an NER approach meant choosing between speed (encoder models) and accuracy (LLMs). That trade-off is gone, and it didn't even put up a fight. A 300M-parameter GLiNER model now matches the zero-shot accuracy of a 13B UniNER while running 100x faster. A newer bi-encoder variant scales to millions of entity types with a 130x throughput advantage over the original cross-encoder. The production pattern that emerged: use LLMs to label data, fine-tune compact encoders, deploy with ONNX or Rust.

I built the companion repo and benchmarked every major approach myself. Encoders have won the production battle. LLMs are now indispensable, not for inference, but as training data generators. This guide covers the papers, benchmarks, and deployment patterns behind that shift.

Companion repo: ner-field-guide, with runnable demos for GLiNER, ONNX export, the LLM-as-teacher pipeline, and structured extraction with Instructor.
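
To make the zero-shot claim concrete, here is a minimal sketch of a GLiNER call using the open-source gliner package; the checkpoint name, label set, and threshold are illustrative, not something the benchmarks above prescribe.

```python
# Minimal zero-shot NER sketch with the gliner package (pip install gliner).
# Checkpoint, text, labels, and threshold below are illustrative.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = "Sundar Pichai announced Gemini at Google I/O in Mountain View."
labels = ["person", "organization", "product", "location"]

# predict_entities returns spans with their text, predicted label, and score
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f'{ent["text"]:<15} -> {ent["label"]} ({ent["score"]:.2f})')
```

Changing the label list is the entire "zero-shot" story: new entity types are just new strings, no retraining required.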

The LLM Engineering Field Guide: 45 Concepts Every Practitioner Needs

Every production LLM system sits at the intersection of GPU physics, systems engineering, and ML theory. Whether you are optimizing Time to First Token for a chatbot or configuring DeepSpeed ZeRO for a fine-tuning run, the same interconnected set of concepts keeps surfacing. This guide distills them into a single practical reference.

TL;DR: I organized the 45 essential concepts every LLM practitioner encounters into eight parts: hardware foundations, inference fundamentals, inference optimizations, model architecture, training and alignment, scaling and deployment, applications, and production operations. Each entry covers what the concept is, why it matters, the key numbers, and how it connects to everything else. All data reflects the 2024–early 2026 landscape with verified references.

This guide assumes familiarity with basic ML concepts (backpropagation, gradient descent, softmax) and some systems knowledge (memory hierarchies, networking basics).

The Definitive Guide to OCR in 2026: From Pipelines to VLMs

The model that tops the most popular OCR benchmark ranks near-last when real users judge quality. That single fact captures the state of OCR in 2026: the field has shifted so fundamentally that our evaluation methods haven't caught up. And it matters more than ever: OCR is now the front door to every RAG pipeline and AI assistant.

Vision-Language Models (VLMs) now dominate text extraction from complex documents, achieving 3–4× lower Character Error Rate (CER) than traditional engines on noisy scans, receipts, and distorted text. The OCR Arena leaderboard (early 2026) tells the definitive story: Gemini 3 Flash leads, followed by Gemini 3 Pro, Claude Opus 4.6, and GPT-5.2. These general-purpose frontier models are outperforming dedicated OCR systems. Meanwhile, open-source models like dots.ocr (1.7B params, 100+ languages) and Qwen3-VL (2B–235B) deliver remarkable quality at near zero cost.
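
For reference, CER is just the character-level edit distance between the model's output and the ground truth, divided by the length of the ground truth. A minimal sketch (the strings are made up):

```python
# Character Error Rate: Levenshtein distance over reference length.
def cer(reference: str, hypothesis: str) -> float:
    r, h = reference, hypothesis
    prev = list(range(len(h) + 1))  # edit distances for the empty prefix of r
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(h)] / max(len(r), 1)

# Two character substitutions in a 13-character receipt line -> CER of about 0.15
print(cer("Total: $42.17", "Tota1: $42.l7"))
```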

Yet traditional engines remain the fastest, cheapest option for clean printed documents, and no single approach dominates every scenario. Let's break down the entire OCR ecosystem today: the benchmarks, the models, the metrics, and how to actually deploy this in production.

🔧 Companion repo: The OCR Gauntlet, a runnable single-notebook demo comparing 5 document types across 3 model tiers (Tesseract → dots.ocr → Gemini 3 Flash) to expose the quality cliffs and cost trade-offs.

The Complete Guide to LLM Fine-Tuning in 2025: From Theory to Production

Fine-tuning has become the secret weapon for building specialized AI applications. While general-purpose models like GPT-4 and Claude excel at broad tasks, fine-tuning transforms them into laser-focused experts for your specific domain. This guide walks you through everything you need to know, from understanding when to fine-tune to deploying your custom model.

LoRAX Playbook - Orchestrating Thousands of LoRA Adapters on Kubernetes

Serving dozens of fine-tuned large language models used to mean provisioning one GPU per model. LoRAX (LoRA eXchange) flips that math on its head: keep a single base model in memory and hot-swap lightweight LoRA adapters per request.

This guide shows you how LoRAX achieves near-constant cost per token regardless of how many fine-tunes you're serving. We'll cover:

  • What LoRA is and why it matters.
  • LoRAX vs. vLLM: When to use which.
  • Kubernetes Deployment: A production-ready Helm guide.
  • API Usage: REST, Python, and OpenAI-compatible examples.
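
As a preview of what per-request hot-swapping looks like, here is a hedged sketch of a call against LoRAX's TGI-style /generate endpoint, where the adapter is selected in the request body; the host URL and adapter name are placeholders.

```python
# Hedged sketch: routing a single request to a specific LoRA adapter on a
# LoRAX server. URL and adapter_id are placeholders for your own deployment.
import requests

LORAX_URL = "http://lorax.internal:8080/generate"  # hypothetical in-cluster service

resp = requests.post(
    LORAX_URL,
    json={
        "inputs": "Summarize this support ticket: ...",
        "parameters": {
            "max_new_tokens": 128,
            # adapter to hot-swap onto the shared base model for this request
            "adapter_id": "my-org/support-summarizer-lora",
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```

The next request can name a different adapter_id and still land on the same GPU, which is where the near-constant cost per token comes from.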

Choosing the Right Open-Source LLM Variant & File Format


Why do open-source LLMs have so many confusing names?

You've probably seen model names like Llama-3.1-8B-Instruct.Q4_K_M.gguf or Mistral-7B-v0.3-A3B.awq and wondered what all those suffixes mean. It looks like a secret code, but the short answer is: they tell you two critical things.

Open-source LLMs vary along two independent dimensions:

  1. Model variant – the suffix in the name (-Instruct, -Distill, -A3B, etc.) describes how the model was trained and what it's optimized for.
  2. File format – the extension (.gguf, .gptq, .awq, etc.) describes how the weights are stored and where they run best (CPU, GPU, mobile, etc.).

The variant determines what the model can do; the format determines where it runs efficiently.

LLM Variant vs Format

Understanding both dimensions helps you avoid downloading 20 GB of the wrong model at midnight and then spending hours debugging CUDA errors.
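
To see how the format half of that rule plays out, here is a hedged sketch of loading the GGUF file from the example above with llama-cpp-python (the llama.cpp bindings, which are CPU- and Metal-friendly); the path and settings are illustrative, and the same -Instruct variant in .awq or .gptq form would instead be served by a GPU engine.

```python
# Hedged sketch: the .gguf format targets llama.cpp runtimes.
# Path, context size, and prompt are illustrative.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # quantized GGUF file
    n_ctx=4096,      # context window
    n_gpu_layers=0,  # 0 = pure CPU; raise to offload layers if a GPU is available
)

out = llm("Q: What does the -Instruct suffix mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```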

Scaling Large Language Models - Practical Multi-GPU and Multi-Node Strategies for 2025

Today's frontier LLMs don't fit on a single GPU. A 70B-parameter model needs ~140 GB just for weights in FP16, nearly 2x what an 80 GB A100 can hold. Training or serving these models means distributing work across multiple GPUs, and doing it wrong wastes most of your compute budget.
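
The arithmetic is worth spelling out, because it drives every sharding decision. A back-of-the-envelope sketch that counts weights only (no KV cache, activations, or optimizer states):

```python
# Weight memory for a 70B model in FP16, and the idealized per-GPU share
# under pure tensor parallelism (real deployments add overhead).
params = 70e9
bytes_per_param = 2  # FP16/BF16
weights_gb = params * bytes_per_param / 1e9
print(f"total weights: {weights_gb:.0f} GB")  # -> 140 GB

for tp in (1, 2, 4, 8):  # tensor-parallel degree
    print(f"TP={tp}: {weights_gb / tp:6.1f} GB per GPU (vs. 80 GB on an A100)")
```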

This guide covers practical strategies for scaling LLMs across multiple GPUs and nodes, drawing from Hugging Face's Ultra-Scale Playbook.