Blog

mHC: How DeepSeek Scaled Residual Connections Without Breaking Training

Modern deep learning rests on the residual connection. After a decade of stacking layers ever deeper, researchers at DeepSeek asked a different question: what if we widened the residual stream instead? Their answer, Manifold-Constrained Hyper-Connections (mHC), fixes a long-standing stability problem with that kind of width scaling.

In this post, I'll walk through the evolution from basic residuals to mHC, explaining why each step was necessary and how DeepSeek's solution actually works at scale.

Enterprise RAG Challenge 3: Winning Approaches for Autonomous AI Agents

The Enterprise RAG Challenge 3 (ERC3) just wrapped up: 524 teams, more than 341,000 agent runs, and only 0.4% of teams hitting a perfect score. With the leaderboard and write-ups now public, I went through the winning solutions to figure out what the top teams did differently.

This post covers what ERC3 is, what the tasks looked like, and the patterns I kept seeing in the architectures that worked.

The Complete Guide to LLM Fine-Tuning in 2025: From Theory to Production

Most fine-tuning projects I've seen fail not in training, but in the steps before and after it: bad data, the wrong base model, no real eval. The actual training is the easy part. This guide is what I wish I'd had before my first serious fine-tune — when to do it, when not to, the methods that work today (LoRA, QLoRA, DoRA, ORPO), and how to get a model from a notebook to a serving endpoint.

Schema-Guided Reasoning on vLLM - Turning LLMs into Reliable Business Logic Engines

A retry loop is a confession. It says: I expected the model to return valid JSON, it didn't, so I'll roll the dice again. Most LLM agent code I've worked with leans heavily on retries, and the JSON still ends up broken often enough that someone wires up an alert for it.

Schema-Guided Reasoning (SGR) skips the retry game by enforcing the schema at the token level. You define the reasoning topology as a Pydantic schema, and the inference engine masks out any token that would violate it before sampling. The output is valid by construction, not by retry.
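To make that concrete, here's a minimal sketch of the pattern against vLLM's OpenAI-compatible server. The schema, endpoint, and model name are illustrative, and the exact guided-decoding parameter can vary by vLLM version.

```python
# Minimal sketch: schema-constrained output from a vLLM OpenAI-compatible server.
# Schema, endpoint, and model name below are illustrative.
from openai import OpenAI
from pydantic import BaseModel


class TicketDecision(BaseModel):
    summary: str      # the "reasoning topology" the model must fill in, field by field
    severity: str
    escalate: bool


client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Customer reports checkout is down."}],
    # vLLM masks any token that would violate this JSON schema before sampling,
    # so the output parses on the first attempt -- no retry loop.
    extra_body={"guided_json": TicketDecision.model_json_schema()},
)

decision = TicketDecision.model_validate_json(resp.choices[0].message.content)
```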

LoRAX Playbook - Orchestrating Thousands of LoRA Adapters on Kubernetes

Serving dozens of fine-tuned large language models used to mean one GPU per model. LoRAX (LoRA eXchange) keeps a single base model in memory and hot-swaps lightweight LoRA adapters per request. Cost per token stays roughly flat as you add fine-tunes.

This guide covers what LoRA is, when to pick LoRAX over vLLM, how to deploy it on Kubernetes with the official Helm chart, and how to call the REST, Python, and OpenAI-compatible APIs.
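As a taste of the per-request swapping, here's a rough sketch against LoRAX's `/generate` endpoint (a TGI-style interface with an added `adapter_id`). The URL and adapter IDs are placeholders.

```python
# Rough sketch: two fine-tunes served from one LoRAX deployment, selected per
# request via adapter_id. URL and adapter IDs below are placeholders.
import requests

LORAX_URL = "http://localhost:8080/generate"


def generate(prompt: str, adapter_id: str | None = None) -> str:
    params = {"max_new_tokens": 64}
    if adapter_id:
        # Only the lightweight adapter changes; the base model stays resident in GPU memory.
        params["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json={"inputs": prompt, "parameters": params}, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]


summary = generate("Summarize this ticket: ...", adapter_id="acme/support-summarizer")
entities = generate("Extract entities: ...", adapter_id="acme/ner-extractor")
```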

Context Engineering in the Agentic‑AI Era — and How to Cook It

TL;DR

Context engineering is the pipeline that decides what the model sees at the moment of decision: instructions, examples, knowledge, memory, skills, tools, guardrails. Most production agents either work or break at this layer. Below is the blueprint I keep coming back to, with the patterns I actually use.

The same agent that aces a 30-minute demo can collapse on day three in production. Almost always, the failure has nothing to do with the model and everything to do with what was in the context when it made the wrong call. A stale memory. A doc that's no longer relevant. A tool description that drifted. A vague instruction.

Context engineering is what you do to stop that from happening: designing what the model sees before each decision, on purpose, with a budget. This article walks through the patterns I use.

Choosing the Right Open-Source LLM Variant & File Format


Why do open-source LLMs have so many confusing names?

You've probably seen names like Llama-3.1-8B-Instruct.Q4_K_M.gguf or Mistral-7B-v0.3-A3B.awq and wondered what the suffixes are for. They look noisy, but they're actually telling you two different things at once.

Open-source LLMs vary along two independent dimensions:

  1. Model variant: the suffix in the name (-Instruct, -Distill, -A3B) describes how the model was trained and what it's optimized for.
  2. File format: the extension (.gguf, .gptq, .awq) describes how the weights are stored and where they run best (CPU, GPU, mobile).

The variant decides what the model can do. The format decides where it can run efficiently.

LLM Variant vs Format

Get those two confused and you'll spend an evening chasing CUDA errors on a model that was never going to fit your card to begin with.
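To see the format side in practice, here's a small loading sketch. The file and repo names are made up; the point is only that the format picks the runtime.

```python
# Sketch: the file format, not the variant, decides which runtime loads the weights.
# File and repo names below are illustrative.

# GGUF -> llama.cpp (CPU / Apple Silicon) via llama-cpp-python:
from llama_cpp import Llama

llm = Llama(model_path="./Llama-3.1-8B-Instruct.Q4_K_M.gguf", n_ctx=4096)
out = llm("Explain LoRA in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

# AWQ -> vLLM on a CUDA GPU (uncomment if you have one):
# from vllm import LLM, SamplingParams
# gpu_llm = LLM(model="some-org/Llama-3.1-8B-Instruct-AWQ", quantization="awq")
# print(gpu_llm.generate(["Explain LoRA in one sentence."],
#                        SamplingParams(max_tokens=64))[0].outputs[0].text)
```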

A Quick Guide to Local Stable Diffusion Toolkits for macOS

A .safetensors file is a few gigabytes of weights. Once you've downloaded one, every Stable Diffusion app on macOS can run it. The thing that varies between apps is everything that wraps the file: the UI, the workflow conventions, how much you have to learn before you produce a usable image.

So picking one is mostly a UX decision, not a model decision. The range goes from drag-and-drop apps you can use without reading anything, up to node graphs that take a weekend to feel comfortable in.