Blog

LoRAX Playbook: Orchestrating Thousands of LoRA Adapters on Kubernetes

Serving dozens of fine-tuned large language models used to mean provisioning one GPU per model. LoRAX (LoRA eXchange) flips that math on its head: keep a single base model in memory and hot-swap lightweight LoRA adapters per request.

This guide shows you how LoRAX achieves near-constant cost per token regardless of how many fine-tunes you're serving. We'll cover:

  • LoRA Basics: What it is and why adapters are cheap enough to serve by the thousand.
  • LoRAX vs. vLLM: When to use which.
  • Kubernetes Deployment: A production-ready Helm guide.
  • API Usage: REST, Python, and OpenAI-compatible examples.
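
For a taste of the API side, here is a minimal sketch of a per-request adapter swap over LoRAX's TGI-style REST endpoint. The server URL and adapter ID below are placeholders; point them at your own deployment and fine-tune.

```python
import requests

# Placeholder endpoint for a local LoRAX deployment.
LORAX_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Summarize the quarterly report in two sentences.",
    "parameters": {
        # Switching fine-tunes is a single field: LoRAX loads this
        # adapter on the fly on top of the shared base model.
        "adapter_id": "my-org/summarizer-lora",  # placeholder adapter
        "max_new_tokens": 64,
    },
}

response = requests.post(LORAX_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```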

Context Engineering in the Agentic-AI Era — and How to Cook It

TL;DR

Context engineering (the context layer) is the pipeline that selects, structures, and governs what the model sees at the moment of decision: Instructions, Examples, Knowledge, Memory, Tools, Guardrails. Agentic systems live or die by this layer. Below is a field-tested blueprint, along with the patterns that make it work.

The problem: You build an agent. It works in demos, fails in production. Why? The model gets the wrong context at the wrong time—stale memory, irrelevant docs, no safety checks, ambiguous instructions.

The fix: Design the context layer deliberately. This guide shows you how.
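
To make that concrete, here is a deliberately simplified Python sketch of the assembly step: gather the six context elements, order them, and trim to a budget. Every name in it is a hypothetical stand-in for whatever retrieval, memory, and guardrail components your stack actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    """The six context elements available at the moment of decision."""
    instructions: str
    examples: list[str] = field(default_factory=list)
    knowledge: list[str] = field(default_factory=list)   # e.g. retrieved docs
    memory: list[str] = field(default_factory=list)      # e.g. summarized history
    tools: list[str] = field(default_factory=list)       # e.g. tool JSON schemas
    guardrails: list[str] = field(default_factory=list)  # e.g. policy reminders

def assemble_context(task: str, bundle: ContextBundle, budget_chars: int = 8000) -> str:
    """Select, order, and trim context blocks into a single prompt."""
    sections = [
        ("Instructions", [bundle.instructions]),
        ("Guardrails", bundle.guardrails),
        ("Examples", bundle.examples),
        ("Knowledge", bundle.knowledge),
        ("Memory", bundle.memory),
        ("Tools", bundle.tools),
    ]
    parts, used = [], 0
    for title, blocks in sections:
        for block in blocks:
            if used + len(block) > budget_chars:
                continue  # skip anything that would blow the context budget
            parts.append(f"## {title}\n{block}")
            used += len(block)
    parts.append(f"## Task\n{task}")
    return "\n\n".join(parts)
```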

Choosing the Right Open-Source LLM Variant & File Format


Why do open-source LLMs have so many confusing names?

You've probably seen model names like Llama-3.1-8B-Instruct.Q4_K_M.gguf or Qwen3-30B-A3B-GPTQ and wondered what all those suffixes mean. They look like a secret code, but the short answer is: they tell you two critical things.

Open-source LLMs vary along two independent dimensions:

  1. Model variant – the suffix in the name (-Instruct, -Distill, -A3B, etc.) describes how the model was trained and what it's optimized for.
  2. File format – the extension or repo tag (.gguf, -GPTQ, -AWQ, etc.) describes how the weights are stored and where they run best (CPU, GPU, mobile, etc.).

Think of it like this: the model variant is the recipe, and the file format is the container. You can put the same soup (recipe) into a thermos, a bowl, or a takeout box (container) depending on where you plan to eat it.

Figure: LLM variant vs. file format.

Understanding both dimensions helps you avoid downloading 20 GB of the wrong model at midnight and then spending hours debugging CUDA errors.
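
Once you know which variant and format you need, you can fetch exactly that one file rather than an entire repository. A minimal sketch with huggingface_hub follows; the repo and file names are illustrative, so check the repository listing for the exact quant file you want.

```python
from huggingface_hub import hf_hub_download

# Illustrative names: pick the repo and quant file that match your hardware.
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # example GGUF repo
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",    # variant + format in one name
)
print(path)  # local cache path of the single downloaded file
```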

A Quick Guide to Local Stable Diffusion Toolkits for macOS

Running generative AI models locally is a game-changer. It means zero cloud costs, no censorship, total privacy, and unlimited experimentation. Whether you're generating character portraits, architectural concepts, or just having fun, your Mac is more than capable of handling the workload thanks to Apple Silicon.

But with so many tools available, where do you start?

Below is a practical guide to the best macOS-ready interfaces. Each tool wraps the same powerful Stable Diffusion models but offers a completely different experience—from "Apple-like" simplicity to "developer-grade" control.

A Quick Guide to Running LLMs Locally on macOS

Running Large Language Models (LLMs) locally on your Mac is a game-changer. It means faster responses, complete privacy, and zero API bills. But with so many tools popping up every week, which one should you choose?

This guide breaks down the top options—from dead-simple menu bar apps to full-control command-line tools. We'll cover what makes each special, their trade-offs, and how to get started.

The Ultimate Guide to pyproject.toml

TL;DR

Think of pyproject.toml as the package.json for Python. It's a single configuration file that holds your project's metadata, dependencies, and tool settings. Whether you use .venv, pyenv, or uv, this one file simplifies development and makes collaboration smoother.
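
Here is a minimal sketch of what that one file can look like; the names, versions, and tool sections are placeholders rather than a recommended setup.

```toml
[project]
name = "my-app"                      # placeholder project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "requests>=2.31",                # runtime dependencies live here
]

[project.optional-dependencies]
dev = ["pytest>=8.0", "ruff>=0.4"]   # extras for development

[build-system]
requires = ["hatchling"]             # one of several valid build backends
build-backend = "hatchling.build"

[tool.ruff]                          # tool settings share the same file
line-length = 100
```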

Mastering Zsh Startup: ~/.zprofile vs ~/.zshrc 🚀

If you've ever wondered why your terminal feels slow, or why your environment variables aren't loading where you expect them to, you're likely battling the Zsh startup order.

The distinction between ~/.zprofile and ~/.zshrc is one of the most common sources of confusion for developers moving to Zsh (especially on macOS).

TL;DR ⚡

  • ~/.zprofile is for Environment Setup. It runs once when you log in (or open a terminal tab on macOS). Put your PATH, EDITOR, and language version managers (like fnm, pyenv) here.
  • ~/.zshrc is for Interactive Configuration. It runs every time you start a new shell instance. Put your aliases, prompt themes, and key bindings here.
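
As a minimal sketch of that split (tool names and paths are examples, not prescriptions):

```zsh
# ~/.zprofile: runs once per login shell (environment setup)
export EDITOR="nvim"                     # example editor
export PATH="$HOME/.local/bin:$PATH"     # prepend a personal bin dir
eval "$(fnm env)"                        # example version-manager hook (see fnm docs)

# ~/.zshrc: runs for every interactive shell (aliases, prompt, key bindings)
alias ll="ls -lah"
bindkey -e                               # Emacs-style line editing
PROMPT='%n@%m %1~ %# '                   # simple prompt
```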

MLOps in the Age of Foundation Models: Evolving Infrastructure for LLMs and Beyond

The field of machine learning has undergone a seismic shift with the rise of large-scale foundation models. From giant language models like GPT-4 to image diffusion models like Stable Diffusion, these powerful models have fundamentally changed how we build and operate ML systems.

In this post, I'll explore how ML infrastructure and MLOps practices have evolved to support foundation models. We'll contrast the "classic" era of MLOps with modern paradigms, examine what's changed, and look at the new patterns and workflows that have emerged. Think of it as upgrading from a standard toolbox to a fully automated factory—the principles are similar, but the scale and complexity are on a different level.