
Blog

LoRAX Playbook - Orchestrating Thousands of LoRA Adapters on Kubernetes

Serving dozens of fine-tuned large language models used to mean provisioning one GPU per model. LoRAX (LoRA eXchange) flips that math on its head: keep a single base model in memory and hot-swap lightweight LoRA adapters per request.

This guide shows you how LoRAX achieves near-constant cost per token regardless of how many fine-tunes you're serving, when it makes sense to use it, how to deploy it on Kubernetes with Helm, and how to call it through REST, Python, and OpenAI-compatible APIs.
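To make the request-level adapter switching concrete, here is a minimal Python sketch of a call against a LoRAX REST endpoint. The URL, adapter ID, and prompt are placeholders, and the exact payload shape may vary with your LoRAX version, so treat this as an illustration rather than a drop-in client.

    import requests

    # Placeholder endpoint for a LoRAX deployment; adjust host/port and
    # adapter ID to match your cluster.
    LORAX_URL = "http://localhost:8080/generate"

    payload = {
        "inputs": "Summarize the quarterly report in two sentences.",
        "parameters": {
            "max_new_tokens": 128,
            # adapter_id selects which LoRA fine-tune is hot-swapped onto the
            # shared base model for this single request.
            "adapter_id": "acme/quarterly-summarizer",
        },
    }

    response = requests.post(LORAX_URL, json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["generated_text"])

Because only the small adapter changes per request, serving a new fine-tune is a matter of referencing a different adapter ID rather than provisioning another GPU.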

Context Engineering in the Agentic‑AI Era — and How to Cook It

TL;DR

Context engineering (the context layer) is the pipeline that selects, structures, and governs what the model sees at the moment of decision: Instructions, Examples, Knowledge, Memory, Tools, Guardrails. Agentic systems live or die by this layer. Below is a field-tested blueprint, along with the patterns to implement it.

The problem: You build an agent. It works in demos, fails in production. Why? The model gets the wrong context at the wrong time—stale memory, irrelevant docs, no safety checks, ambiguous instructions.

The fix: Design the context layer deliberately. This guide shows you how.
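One way to make the six ingredients tangible is to treat them as an explicit object you assemble per decision, rather than an ad-hoc prompt string. The class below is a toy sketch only; the names and rendering are illustrative and not taken from the post.

    from dataclasses import dataclass, field

    @dataclass
    class ContextBundle:
        instructions: str                                     # system / task framing
        examples: list[str] = field(default_factory=list)     # few-shot demonstrations
        knowledge: list[str] = field(default_factory=list)    # retrieved documents
        memory: list[str] = field(default_factory=list)       # prior-turn summaries
        tools: list[dict] = field(default_factory=list)       # tool schemas (passed to the API, not rendered)
        guardrails: list[str] = field(default_factory=list)   # policies and safety checks

        def render(self) -> str:
            """Assemble the text portions into the prompt the model actually sees."""
            sections = {
                "INSTRUCTIONS": [self.instructions],
                "EXAMPLES": self.examples,
                "KNOWLEDGE": self.knowledge,
                "MEMORY": self.memory,
                "GUARDRAILS": self.guardrails,
            }
            return "\n\n".join(
                f"## {name}\n" + "\n".join(items)
                for name, items in sections.items()
                if items
            )

Making the bundle explicit is what lets you audit, test, and govern each ingredient instead of debugging one opaque prompt.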

Building a Custom FeatureStoreLite MCP Server Using uv

A step-by-step guide that shows how to create your own lightweight feature store MCP server from scratch using FastMCP, run it through uv, and integrate it with Claude Desktop. It's a practical example of an MCP server that ML engineers can actually use.
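For a flavor of what the walkthrough builds, here is a heavily simplified FastMCP server sketch. The server name, tool, and feature data are invented for illustration and are not the FeatureStoreLite implementation itself; wiring it up through uv and registering it with Claude Desktop is covered in the full guide.

    from fastmcp import FastMCP

    mcp = FastMCP("featurestore-lite")

    # Toy in-memory "feature store"; a real server would back this with storage.
    FEATURES = {"user_42": {"avg_session_minutes": 12.7, "churn_risk": 0.18}}

    @mcp.tool()
    def get_features(entity_id: str) -> dict:
        """Return the stored feature vector for an entity, or an empty dict."""
        return FEATURES.get(entity_id, {})

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default, which is what Claude Desktop expects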

Choosing the Right Open-Source LLM Variant & File Format


Why do open-source LLMs have so many confusing names?

You've probably seen model names like Llama-3.1-8B-Instruct.Q4_K_M.gguf or Mistral-7B-v0.3-A3B.awq and wondered what all those suffixes mean. The short answer: they tell you two critical things.

Open-source LLMs vary along two independent dimensions:

  1. Model variant – the suffix in the name (-Instruct, -Distill, -A3B, etc.) describes how the model was trained and what it's optimized for.
  2. File format – the extension (.gguf, .gptq, .awq, etc.) describes how the weights are stored and where they run best (CPU, GPU, mobile, etc.).

Think of it like this: the model variant is the recipe, and the file format is the container. Same recipe, different containers for different kitchens.

graph LR
    A[Model Release] --> B[Model Variant<br/>What it does]
    A --> C[File Format<br/>Where it runs]
    B --> D[Base<br/>Instruct<br/>Distill<br/>QAT<br/>MoE]
    C --> E[GGUF<br/>GPTQ<br/>AWQ<br/>EXL2<br/>Safetensors]
    D --> F[Pick based on<br/>your use case]
    E --> G[Pick based on<br/>your hardware]

Understanding both dimensions helps you avoid downloading 20 GB of the wrong model at midnight and then spending hours debugging CUDA errors.
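To see how the file format dictates the loading path, here is a short sketch contrasting a GGUF file (CPU-friendly, via llama-cpp-python) with standard safetensors weights (typically GPU, via transformers). The model names and file paths are placeholders, not recommendations.

    # GGUF: quantized single-file weights, run through llama.cpp bindings.
    from llama_cpp import Llama

    llm = Llama(model_path="models/Llama-3.1-8B-Instruct.Q4_K_M.gguf")
    out = llm("Explain LoRA in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

    # Safetensors: full-precision or bf16 weights, loaded with transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
    )

Same recipe, two containers: the GGUF path fits a laptop CPU, while the safetensors path assumes a GPU (and enough VRAM) is available.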

Quick-guide on Local Stable-Diffusion Toolkits for macOS

Running generative-AI models on-device means zero cloud costs, no upload limits, and full control of your checkpoints. Whether you're generating portraits, concept art, or iterating on product designs, keeping everything local gives you privacy and unlimited generations.

Below is a practical guide to five of the most popular macOS-ready front-ends. Each tool wraps the same underlying Stable Diffusion models but offers different trade-offs between simplicity and power.

Quick-guide on Running LLMs Locally on macOS

Running large language models locally on your Mac means faster responses, complete privacy, and no API bills. But which tool should you pick?

This guide breaks down the five most popular options - from dead-simple menu bar apps to full-control command-line tools. Each comes with download links, what makes it special, and honest trade-offs.

Quick-Guide on pyproject.toml

TL;DR

Think of pyproject.toml as the package.json for Python. It's a single configuration file that holds your project's metadata, dependencies, and tool settings. Whether you use .venv, pyenv, or uv, this one file simplifies development and makes collaboration smoother.
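For orientation, a minimal pyproject.toml might look like the sketch below; the project name, versions, and choice of build backend are placeholders rather than recommendations.

    # Minimal illustrative pyproject.toml; names and versions are placeholders.
    [project]
    name = "my-project"
    version = "0.1.0"
    requires-python = ">=3.11"
    dependencies = ["requests>=2.31"]

    [tool.ruff]          # tool settings live alongside project metadata
    line-length = 100

    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"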

Quick-Guide on ~/.zprofile vs ~/.zshrc 🚀

TL;DR ⚡

  • ~/.zprofile → runs once when you start a login shell (think "environment setup") 🔧
  • ~/.zshrc → runs every time you open an interactive shell (think "daily experience") 🎮

Use both strategically: put your PATH and environment variables in ~/.zprofile, and your aliases, functions, and prompt customizations in ~/.zshrc. A minimal illustration of that split is sketched below.
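The paths, aliases, and prompt here are placeholders; the point is only which file each kind of setting belongs in.

    # ~/.zprofile — one-time environment setup for login shells
    export PATH="$HOME/.local/bin:$PATH"
    export EDITOR="vim"

    # ~/.zshrc — per-interactive-shell experience
    alias ll="ls -lah"
    gpull() { git pull --rebase "$@"; }   # small helper function
    PROMPT='%n@%m %1~ %# '                # prompt customization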

MLOps in the Age of Foundation Models: Evolving Infrastructure for LLMs and Beyond

The field of machine learning has undergone a seismic shift with the rise of large-scale foundation models. From giant language models like GPT-4 to image diffusion models like Stable Diffusion, these powerful models have fundamentally changed how we build and operate ML systems.

In this post, I'll explore how ML infrastructure and MLOps practices have evolved to support foundation models. We'll contrast the "classic" era of MLOps with modern paradigms, examine what's changed, and look at the new patterns and workflows that have emerged.