Best Local LLM Tools for macOS
Local LLM tooling on macOS splits into four jobs: simple serving, GUI exploration, low-level inference control, and Apple Silicon-native experimentation. One tool does not need to do all four.
My default: use Ollama for a local API. Use LM Studio for model exploration. Use llama.cpp when you need GGUF runtime control. Use MLX when you want Apple Silicon-native Python work close to the model.
Recommendation table
| Tool | Best at | Use when | Main trade-off |
|---|---|---|---|
| Ollama | Simple local model server and model lifecycle | You want a local endpoint quickly | Less low-level control than llama.cpp. |
| LM Studio | GUI chat, model discovery, and local server workflows | You want to compare models without writing glue code | Desktop abstraction hides runtime details. |
| llama.cpp | GGUF inference, quantization, server flags, Metal control | You need control over context, batch, quantization, and runtime behavior | More setup and more flags. |
| MLX | Apple Silicon-native arrays and model workflows | You want Python-level experiments on M-series Macs | Smaller serving ecosystem than Ollama or llama.cpp. |
Which one should you install first?
Install Ollama first if you are building software. Many apps know how to talk to it, and the local API is enough for prototypes, tests, and small internal tools. It is the shortest path from "I need a local model" to "my app can call a local model."
Install LM Studio first if you are choosing a model. It is good for browsing models, changing settings, comparing outputs, and running an OpenAI-compatible local server without designing the workflow yourself.
Install llama.cpp first if you care about the mechanics of inference. Context length, quantization, Metal flags, prompt processing, batch sizes, and server behavior are easier to inspect when you are closer to the runtime.
Use MLX when the work is not just serving a chat model. It fits Apple Silicon-native model experiments, conversion, fine-tuning, and Python workflows where unified memory is part of the design.
Workflow matrix
| Workflow | Default | Why |
|---|---|---|
| Local API for an app | Ollama | Stable developer ergonomics and broad integration support. |
| Manual model comparison | LM Studio | GUI makes prompt and model comparison faster. |
| Performance debugging | llama.cpp | You can see and control the runtime knobs. |
| Quantized GGUF model serving | llama.cpp or Ollama | Use llama.cpp for control, Ollama for convenience. |
| Apple Silicon model experiments | MLX | Native array framework and model tooling for M-series Macs. |
| Nontechnical stakeholder demo | LM Studio | Easy to show and adjust interactively. |
| Repeatable engineering setup | Ollama plus a pinned model list | Easier to script than a GUI-only workflow. |
Hardware notes
Unified memory is the real constraint on Apple Silicon. A model that fits on a 64 GB MacBook Pro can be unusable on an 8 GB MacBook Air. Quantization helps, but context length can quietly dominate memory. Benchmark the actual prompt shape instead of the model name alone.
For small local tools, a 7B or 8B class model is often more useful than an overloaded larger model. For coding, long context and tool integration may matter more than raw benchmark rank. For document QA, retrieval quality usually dominates local model choice.
What not to do
Do not turn local LLM setup into a permanent benchmark project unless performance is the product. Start with Ollama or LM Studio. Prove that local inference helps. Then move down to llama.cpp or MLX when you have a concrete reason.
Do not compare models only in a chat UI if the real workload is structured extraction, code editing, or RAG answer synthesis. Write a tiny eval script with representative prompts.
Deeper reading
- Local LLMs on macOS covers the hands-on setup.
- Open-Source LLM Variants and File Formats explains GGUF, GPTQ, AWQ, base models, and instruct models.
- MacBook Setup for AI Engineering covers the broader workstation setup.