Ir para o conteúdo

Best Local LLM Tools for macOS

Local LLM tooling on macOS splits into four jobs: simple serving, GUI exploration, low-level inference control, and Apple Silicon-native experimentation. One tool does not need to do all four.

My default: use Ollama for a local API. Use LM Studio for model exploration. Use llama.cpp when you need GGUF runtime control. Use MLX when you want Apple Silicon-native Python work close to the model.

Recommendation table

Tool Best at Use when Main trade-off
Ollama Simple local model server and model lifecycle You want a local endpoint quickly Less low-level control than llama.cpp.
LM Studio GUI chat, model discovery, and local server workflows You want to compare models without writing glue code Desktop abstraction hides runtime details.
llama.cpp GGUF inference, quantization, server flags, Metal control You need control over context, batch, quantization, and runtime behavior More setup and more flags.
MLX Apple Silicon-native arrays and model workflows You want Python-level experiments on M-series Macs Smaller serving ecosystem than Ollama or llama.cpp.

Which one should you install first?

Install Ollama first if you are building software. Many apps know how to talk to it, and the local API is enough for prototypes, tests, and small internal tools. It is the shortest path from "I need a local model" to "my app can call a local model."

Install LM Studio first if you are choosing a model. It is good for browsing models, changing settings, comparing outputs, and running an OpenAI-compatible local server without designing the workflow yourself.

Install llama.cpp first if you care about the mechanics of inference. Context length, quantization, Metal flags, prompt processing, batch sizes, and server behavior are easier to inspect when you are closer to the runtime.

Use MLX when the work is not just serving a chat model. It fits Apple Silicon-native model experiments, conversion, fine-tuning, and Python workflows where unified memory is part of the design.

Workflow matrix

Workflow Default Why
Local API for an app Ollama Stable developer ergonomics and broad integration support.
Manual model comparison LM Studio GUI makes prompt and model comparison faster.
Performance debugging llama.cpp You can see and control the runtime knobs.
Quantized GGUF model serving llama.cpp or Ollama Use llama.cpp for control, Ollama for convenience.
Apple Silicon model experiments MLX Native array framework and model tooling for M-series Macs.
Nontechnical stakeholder demo LM Studio Easy to show and adjust interactively.
Repeatable engineering setup Ollama plus a pinned model list Easier to script than a GUI-only workflow.

Hardware notes

Unified memory is the real constraint on Apple Silicon. A model that fits on a 64 GB MacBook Pro can be unusable on an 8 GB MacBook Air. Quantization helps, but context length can quietly dominate memory. Benchmark the actual prompt shape instead of the model name alone.

For small local tools, a 7B or 8B class model is often more useful than an overloaded larger model. For coding, long context and tool integration may matter more than raw benchmark rank. For document QA, retrieval quality usually dominates local model choice.

What not to do

Do not turn local LLM setup into a permanent benchmark project unless performance is the product. Start with Ollama or LM Studio. Prove that local inference helps. Then move down to llama.cpp or MLX when you have a concrete reason.

Do not compare models only in a chat UI if the real workload is structured extraction, code editing, or RAG answer synthesis. Write a tiny eval script with representative prompts.

Deeper reading

References